elastic / kibana


[Security Solution] Collect ideas for reducing flakiness and increasing speed of Cypress tests #153655

Open banderror opened 1 year ago

banderror commented 1 year ago

Epic: https://github.com/elastic/kibana/issues/153633

Summary

Our E2E Cypress tests continue to be flaky and are slow both locally and on CI. Let's approach this problem by first creating a list of ideas for what we can do to fix this.

Please add your ideas in the comments. I'll be adding them to the list below.

I'll start with a list of ideas we wrote down while planning the 8.8 release cycle.

Ideas

elasticmachine commented 1 year ago

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine commented 1 year ago

Pinging @elastic/security-solution (Team: SecuritySolution)

banderror commented 1 year ago

There were some ideas posted in https://github.com/elastic/kibana/pull/141069, some related to flakiness and performance, and some to maintainability:

PhilippeOberti commented 1 year ago

Tests should be isolated from each other. One test should not depend on anything done by another test

on the same note, I would add that disabling test isolation (`testIsolation: false`) when possible would drastically lower test run time. This requires making sure each test cleans up after itself so it goes back to the state it was in when it started. That guarantees that each test can still be run individually.

PhilippeOberti commented 1 year ago

We could also add that running each test multiple times locally can bring flakiness to light. We can leverage Cypress' `_.times` method to do that. In the past I've used this to run tests 25, 50 or even 100 times (which can be very time consuming for developers..., but the higher the count, the more confident we are that the test isn't flaky).

MadameSheema commented 1 year ago

Here are some thoughts about the presented ideas:

  • By default, rules should be created in a disabled state. Only some tests would need a rule to be enabled, e.g. tests for the rule execution logic. These tests will need to enable rules explicitly.

For the cases where we need alerts to be generated, we should check whether it is faster to create the rule enabled by default or to create the rule disabled and then enable it.

  • Real prebuilt rules (700+ rules) should be installed only in a very limited number of E2E tests verifying prebuilt rule workflows. All other tests should either create custom rules or create a small number of mock prebuilt rules.

100% agree with this since this is making some tests painfully slow.

  • Fleet packages should be installed only in a very limited number of E2E tests.

100% agree

  • CSS selectors: we should leverage data-test-subj attributes instead of by class names, ids, and tags.

100% agree
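As an illustration, a tiny helper (hypothetical name; Kibana's Cypress code has its own utilities for this) keeps `data-test-subj` selector construction in one place:

```javascript
// Hypothetical helper: build a selector from a data-test-subj value,
// so tests never fall back to classes, ids, or tag names.
const getDataTestSubjectSelector = (subj) => `[data-test-subj="${subj}"]`;

// Usage in a test would look like:
// cy.get(getDataTestSubjectSelector('alertsTable')).should('be.visible');
```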

  • For data setup, we should always prefer using API calls instead of actions made in the UI.

100% agree. I would also add: minimize the use of es_archiver. Try to use the default data (the data loaded at the very beginning); if this set of data is not enough, create a really small archive, the smaller the better (1 or 2 docs if possible).

  • Cleanup should be done in before/beforeEach instead of after/afterEach blocks.

100% agree. cleanKibana should be used JUST in before hooks (not beforeEach) and only when necessary, because this cleanup is somewhat expensive in execution time. We might also consider removing cleanKibana entirely to force people to clean just the things they need, making sure each test starts in exactly the clean state it wants.

  • Tests should be isolated from each other. One test should not depend on anything done by another test.

100% agree. Following up a bit on @PhilippeOberti's comment above: since the Cypress version upgrade done over Christmas, Cypress by default cleans the browser state, forcing us to re-visit the page.

Revisiting the page each time is a huge cost from the execution perspective. There are some specific situations where we don't need to clean the browser state each time, and the test will still be isolated/independent without having to reload the page again and again. Happy to discuss this point if you don't agree :)

  • Best practices regarding reusable helpers to manipulate tables, filters, etc. to make the tests much more readable (including a transparent folder and file structure)
  • Tests should be driven by code instead of mock data. They should do only a minimal number of actions to achieve the result.

Yup! yup!! :)

@PhilippeOberti regarding flakiness, I believe it is better if we use the flaky test runner.

@banderror this is what I wrote on the Cypress readme about these topics:

### Write easy to maintain tests
Consider extracting all the elements you need to interact with to the `screens` folder. That way, if a locator changes, we only need to update the value in one place.

### Write easy to read tests
Consider extracting all the tasks a user should perform into the `tasks` folder. That way it is easier to understand what we are trying to mimic from the user's perspective. Also, if the way the user has to perform the action changes, we only need to update the value in one place.

### Make sure your test fails
Before opening a PR with a new test, please first make sure that the test fails. If you never see your test fail, you don’t know if your test is actually testing the right thing, or testing anything at all.

### Minimize the use of es_archiver
When possible, create all the data that you need for executing the tests using the application APIs.

### Speed up test execution time
Loading the web page takes a significant amount of time. In order to minimize that impact, the following points should be
taken into consideration until another solution is implemented:

- Group the tests that are similar in different contexts.
- For every context login only once, clean the state between tests if needed without re-loading the page.
- All tests in a spec file must be order-independent.
- Clean up the state and data only when needed, using the `cleanKibana` function. Executing this function takes a lot of time, so consider whether you really need to clean the data before the execution. E.g. if you are just checking that a modal can be opened, you may not need to clean the data.

Remember that by minimizing the number of times the web page is loaded, we also minimize the execution time.
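A minimal sketch of the `screens`/`tasks` split described above (file, selector, and function names are made up for illustration; `cy` is passed in as a parameter here only to keep the sketch testable):

```javascript
// screens/rules_table.js -- selectors live in one place.
const REFRESH_BUTTON = '[data-test-subj="refreshRulesAction"]';
const RULES_TABLE = '[data-test-subj="rules-table"]';

// tasks/rules_table.js -- user actions composed from screen selectors,
// so tests read as a sequence of user intentions.
const refreshRulesTable = (cy) => {
  cy.get(REFRESH_BUTTON).click();
  cy.get(RULES_TABLE).should('be.visible');
};
```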

To all the above I would also add:

MadameSheema commented 1 year ago

About increasing speed: we currently have a test file that takes around 20 minutes to execute on CI. We should treat execution times above a certain threshold as a red flag that something can probably be done in a different way.

marshallmain commented 1 year ago
  • By default, rules should be created in a disabled state. Only some tests would need a rule to be enabled, e.g. tests for the rule execution logic. These tests will need to enable rules explicitly.
  • By default, rules should have a large enough interval so they run only once during test execution.

I would add a third idea following these 2, that we should prefer to load an alerts index in most cases where we want to test alerts functionality, such as opening/closing alerts, viewing the details, etc, instead of creating and running a real rule to generate the alerts. There are a couple benefits to this approach: (1) running the rules is slow, so we'd speed things up, (2) reducing the number of steps in each test also makes the tests easier to read/maintain/debug when something goes wrong, and (3) the UI has to support alerts from previous versions anyway, so having a static archive of alerts from some old 8.x version could help verify our UI supports old alerts. We have some ES Archives of alerts from legacy version already, and somewhere (I'll have to search for it later) there's a tool Madi created to generate more of these fixtures for different Kibana versions.

Separately, I would like to reduce the usage of force: true in Cypress tests. I see force: true in the same way as type casting in TypeScript - useful in some cases, so a complete ban on usage is going too far, but it should ideally be used sparingly and with a comment on every usage explaining why force: true is necessary in that spot. In order to reduce usage of force: true I think we'll have to fix the corresponding elements in the UI, so this problem is not isolated to the tests themselves - but IMO it's a high priority to address and will make writing new Cypress tests easier if devs don't have to go through painful trial and error adding force: true in the right places and figuring out if it's hiding a real bug or not.

MadameSheema commented 1 year ago

I would add a third idea following these 2, that we should prefer to load an alerts index in most cases where we want to test alerts functionality, such as opening/closing alerts, viewing the details, etc, instead of creating and running a real rule to generate the alerts. There are a couple benefits to this approach: (1) running the rules is slow, so we'd speed things up, (2) reducing the number of steps in each test also makes the tests easier to read/maintain/debug when something goes wrong, and (3) the UI has to support alerts from previous versions anyway, so having a static archive of alerts from some old 8.x version could help verify our UI supports old alerts. We have some ES Archives of alerts from legacy version already, and somewhere (I'll have to search for it later) there's a tool Madi created to generate more of these fixtures for different Kibana versions.

I would first check which way is faster; the es_archiver loading tends to be a bit slower. I would do a POC with the different solutions for the same test and go for the fastest one. If the fastest solution is the archiver one, we need to take it into consideration and keep in mind to update the archives when there is a major change.

If we want to test legacy versions, then we absolutely need an archive here; in this case, as I said previously, ideally with just one doc to minimize the loading time.
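The POC idea can be as simple as timing each setup strategy and comparing the results (a sketch with hypothetical, stubbed setup functions; not Kibana code):

```javascript
// Time an async setup strategy so alternatives can be compared in a POC.
const timeSetup = async (label, setupFn) => {
  const start = Date.now();
  await setupFn();
  return { label, ms: Date.now() - start };
};

// Run each candidate strategy (e.g. "load via es_archiver" vs "create data
// via API calls") and return the label of the fastest one.
const comparePoc = async (strategies) => {
  const results = [];
  for (const [label, fn] of Object.entries(strategies)) {
    results.push(await timeSetup(label, fn));
  }
  return results.sort((a, b) => a.ms - b.ms)[0].label;
};
```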

Separately, I would like to reduce the usage of force: true in Cypress tests. I see force: true in the same way as type casting in TypeScript - useful in some cases, so a complete ban on usage is going too far, but it should ideally be used sparingly and with a comment on every usage explaining why force: true is necessary in that spot. In order to reduce usage of force: true I think we'll have to fix the corresponding elements in the UI, so this problem is not isolated to the tests themselves - but IMO it's a high priority to address and will make writing new Cypress tests easier if devs don't have to go through painful trial and error adding force: true in the right places and figuring out if it's hiding a real bug or not.

I agree with this. The use of {force: true} started because when a "regular" click was performed on an element, Cypress threw an error about the element being detached from the DOM; using the flag performs the click and does not throw the error. The problem is that lately it has been used by default, without thinking about the consequences.

banderror commented 1 year ago

Tests should be isolated from each other. One test should not depend on anything done by another test

on the same note, I would add that disabling test isolation (`testIsolation: false`) when possible would drastically lower test run time. This requires making sure each test cleans up after itself so it goes back to the state it was in when it started. That guarantees that each test can still be run individually.

@PhilippeOberti Could you expand on this a bit? Are you saying we should disable test isolation (testIsolation: false)? Here the tradeoffs are explained. Having the goal of reducing flakiness, I feel that disabling it might lead to the opposite result.

banderror commented 1 year ago

We could also add that running each test multiple times locally can bring flakiness to light. We can leverage Cypress' `_.times` method to do that. In the past I've used this to run tests 25, 50 or even 100 times (which can be very time consuming for developers..., but the higher the count, the more confident we are that the test isn't flaky).

@PhilippeOberti Thanks, that's an interesting suggestion. I guess one would primarily use it locally when working on a test to make sure it's stable? Do you mind sharing a code example? What would the development workflow look like compared to using the flaky test runner?

banderror commented 1 year ago

Revisiting the page each time is a huge cost from the execution perspective. There are some specific situations where we don't need to clean the browser state each time, and the test will still be isolated/independent without having to reload the page again and again. Happy to discuss this point if you don't agree :)

@MadameSheema Definitely can't say I agree haha, but curious to learn more about your thoughts on that.

banderror commented 1 year ago

For the cases where we need alerts to be generated, we should check whether it is faster to create the rule enabled by default or to create the rule disabled and then enable it.

@MadameSheema Yeah, and creating an enabled rule right away would be faster. What I meant is that the test should explicitly specify that this rule is enabled before calling the create endpoint. Pseudo-code:

```js
createCustomQueryRule({
  name: 'Will run right away',
  enabled: true,
});
```

So most of the tests would not specify it, thus getting a disabled rule by default:

```js
createCustomQueryRule({
  name: 'Will be disabled',
});
```
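A minimal sketch of how such a helper could implement the disabled-by-default behavior (hypothetical function name and fields; the real helper would send this payload to the rule-creation API):

```javascript
// Hypothetical payload builder: rules are disabled unless a test opts in.
const buildCustomQueryRulePayload = ({ name, enabled = false, ...overrides }) => ({
  name,
  enabled,
  type: 'query',
  query: '*:*',
  risk_score: 21,
  severity: 'low',
  ...overrides,
});
```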
banderror commented 1 year ago

I would also add: minimize the use of es_archiver. Try to use the default data (the data loaded at the very beginning); if this set of data is not enough, create a really small archive, the smaller the better (1 or 2 docs if possible).

@MadameSheema +++, I'll add it to the list. There's a good old issue related to that, search for "archives" and "Large input data sets".

cleanKibana should be used JUST in before hooks (not beforeEach) and only when necessary, because this cleanup is somewhat expensive in execution time.

Agree 👍 Will add it to the list.

We might also consider removing cleanKibana entirely to force people to clean just the things they need, making sure each test starts in exactly the clean state it wants.

I'd measure the impact of cleanKibana first before doing that. I guess it shouldn't be a monster and shouldn't do cleanup actions irrelevant for most of the tests. If it's light and minimal, I'd keep it. Otherwise, I'd extract from it optional cleanup actions as reusable functions.

A behaviour should be tested at the lowest possible layer of the testing pyramid: Unit > API > Cypress. (Most of the behaviour should be covered by unit, API, and component testing; we should use Cypress only for the most complex scenarios.)

90% agree with that, except I'd say "most of the behavior should be covered by integration tests" 😬 Having 10000 unit tests won't make your app stable if it's not well architected and the actual business logic is not covered by integration tests. ~Let the holy war begin!~

Some thoughts on that from the guy who created React Testing Library: https://kentcdodds.com/blog/the-testing-trophy-and-testing-classifications

banderror commented 1 year ago

I would add a third idea following these 2, that we should prefer to load an alerts index in most cases where we want to test alerts functionality, such as opening/closing alerts, viewing the details, etc, instead of creating and running a real rule to generate the alerts. There are a couple benefits to this approach: (1) running the rules is slow, so we'd speed things up, (2) reducing the number of steps in each test also makes the tests easier to read/maintain/debug when something goes wrong, and (3) the UI has to support alerts from previous versions anyway, so having a static archive of alerts from some old 8.x version could help verify our UI supports old alerts. We have some ES Archives of alerts from legacy version already, and somewhere (I'll have to search for it later) there's a tool Madi created to generate more of these fixtures for different Kibana versions.

@marshallmain ++++, fully agree if we're talking about the features you mentioned. I'd add one more benefit: if rule execution logic breaks, these tests won't fail with a false negative result, and will still be testing the alerts management functionality. I think this approach is quite universal: if you need alert documents to test the alerts table, generate fake alerts instead of creating and running rules that would generate real alerts; if you need to test the rules table, create fake prebuilt rules instead of installing a Fleet package with real prebuilt rules; especially don't use UI for that, because that's how you start testing areas of the app completely irrelevant for the given test. Which increases flake and false negatives.

Regarding ES archiver, agree with @MadameSheema that by default we should avoid it. In the case of alerts generation for data setup, we could have a bunch of fake alerts in the code, they would be strongly typed because we have the schema, and would be indexed from the code. In some cases like testing the app with historical alerts data, we could leverage ES archives.
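A rough sketch of "fake alerts indexed from code": the field names below follow the alert schema, but this is an abbreviated, hypothetical generator, and the real documents contain many more fields.

```javascript
// Hypothetical generator: produce fake alert documents for bulk indexing
// during test setup, instead of creating and running a real rule.
const generateFakeAlerts = (count, { ruleName = 'Test rule' } = {}) =>
  Array.from({ length: count }, (_, i) => ({
    '@timestamp': new Date().toISOString(),
    'kibana.alert.rule.name': ruleName,
    'kibana.alert.status': 'active',
    'kibana.alert.uuid': `fake-alert-${i}`,
  }));
```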

Don't personally have a strong opinion on {force: true} but will add it to the list.

Thanks all for your suggestions ❤️

PhilippeOberti commented 1 year ago

on the same note, I would add that disabling test isolation (`testIsolation: false`) when possible would drastically lower test run time. This requires making sure each test cleans up after itself so it goes back to the state it was in when it started. That guarantees that each test can still be run individually.

@PhilippeOberti Could you expand on this a bit? Are you saying we should disable test isolation (testIsolation: false)? Here the tradeoffs are explained. Having the goal of reducing flakiness, I feel that disabling it might lead to the opposite result.

you are absolutely correct, in terms of reducing flakiness, using { testIsolation: false } isn't the right move. I was mentioning this because flakiness is only one of the challenges we're facing right now, speed being the other. While this ticket only focuses on flakiness, I don't want us to make decisions on guidelines for writing Cypress tests without having a broader picture. In some files, using { testIsolation: false } reduces execution time by 4-5x, sometimes even more (I've seen many places where executing a file went from 50+ seconds to under 15 seconds).

Just want to make sure we're taking that into account :)

banderror commented 1 year ago

In some files, using { testIsolation: false } reduces execution time by 4-5x, sometimes even more (I've seen many places where executing a file went from 50+ seconds to under 15 seconds).

@PhilippeOberti Oh, I didn't realize it's possible to override it for individual tests! Is this how you do it? And yeah, this optimization could make sense for some of the tests -- could you share a few examples where it reduced the test execution time?

PhilippeOberti commented 1 year ago

@PhilippeOberti Oh, I didn't realize it's possible to override it for individual tests! Is this how you do it? And yeah, this optimization could make sense for some of the tests -- could you share a few examples where it reduced the test execution time?

I've only used it at the top-level describe, though using it at the test level sometimes makes sense as well. This PR is an example of a refactor I recently did that achieved a 40-50% reduction in run time. You'll also see there how I clean up each test so that the next one starts with the same state.
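For reference, the suite-level override in Cypress 12+ looks like this (a configuration fragment for the test runner, not runnable outside Cypress; the suite and test names are made up):

```js
// Disable test isolation for this suite only; each test must then clean up
// after itself so the specs stay order-independent.
describe('rules table', { testIsolation: false }, () => {
  before(() => {
    // login and visit the page once for the whole suite
  });

  it('filters rules by name', () => {
    // ...
  });
});
```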

PhilippeOberti commented 1 year ago

We could also add that running each test multiple times locally can bring flakiness to light. We can leverage Cypress' `_.times` method to do that. In the past I've used this to run tests 25, 50 or even 100 times (which can be very time consuming for developers..., but the higher the count, the more confident we are that the test isn't flaky).

@PhilippeOberti Thanks, that's an interesting suggestion. I guess one would primarily use it locally when working on a test to make sure it's stable? Do you mind sharing a code example? What would the development workflow look like compared to using the flaky test runner?

This isn't necessarily ideal but it's a quick way to check if a test is flaky or not. At a past job we required a screenshot in the description of each PR that was purely focused on Cypress tests, to show that the tests had successfully run a bunch of times with no failure... a bit like I did for this PR. I wasn't aware of the flaky test runner so I've never used it. I feel like that runner is more for after the fact (meaning the PR is already open). What I'm suggesting would be done during development on our local machines, so it can help catch things a bit earlier...

Using it is pretty simple: just wrap a specific test as follows:

```js
describe('Description', () => {
  Cypress._.times(50, () => {
    it('should do this', () => {
      // ...
    });
  });
});
```

This would run the test 50 times.

Not saying we should use this; it's just a nice, quick and easy way to verify things.

banderror commented 1 year ago

Thank you for your suggestions @PhilippeOberti, I think it all makes sense 👍 I'll add both to the list.

michaelolo24 commented 1 year ago

Just adding a note that testing on our M1 Macs can produce different results from what CI produces, so it might be worth providing guidance around using stress or some other tool to simulate a system under load.

vitaliidm commented 1 year ago

Additional thoughts:

oatkiller commented 1 year ago

Having 10000 unit tests won't make your app stable if it's not well architected and actual business logic in integration is not covered by tests.

@banderror Perhaps refactoring some code to make it easier to test with unit tests could be considered.

jpdjere commented 1 year ago

When working with prebuilt rules in Cypress tests, avoid installing the whole security_detection_engine package wherever possible. Its installation is a big cause of flakiness, and many of the tests where we had been installing it don't actually need the whole package.

Instead:

  1. prevent the installation of the package that happens in the background when the user navigates to the Rules Management or Add Elastic Rules page
  2. create mock security-rule assets
  3. install them - optionally, depending on the use case.

All of this logic has been encapsulated into a single utility function that can be used: createAndInstallMockedPrebuiltRules https://github.com/elastic/kibana/pull/155241/files#diff-169d96784f5cd0bcb2e2ba73809595ece4e56534bc447d02fababdc1cdc41e9eR186 (PR pending merge)
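As a hypothetical sketch (the real `security-rule` saved objects and the linked helper contain more fields and logic), a mock prebuilt rule asset could look roughly like:

```javascript
// Hypothetical mock of a prebuilt rule asset, enough for rules-management
// tests that don't need the real security_detection_engine package.
const createMockSecurityRuleAsset = ({ ruleId, name, version = 1 }) => ({
  type: 'security-rule',
  id: `security-rule:${ruleId}`,
  attributes: {
    rule_id: ruleId,
    name,
    version,
    type: 'query',
    query: '*:*',
    risk_score: 21,
    severity: 'low',
  },
});
```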

jpdjere commented 1 year ago

Avoid the usage of esArchiverResetKibana wherever possible; prefer cleaning the state using other utilities.

While it should be safe to use, it has caused flakiness issues in unrelated test suites that need data loaded into indices through the esArchiverLoad helper. If this happens, it is usually an indication that the affected test's setup is incorrect, as using esArchiverResetKibana in another test suite should not affect it if it is correctly set up.