[Security Solution] Preparing Cypress for the second quality gate - latest steps

MadameSheema commented 9 months ago

Context

We have already done a lot of work around Cypress to prepare the tests to be executed successfully on the second quality gate:

- We have created specific executions per area team in Buildkite to improve the visibility and ownership of the tests in case of failure
- We have created the logic to mimic our current Buildkite executions on MKI
- We have cleaned up the code in order to have more reliable and robust tests:
  - https://github.com/elastic/kibana/pull/170636
  - https://github.com/elastic/kibana/pull/169563
  - https://github.com/elastic/kibana/pull/172140
  - https://github.com/elastic/kibana/pull/172472
- We have provided developers an easy way to use Cypress with MKI on their local machines
- We have implemented the ability to log in to serverless using SAML

With all of the above, we are extremely close to having our tests ready to be integrated with the second quality gate, but there is still some work to finish before the integration.

Final Steps

We need to take into consideration that any failure on the second quality gate is going to block a deployment to production. This is why we need to be extremely careful about the robustness of our tests. To guarantee that our tests are robust and to minimise the risk of blocking releases to production, the following actions should be performed:

Skip flaky or non-working tests on MKI
Guarantee the retriability of our tests
Have several green executions in a row without flakiness
Integration with the quality gate

Skip flaky or non-working tests on MKI

Before integrating our tests with the Kibana release quality gate, we need to make sure we have green executions. We want to reach that point as soon as possible, so any failing or flaky test that requires investigation will be skipped from the execution with the @brokenInServerlessQA label.
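As a rough illustration of how a label like @brokenInServerlessQA can keep a spec out of an execution, here is a minimal TypeScript sketch. The `SpecInfo` type, the `selectSpecsForMki` function and the spec paths are hypothetical and exist only for this example; this is not the actual filtering code used in the Kibana pipelines, only a model of the idea.

```typescript
// Hypothetical model of a spec file and the labels attached to it.
interface SpecInfo {
  path: string;
  tags: string[];
}

// The label named in this issue for tests skipped on MKI.
const SKIP_TAG = '@brokenInServerlessQA';

// Keeps only the specs that do not carry the skip label.
function selectSpecsForMki(specs: SpecInfo[]): SpecInfo[] {
  return specs.filter((spec) => !spec.tags.some((tag) => tag === SKIP_TAG));
}

// Made-up spec paths, for illustration only.
const specs: SpecInfo[] = [
  { path: 'example/stable.cy.ts', tags: [] },
  { path: 'example/flaky.cy.ts', tags: [SKIP_TAG] },
];

const runnable: string[] = selectSpecsForMki(specs).map((spec) => spec.path);
console.log(runnable);
```

Removing the label from a spec is then all that is needed to bring it back into the execution once it has been fixed.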

Guarantee the retriability of our tests

We all know that flakiness may happen from time to time. Ideally, the only flakiness we should face is the one caused by external factors, such as slow machines or network issues.

With Cypress we have the test retries functionality enabled. Test retries have been configured with 1 retry attempt, so Cypress will retry a failed test one additional time (for a total of 2 attempts) before marking it as failed. When a test is re-executed, the beforeEach and afterEach hooks are re-run as well. However, failures in before and after hooks do not trigger a retry, and the test is marked as failed.
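For reference, a configuration with a single retry could look like the fragment below. This is a minimal sketch and not a copy of the Kibana configuration: the file name and the decision to use the numeric form of `retries` are assumptions.

```typescript
// cypress.config.ts (illustrative; the real Kibana configuration may differ)
import { defineConfig } from 'cypress';

export default defineConfig({
  // One retry: a failing test gets a second attempt before being reported as failed.
  // Cypress also accepts an object form, e.g. { runMode: 1, openMode: 0 },
  // to set different values for `cypress run` and `cypress open`.
  retries: 1,
});
```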

So, in order to have 'retriable' tests, we should get rid of the before and after hooks in favor of the beforeEach and afterEach hooks, or at least make sure that the code executed in the before and after hooks is not prone to fail (i.e. es_archiver).

Another thing we need to take into consideration to guarantee that a test can be retried is to make sure that the data that the test might generate is properly cleaned.

Each spec file is executed on a clean environment, but retries are not. Retries are executed on the same environment where the execution was initiated, which is why it is important to make sure that the data the test may generate is cleaned at the beginning.
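The effect of hooks and cleanup on retries can be sketched with a small, synchronous model. This is not Cypress itself: `runSpec`, the `Spec` type, the in-memory `rules` array and the simulated failure are all hypothetical stand-ins, written only to mirror the behavior described above (before hook failures are not retried, beforeEach runs on every attempt, and the environment is shared between attempts).

```typescript
// Simplified model of a spec: hook and test bodies are plain functions.
interface Spec {
  before?: () => void; // runs once; a failure here is never retried
  beforeEach?: () => void; // runs before every attempt, including retries
  test: () => void;
}

type Outcome = 'passed' | 'failed';

function runSpec(spec: Spec, retries: number): { outcome: Outcome; attempts: number } {
  try {
    if (spec.before) spec.before();
  } catch (e) {
    // A failing before hook fails the test without any attempt or retry.
    return { outcome: 'failed', attempts: 0 };
  }
  let attempts = 0;
  while (attempts <= retries) {
    attempts++;
    try {
      if (spec.beforeEach) spec.beforeEach();
      spec.test();
      return { outcome: 'passed', attempts };
    } catch (e) {
      // The attempt failed; loop again if retries remain.
    }
  }
  return { outcome: 'failed', attempts };
}

// Shared state standing in for the environment, which is NOT reset between retries.
let rules: string[] = [];
let flakyOnce = true;

function createRuleTest(): void {
  if (rules.indexOf('my rule') !== -1) {
    throw new Error('rule already exists'); // leftover data from a previous attempt
  }
  rules.push('my rule');
  if (flakyOnce) {
    flakyOnce = false;
    throw new Error('simulated transient failure'); // e.g. a slow machine
  }
}

// Without cleanup, the retry trips over data created by the first attempt.
const withoutCleanup = runSpec({ test: createRuleTest }, 1);

// With cleanup at the beginning of each attempt, the retry starts from a clean state.
rules = [];
flakyOnce = true;
const withCleanup = runSpec({ beforeEach: () => { rules = []; }, test: createRuleTest }, 1);

// A failing before hook is not retried at all.
const brokenBefore = runSpec(
  { before: () => { throw new Error('setup failed'); }, test: () => {} },
  1
);

console.log(withoutCleanup, withCleanup, brokenBefore);
```

In this model, the run without cleanup fails after two attempts, the run that cleans up in beforeEach passes on its second attempt, and the spec with the broken before hook fails without any attempt being made.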

Have several green executions in a row without flakiness

We cannot integrate tests until we have several green executions in a row.

Integration with the quality gate

Once we are sure that our tests are consistently passing on MKI, they will be integrated with the quality gate. Take into consideration that the executions are currently split by area team, so as soon as an area team has its tests ready, those will be integrated.

Tasks to be done

[ ] https://github.com/elastic/kibana/issues/173508
[ ] Investigations
- [x] Skip non-working tests
- [x] https://github.com/elastic/kibana/issues/175019
- [ ] https://github.com/elastic/kibana/issues/175095
- [x] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [ ] https://github.com/elastic/kibana/issues/180282
[ ] Explore
- [x] Skip non-working tests
- [x] https://github.com/elastic/kibana/issues/175020
- [ ] https://github.com/elastic/kibana/issues/175096
- [x] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [ ] https://github.com/elastic/kibana/issues/180283
[ ] Detection Engine
- [ ] https://github.com/elastic/kibana/issues/169185
- [x] https://github.com/elastic/kibana/issues/175021
- [ ] https://github.com/elastic/kibana/issues/175096
- [ ] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [ ] https://github.com/elastic/kibana/issues/180277
[ ] Rule management
- [x] Skip non-working tests
- [x] https://github.com/elastic/kibana/issues/175022
- [ ] https://github.com/elastic/kibana/issues/175098
- [x] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [ ] https://github.com/elastic/kibana/issues/180278
[ ] Entity Analytics
- [x] Skip non-working tests
- [x] https://github.com/elastic/kibana/issues/175023
- [ ] https://github.com/elastic/kibana/issues/175099
- [x] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [x] https://github.com/elastic/kibana/issues/180281
[ ] AI Assistant
- [x] Skip non-working tests
- [x] Remove before and after hooks
- [x] Make sure data is cleaned
- [x] Make sure tests are stable in MKI
- [ ] Integrate tests with the quality gate
- [ ] https://github.com/elastic/kibana/issues/180280

elasticmachine commented 9 months ago

Pinging @elastic/security-solution (Team: SecuritySolution)

banderror commented 7 months ago

@MadameSheema Just to double-check my understanding, is https://github.com/elastic/kibana/issues/173508 supposed to be worked on by the Eng. Productivity team, and tickets like these by each development team for the tests they own?

Skip non-working tests
https://github.com/elastic/kibana/issues/175022
https://github.com/elastic/kibana/issues/175098
Make sure tests are stable in MKI
Integrate tests with the quality gate

How do we "Make sure tests are stable in MKI" and "Integrate tests with the quality gate", is there any guidance?

Also, many of our tests that run against Serverless in CI (and they are stable) have been marked as @brokenInServerlessQA. I'm wondering why and what needs to be done to fix it.

MadameSheema commented 7 months ago

@banderror please find below my answers.

@MadameSheema Just to double-check my understanding, is https://github.com/elastic/kibana/issues/173508 supposed to be worked on by the Eng. Productivity team, and tickets like these by each development team for the tests they own?

Some tasks should be done by sec-eng-prod teams and some by the area teams since you have more knowledge of your area.

I'm happy to help by giving you guidelines and tips. I would do it myself, but I don't have available cycles.

Skip non-working tests

This is something the area teams can work on and I can give support.

https://github.com/elastic/kibana/issues/175022

Same.

https://github.com/elastic/kibana/issues/175098

Same.

Make sure tests are stable in MKI

This can be done by the area teams or us.

Integrate tests with the quality gate

This should be a coordinated effort.

How do we "Make sure tests are stable in MKI"

We have pipelines that trigger the tests using a real MKI environment. Currently, most of the tests are disabled due to failures; we can enable them and check the progress.

and "Integrate tests with the quality gate", is there any guidance?

This should be a coordinated effort; we can expand on it in our sync.

Also, many of our tests that run against Serverless in CI (and they are stable) have been marked as @brokenInServerlessQA. I'm wondering why

The tag was added long ago, when we were testing the MKI pipelines. At that time we didn't yet have SAML ready in our tests, or the capability of testing with different roles, so it may well be that the tests work correctly now.

Take into consideration that the tests on CI are executed using a mocked serverless environment. During the development of the pipelines we have faced slight differences that made the tests fail.

and what needs to be done to fix it.

Open Cypress in visual mode for MKI. You can check how to do it in our Cypress readme, in the section "Running serverless tests locally pointing to a MKI project created in QA environment (Second Quality Gate)". Execute the test and make sure that it passes. If it does, remove the label; if not, fix it and then remove the label :)

All of the above can be expanded on in our sync session.
