☂️ Enhance and Stabilise Druid E2E tests

unmarshall commented 5 months ago

What you would like to be added:

[ ] Use separate namespace for running e2e tests concurrently (using go-native tests)
[ ] Load the images using kind load, as loading images currently takes a long time during setup.
[ ] Replace ginkgo with native golang tests. We are removing ginkgo usage from druid. Many unit tests and some integration tests have already been migrated and will be merged with #777.
[x] Use KO to build images so that these are faster.
[ ] For tests that have failed, preserve their namespaces so that developers can debug. For tests that have passed, clean up the respective namespaces.
[ ] Do not stop the kind cluster at the end of the test run, or at least provide an option to keep it running. Cleanup is mandatory for the concourse pipeline, but for local runs, where failures need to be analysed, it should be possible to switch it off.
[ ] Support setting breakpoints in any test and debugging from the IDE.
[ ] Enable fast iterations with quasi hot-deploy (golang unfortunately does not support real hot-deploy).
[ ] Record and publish (as logs) startup times for etcd clusters. Ideally these should be recorded as metrics for all clusters managed by druid on dev/staging/canary/live landscapes. This will help understand any deterioration in the startup times across releases.
[ ] Add proper compaction and copy-backups-task tests to the e2e test suite.
[ ] e2e tests currently test only backup-enabled etcds, but do not test backup-disabled etcds (such as etcd-events in g/g).
[ ] Remove flakiness in tests - there is still some flakiness even after the fix that we made yesterday, and such flakes need to be removed to have deterministic test runs.
[ ] #807
- [ ] Add to CI pipeline with PR branch vs master branch (or previous release)
[ ] Test backward compatibility to previous druid version (support for downgrade)
[ ] Test reconciliation after an error in a previous reconciliation that caused the etcd cluster to become unready. This test would catch cases such as the one described in #818.
[ ] Generate all PKI artifacts to be used for e2e tests. This utility should be re-used for any tests (other than e2e tests) that require PKI artifacts.
[ ] https://github.com/gardener/etcd-backup-restore/issues/762

Motivation (Why is this needed?): E2E tests should be simple, comprehensive, fast and stable.

shreyas-s-rao commented 2 months ago

The manual tests that I generally run before merging large PRs cover different combinations of druid auto-reconcile disabled/enabled, single/multi-node etcds, backups disabled/enabled (with different providers), TLS disabled/enabled, etc., for various scenarios such as:

- etcd creation
- reconciliation
- spec changes
- scale-up of replicas (with different combinations of TLS disabled/enabled)
- hibernation/unhibernation (scale down to 0 and back up to original replicas)
- upgrade of druid from old to new version (with checks for etcd status reconciliation, and later spec reconciliation)
- compaction jobs
- copy-backups tasks
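The combinations described above could be enumerated programmatically instead of being maintained by hand. The following is a rough sketch under simplifying assumptions: a single TLS flag stands in for the client, peer and backup-restore TLS settings, multi-node means 3 replicas, and the etcdConfig type and buildMatrix function are hypothetical names.

```go
package main

import "fmt"

// etcdConfig describes one combination of settings to run the scenarios against.
// Hypothetical type for illustration only.
type etcdConfig struct {
	AutoReconcile  bool
	Replicas       int
	BackupProvider string // empty string means backups are disabled
	TLS            bool   // simplification: client, peer and backup-restore TLS toggled together
}

// name returns a stable identifier usable as a Go subtest name.
func (c etcdConfig) name() string {
	provider := c.BackupProvider
	if provider == "" {
		provider = "none"
	}
	return fmt.Sprintf("autoreconcile=%t/replicas=%d/backup=%s/tls=%t",
		c.AutoReconcile, c.Replicas, provider, c.TLS)
}

// buildMatrix returns the cartesian product of all settings for the given providers.
func buildMatrix(providers []string) []etcdConfig {
	var out []etcdConfig
	for _, auto := range []bool{false, true} {
		for _, replicas := range []int{1, 3} {
			for _, provider := range providers {
				for _, tls := range []bool{false, true} {
					out = append(out, etcdConfig{
						AutoReconcile:  auto,
						Replicas:       replicas,
						BackupProvider: provider,
						TLS:            tls,
					})
				}
			}
		}
	}
	return out
}

func main() {
	matrix := buildMatrix([]string{"", "Local"})
	fmt.Println("combinations:", len(matrix))
	for _, c := range matrix {
		fmt.Println(c.name())
	}
}
```

In a native Go test, each entry could be passed to t.Run(c.name(), ...) so that every combination shows up as its own subtest. A full product grows quickly with the number of providers, so a real suite would likely select a subset.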

Example: list of manual tests executed before merging #777

| TEST NAME | Druid Auto-Reconcile | Single/Multi Node | Backups (provider) | Etcd Client TLS | Etcd Peer TLS | EtcdBR TLS | TEST RESULT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Single | NA | FALSE | FALSE | FALSE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Single | NA | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Single | AWS | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | NA | FALSE | FALSE | FALSE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | NA | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | AWS | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | GCP | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | Azure | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | Openstack | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | FALSE | Multi | Local | TRUE | TRUE | TRUE | TRUE |
| Perform etcd spec changes, check if reconciliation triggered | FALSE | Multi | GCP | TRUE | TRUE | TRUE | TRUE |
| Scale-up etcd from single-node non-TLS to multi-node non-TLS, hibernate, unhibernate | FALSE | Single | GCP | FALSE | FALSE | FALSE | TRUE |
| Scale-up etcd from single-node non-TLS to multi-node TLS, hibernate, unhibernate | FALSE | Single | GCP | FALSE | FALSE | FALSE | TRUE |
| Scale-up etcd from single-node TLS to multi-node TLS, hibernate, unhibernate | FALSE | Single | NA | TRUE | TRUE | TRUE | TRUE |
| Upgrade druid from master to #777, check status updates, add reconcile annotation, check reconciliation | FALSE | Multi | GCP | TRUE | TRUE | TRUE | TRUE |
| Deploy etcdcopybackupstask, check success | FALSE | Multi | Local | TRUE | TRUE | TRUE | TRUE |
| Configure compaction with low threshold, populate etcd, check if compaction jobs are triggered and run | FALSE | Single | AWS | TRUE | TRUE | TRUE | TRUE |
| Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd | TRUE | Multi | GCP | TRUE | TRUE | TRUE | TRUE |
| Perform etcd spec changes, check if reconciliation triggered | TRUE | Multi | GCP | TRUE | TRUE | TRUE | TRUE |

unmarshall commented 1 month ago

https://github.com/gardener/etcd-druid/pull/833 introduced namespace separation but this will be completely re-written.

☂️ Enhance and Stabilise Druid E2E tests #782