Add functional tests for ILM & Data Stream Lifecycle

axw commented 1 month ago

We should implement functional tests that verify ILM or Data Stream Lifecycle Management is used as expected according to user configuration, and depending on whether a cluster is created new or upgraded. See https://github.com/elastic/apm-server/issues/13898#issuecomment-2326277518

endorama commented 6 days ago

I've been thinking about how to implement functional tests. I explored 3 possible approaches:

as part of smoke tests
as part of system tests
with a new approach independent of the 2 above

TL;DR: my conclusion is that we may want to proceed with option 3.

Option 1 heavily leverages ESS, creating infrastructure with Terraform on top of a custom-built Bash framework. We do not need ESS for functional testing, and the framework looks built for that purpose. Moreover, Bash's flexibility comes at a great cost in readability. The orchestration is also not clear to me, and documentation is lacking.

Option 2 uses Go tests, but is strictly focused on testing specific APM Server behaviors. As we now rely on the apm-data plugin, we may need to test interactions with different Elasticsearch versions and more complex scenarios, which do not fit the current model (which starts a single stack and runs all tests against it).

A third option would be to build a framework that looks similar to Option 2, with the guarantees provided by Go, but with a scope closer to Option 1. I created a simple stub in https://github.com/endorama/apm-server/blob/af3ca3744e4746d4c6d7f65a162927d6c9e19331/functionaltests/main_test.go#L31 using the testcontainers library and starting a specific Elasticsearch version in a container. The idea here would be to have the freedom to control individual stack components, allowing us to express complex upgrade and assertion scenarios. I have some concerns about this approach:

we would duplicate packages that already exist in systemtests, but those seem to lack the flexibility we would need to express complex cases
we would duplicate some logic from the smoke tests, but relying on Docker would allow faster testing, and using Go would make long-term maintenance easier

The third option looks the best to me when I think about future use cases (e.g. adding further cases to this logic https://github.com/elastic/apm-server/pull/13678 is more difficult in Bash than in Go), and I think a test framework must be ergonomic enough to encourage use.

I see potential for convergence in the long run, but that is out of scope, and I'm not sure how much weight it should have in the decision.

inge4pres commented 5 days ago

> depending on whether a cluster is created new or upgraded.

The upgrade part is going to be especially tricky IMO, because IIANM Elasticsearch will never allow an upgrade between a released version (e.g. 8.15.3) and an unreleased one (SNAPSHOT). If this is correct, I have no idea how we could run the upgrade test before a release is created.

endorama commented 5 days ago

Can we use BCs in the cloud first region? I'm not sure it is possible to upgrade to those, though.

endorama commented 5 days ago

I'll recap the discussions from today about how to move forward. I discussed this with @axw, @1pkg and @inge4pres.

My current stub uses testcontainers, but I was not aware that we had flakiness issues with it in systemtests, so it does not look like a great path forward. Additionally, Andrew noted that our customers mostly use ESS, not some Docker/Compose stack, and there are benefits in testing there.

Leveraging the current smoke tests does not look like the preferred path forward, as they are mostly Bash + CI, and this greatly limits both expressiveness in tests and reproducibility.

The current proposal would be to implement a new testing framework built on these principles:

we separate the 3 main layers we need: orchestration, fixtures and assertions
orchestration will be Terraform + Bash based; transient modifications for testing will happen outside Terraform and will produce some configuration drift, but that is not considered relevant (as infrastructure is expected to be cleaned up after tests)
fixtures and assertions will be Go based, and developed as reusable components we can use for both apm-server and serverless.

This approach would go towards the convergence mentioned in my previous comment, and not using testcontainers would help with converging earlier. The layers mentioned above should be "swappable", so that we can mix infrastructure/fixtures/assertions as needed based on testing scenarios. This could potentially also extend to running tests with a Docker stack or ECK, but that is out of scope at the moment.
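To make the "swappable layers" idea concrete, here is a minimal sketch of how the three layers could be expressed as Go interfaces. Every name here (Stack, Orchestrator, Fixture, Assertion, Scenario, fakeOrchestrator) is hypothetical and not taken from the stub; in the proposal the real orchestrator would wrap Terraform + Bash, while the in-memory fake below only exists to show the wiring.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Stack describes a running deployment (hypothetical shape).
type Stack struct {
	ElasticsearchURL string
	Version          string
}

// Orchestrator is the infrastructure layer (ESS via Terraform, Docker, ECK, ...).
type Orchestrator interface {
	Create(ctx context.Context, version string) (*Stack, error)
	Upgrade(ctx context.Context, s *Stack, version string) error
	Destroy(ctx context.Context, s *Stack) error
}

// Fixture puts the stack into a known state, e.g. by ingesting sample events.
type Fixture interface {
	Apply(ctx context.Context, s *Stack) error
}

// Assertion checks one expectation against the stack.
type Assertion interface {
	Check(ctx context.Context, s *Stack) error
}

// AssertionFunc adapts a plain function to the Assertion interface.
type AssertionFunc func(ctx context.Context, s *Stack) error

func (f AssertionFunc) Check(ctx context.Context, s *Stack) error { return f(ctx, s) }

// Scenario combines one implementation of each layer into an upgrade test.
type Scenario struct {
	From, To     string
	Orchestrator Orchestrator
	Fixtures     []Fixture
	After        []Assertion // checked once the upgrade has completed
}

// Run creates the stack, applies fixtures, upgrades, asserts, and always cleans up.
func (sc Scenario) Run(ctx context.Context) (err error) {
	stack, err := sc.Orchestrator.Create(ctx, sc.From)
	if err != nil {
		return fmt.Errorf("create %s: %w", sc.From, err)
	}
	defer func() { err = errors.Join(err, sc.Orchestrator.Destroy(ctx, stack)) }()

	for _, f := range sc.Fixtures {
		if err := f.Apply(ctx, stack); err != nil {
			return fmt.Errorf("fixture: %w", err)
		}
	}
	if err := sc.Orchestrator.Upgrade(ctx, stack, sc.To); err != nil {
		return fmt.Errorf("upgrade to %s: %w", sc.To, err)
	}
	for _, a := range sc.After {
		if err := a.Check(ctx, stack); err != nil {
			return fmt.Errorf("assertion: %w", err)
		}
	}
	return nil
}

// fakeOrchestrator records calls instead of touching real infrastructure.
type fakeOrchestrator struct{ calls []string }

func (f *fakeOrchestrator) Create(_ context.Context, v string) (*Stack, error) {
	f.calls = append(f.calls, "create "+v)
	return &Stack{Version: v}, nil
}

func (f *fakeOrchestrator) Upgrade(_ context.Context, s *Stack, v string) error {
	f.calls = append(f.calls, "upgrade "+v)
	s.Version = v
	return nil
}

func (f *fakeOrchestrator) Destroy(_ context.Context, _ *Stack) error {
	f.calls = append(f.calls, "destroy")
	return nil
}

// runDemo runs an 8.14.0 -> 8.15.1 scenario against the fake orchestrator.
func runDemo() ([]string, error) {
	orch := &fakeOrchestrator{}
	sc := Scenario{
		From: "8.14.0", To: "8.15.1",
		Orchestrator: orch,
		After: []Assertion{AssertionFunc(func(_ context.Context, s *Stack) error {
			if s.Version != "8.15.1" {
				return fmt.Errorf("unexpected version %s", s.Version)
			}
			return nil
		})},
	}
	err := sc.Run(context.Background())
	return orch.calls, err
}

func main() {
	fmt.Println(runDemo())
}
```

Because a Scenario only depends on the interfaces, the same fixtures and assertions could in principle run against a different Orchestrator without changes.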

We will also have to consider how to run tests in parallel: some tests may taint the Elasticsearch stack in a way that makes it unsafe to reuse for other tests, and some may not. It is not clear how to address this in our design at the moment, but for efficiency it would be interesting to be able to mark clusters as tainted, so we know which ones further tests can reuse.
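One possible shape for the tainting idea, purely as a sketch: a small pool that hands out clusters and lets each test declare, on release, whether it left the cluster in a reusable state. Pool, Cluster, Acquire and Release are hypothetical names, and the create callback stands in for whatever the orchestration layer would actually do.

```go
package main

import (
	"fmt"
	"sync"
)

// Cluster is a handle to a deployment that tests can borrow.
type Cluster struct {
	ID      string
	Version string
	tainted bool // set when a test made reuse unsafe
	inUse   bool
}

// Pool tracks clusters and decides which ones can be reused.
type Pool struct {
	mu       sync.Mutex
	clusters []*Cluster
	create   func(version string) *Cluster
}

// Acquire returns a free, untainted cluster at the given version,
// creating a new one when none qualifies. For simplicity creation happens
// under the lock; a real implementation would likely avoid that, since
// creating a cluster is slow.
func (p *Pool) Acquire(version string) *Cluster {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, c := range p.clusters {
		if !c.inUse && !c.tainted && c.Version == version {
			c.inUse = true
			return c
		}
	}
	c := p.create(version)
	c.inUse = true
	p.clusters = append(p.clusters, c)
	return c
}

// Release hands the cluster back. Passing tainted=true marks it as unsafe
// for reuse; a tainted cluster never becomes clean again.
func (p *Pool) Release(c *Cluster, tainted bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	c.inUse = false
	c.tainted = c.tainted || tainted
}

// newDemoPool builds a pool whose clusters are in-memory placeholders.
func newDemoPool() *Pool {
	n := 0
	return &Pool{create: func(version string) *Cluster {
		n++
		return &Cluster{ID: fmt.Sprintf("cluster-%d", n), Version: version}
	}}
}

func main() {
	p := newDemoPool()

	a := p.Acquire("8.15.1")
	p.Release(a, false) // read-only test: cluster stays reusable

	b := p.Acquire("8.15.1") // same cluster as a
	p.Release(b, true)       // e.g. an upgrade test: cluster is now tainted

	c := p.Acquire("8.15.1") // tainted cluster is skipped, a new one is created
	fmt.Println(a.ID, b.ID, c.ID)
}
```

Tainted clusters stay in the pool in this sketch, so they could still be torn down in one place at the end of the run.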

Regarding which test cases to run, we have a set already mentioned in https://github.com/elastic/apm-server/issues/13898#issuecomment-2326277518 that we should include. Additionally, we should include testing the upgrade path from versions before 8.13.0 to 8.15 and 8.16.

endorama commented 4 days ago

As per our latest discussion, I created a stub of a first test on the new framework we discussed. You can see it here: https://github.com/endorama/apm-server/blob/3b4ec398e8715b9b61ede38cb84aa5928d241492/testing/functional/test1/main_test.go

The first test I'm aiming for is testing the upgrade path from 8.14.0 to 8.15.1 as defined by:

Upgrade 8.14.x to 8.15.1+ with defaults: ILM should continue to be used for old indices, DLM should be used for new indices

I also added a README to clarify the overall idea.
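As an illustration of what the assertion for the quoted case could look like (this is not the code in the stub), the sketch below inspects a get data stream response and checks that backing indices that existed before the upgrade are still managed by ILM while newer ones use the data stream lifecycle. It assumes the response exposes `managed_by` per backing index and `next_generation_managed_by` per data stream with the values shown; these should be verified against the Elasticsearch versions under test. The function name, the sample payload and the index names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Values assumed to be reported by the get data stream API.
const (
	managedByILM = "Index Lifecycle Management"
	managedByDSL = "Data stream lifecycle"
)

type dataStreamsResponse struct {
	DataStreams []dataStream `json:"data_streams"`
}

type dataStream struct {
	Name                    string         `json:"name"`
	NextGenerationManagedBy string         `json:"next_generation_managed_by"`
	Indices                 []backingIndex `json:"indices"`
}

type backingIndex struct {
	IndexName string `json:"index_name"`
	ManagedBy string `json:"managed_by"`
}

// checkUpgradeLifecycle verifies the expectation "ILM for old indices,
// data stream lifecycle for new indices". oldIndices holds the names of
// backing indices recorded before the upgrade.
func checkUpgradeLifecycle(body []byte, oldIndices map[string]bool) error {
	var resp dataStreamsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("decode response: %w", err)
	}
	for _, ds := range resp.DataStreams {
		for _, idx := range ds.Indices {
			want := managedByDSL
			if oldIndices[idx.IndexName] {
				want = managedByILM
			}
			if idx.ManagedBy != want {
				return fmt.Errorf("%s: index %s managed by %q, want %q",
					ds.Name, idx.IndexName, idx.ManagedBy, want)
			}
		}
		// Indices created by future rollovers should also use the data stream lifecycle.
		if ds.NextGenerationManagedBy != managedByDSL {
			return fmt.Errorf("%s: next generation managed by %q, want %q",
				ds.Name, ds.NextGenerationManagedBy, managedByDSL)
		}
	}
	return nil
}

// sampleResponse is a hand-written, trimmed-down payload for illustration only.
const sampleResponse = `{
  "data_streams": [{
    "name": "traces-apm-default",
    "next_generation_managed_by": "Data stream lifecycle",
    "indices": [
      {"index_name": ".ds-traces-apm-default-2024.10.01-000001", "managed_by": "Index Lifecycle Management"},
      {"index_name": ".ds-traces-apm-default-2024.10.02-000002", "managed_by": "Data stream lifecycle"}
    ]
  }]
}`

func main() {
	old := map[string]bool{".ds-traces-apm-default-2024.10.01-000001": true}
	if err := checkUpgradeLifecycle([]byte(sampleResponse), old); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("lifecycle management matches expectations")
}
```

Keeping the check as a pure function over the response body means it can be unit tested without a cluster and reused by any orchestration layer that can fetch the data stream state.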

elastic / apm-server

Add functional tests for ILM & Data Stream Lifecycle #14100