For reference:
Testing on Kubernetes written by @ChrsMark: https://github.com/elastic/integrations/blob/master/testing/environments/kubernetes/README.md (probably outdated now as there were many changes introduced to Fleet)

We need to cover the following providers (correct me if I missed any of them):

Notes:

Other use cases:

Technical observations:
- `terraform` tool to be installed locally.

Questions:
@kaiyan-sheng @narph @ChrsMark
Would you mind describing here the use cases for AWS, Azure and Kubernetes? I'm looking forward to seeing how these cloud/infra providers can be used for testing integrations.
Thanks for the ping @mtojek, I will try to provide a scenario, with inline comments/thoughts that would cover our k8s needs.
1. `elastic-package k8s up` to bring up a k8s cluster. I don't think we should care where it is, on GKE or locally on minikube or kind. Maybe it will be better to have it running on GKE for now to avoid an extra step of minikube/kind installation (?). In this step all the required prerequisites should happen, like installing `kube_state_metrics`, from which the `state_*` metricsets will collect metrics (see the Terraform sketch after the notes below).
2. `elastic-package test k8s` (the syntax is abstract here for the sake of the example) so as to deploy the Agent on the running k8s cluster and enroll it with the Elastic Stack. The Elastic Stack should maybe be running on the same k8s cluster so as to have easier networking configuration, to my mind (similar to the approach mentioned in testing on k8s). After the test is completed the cluster is still up and the agents are still shipping metrics. To clean this up we need to run the next command to bring the whole cluster down.
3. `elastic-package k8s down`, which will destroy the cluster, including the Elastic Stack and Agents.

Note: I think this scenario can be expanded to test other packages like `istio` and `ingress-controller` by adding them as extra flags in step 1.
Note 1: This is only for testing the k8s module, but it should be quite similar for testing Autodiscover.
Note 2: The "Running Agent on k8s" thing is not yet completely decided. Progress/discussion happens around this at the k8s-agent WP, cc: @blakerouse.
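For illustration of the prerequisite installation in step 1, here is a minimal sketch assuming the runner ends up being Terraform-based (which is where this thread converges) and that a kubeconfig for the freshly created cluster is already available. The chart name and repository are the public prometheus-community ones and nothing here is provided by elastic-package today:

```hcl
# Hypothetical prerequisite step for the scenario above: install
# kube-state-metrics on an existing cluster so that the state_* metricsets
# have something to collect from. Assumes a kubeconfig at ~/.kube/config.
provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"
  }
}

resource "helm_release" "kube_state_metrics" {
  name       = "kube-state-metrics"
  namespace  = "kube-system"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-state-metrics"
}
```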
For AWS testing, we can use a terraform script (or anything similar) per dataset/package to create AWS services for testing and clean them up after testing. I think we have an AWS account for testing in Beats Jenkins (@jsoriano knows more about this) and we can leverage it here.
For metrics: an example can be that we run `elastic-package test ec2-metrics` locally to apply the terraform script to create an EC2 instance in AWS, wait for a while until EC2 metrics are sent to CloudWatch, check the events collected by the ec2-metrics package, and delete the EC2 instance at the end (a sketch of such a template follows below).
For logs: We have sample files to test the pipelines already, but it would be good to have terraform set up S3-SQS to test the inputs.
There are two use cases here: one is to run this in CI and the other one is for package developers to test locally. Because creating services can be cost-inefficient, we should consider how frequently we should run `elastic-package test ec2-metrics` in CI.
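As an illustration of the per-data-stream Terraform definition described above, here is a minimal sketch; the resource names and the AMI filter are hypothetical placeholders, not the actual aws package template, and credentials/region are expected to come from the environment:

```hcl
# Hypothetical template: create one EC2 instance whose CloudWatch metrics
# the ec2-metrics data stream can then pick up; destroyed after the test.
provider "aws" {}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "test" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  tags = {
    Name = "elastic-package-test-ec2-metrics"
  }
}
```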
> There are two use cases here: one is to run this in CI and the other one is for package developers to test locally. Because creating services can be cost-inefficient, we should consider how frequently we should run `elastic-package test ec2-metrics` in CI.
With this PR https://github.com/elastic/integrations/pull/474 tests will be executed only if the relevant packages are changed (in this case AWS integration) or this is the master branch.
Regarding `elastic-package test k8s` and `elastic-package test ec2-metrics`: I think we need to come up with an open, flexible API, so that we don't have to modify the CLI every time we introduce a new platform, but this is something we'll research :) I admit I haven't looked at k8s as a separate stack, rather as a service under test that is alive for the duration of a test. Keeping it as a separate stack (like the Elastic Stack) might actually simplify things.
For Azure, we can look at something similar to the use case above. I previously worked on a POC using Pulumi which will authenticate the user, create a storage account, fetch metrics, validate them, and then remove the entire deployment. I hope it is of interest here: https://github.com/elastic/beats/pull/21850. Maybe something like `elastic-package test azure storage` could replace the entire process.
For Azure logs, more steps are required; for example, after creating the event hub we will have to populate it with some valid/invalid messages.
Not sure how much detail we should go into in this issue.
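Since the thread converges on Terraform rather than Pulumi, here is roughly what the storage-account part of that flow could look like as a Terraform template; all names are hypothetical and the storage account name would have to be globally unique:

```hcl
# Hypothetical sketch of the Azure storage scenario: create a resource group
# and a storage account, let the azure package fetch metrics for it, then
# destroy everything once the test completes.
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "test" {
  name     = "elastic-package-test"
  location = "West Europe"
}

resource "azurerm_storage_account" "test" {
  name                     = "elasticpkgtest01" # must be globally unique
  resource_group_name      = azurerm_resource_group.test.name
  location                 = azurerm_resource_group.test.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```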
I'm going with this issue.
Thank you for all the feedback, folks! We had a sync-up with @ycombinator to discuss possible options. Technically, we'll try to implement a generic Terraform-based test runner. We wouldn't like to include AWS/Azure/K8s references in the CLI; let's try to make it as generic as possible. The approach will be truly declarative, which is in line with the original principle (no programming language is required).
Here is a list of action items to help us solve this issue.
Dev changes in package-spec:

- [ ] Allow for data-stream level `_dev/deploy` definitions - https://github.com/elastic/package-spec/pull/111
  or
- [ ] ~~Mount extra files for data-stream in runtime (it may prevent from building the image multiple times)~~

@ycombinator, I still have doubts which path we should follow. If you have any preferences or see benefits of either of them, please feel free to share.

Changes in elastic-package:

- [ ] Use `_dev/deploy` for the data stream first (if available) - https://github.com/elastic/elastic-package/pull/228
- [ ] ~~Consider shortening the total build time of Docker services (build them at most once)~~
- [ ] `tf` service deployer - a Docker image which can execute provided terraform templates or proxy traffic for the Elastic-Agent. The Docker container will manage the lifecycle of created cloud components (machines, buckets, databases) - https://github.com/elastic/elastic-package/pull/227

Changes in integrations:
Thanks for the heads-up @mtojek! Feel free to reach out to me if you guys have any questions about the k8s specifics, since it can be tricky with the different components we collect from, unlike other clouds where we define a single exposed endpoint.
> With this PR elastic/integrations#474 tests will be executed only if the relevant packages are changed (in this case AWS integration) or this is the master branch.
Great, thank you!
Thanks for the write-up and breakdown of tasks, @mtojek. Very helpful!
> Dev changes in package-spec:
>
> - [ ] Allow for data-stream level `_dev/deploy` definitions
>   or
> - [ ] Mount extra files for data-stream in runtime (it may prevent from building the image multiple times)
>
> @ycombinator, I still have doubts which path we should follow. If you have any preferences or see benefits of either of them, please feel free to share.
I recall discussing the first option (Allow for data-stream level `_dev/deploy` definitions) in our meeting today but not the second one (Mount extra files for data-stream in runtime (it may prevent from building the image multiple times)). Would you mind explaining some details about the second option? Thanks.
(I came to this point based on observing the Zeek integration.)
I can elaborate on this. Imagine we have an integration XYZ with data streams A, B, C, ... Z. Every data stream is basically the same Docker image with a terraform executor and its own set of static tf templates. The improvement is to use a single Docker image and simply mount (switch) templates for the data stream test scenario. This way it will be faster than building a new Docker image for each data stream.
I always assumed (but probably didn't make it explicit, sorry!) that there would be one shared/common TF executor Docker image that is used by the TF service deployer. The definition and maintenance of this image is the responsibility of elastic-package developers, as opposed to that of package developers.
The part that varies is the TF templates, whether those come from the package level (`{package}/_dev/deploy/tf/...`) or the data stream level (`{package}/data_stream/{data stream}/_dev/deploy/tf/...`). The definition and maintenance of these templates is the responsibility of package developers.
So I think we're on the same page?
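For illustration, using the aws package and its ec2_metrics data stream mentioned later in this thread, the two possible template locations would be (hypothetical layout):

```
packages/aws/_dev/deploy/tf/main.tf                          # package-level TF templates
packages/aws/data_stream/ec2_metrics/_dev/deploy/tf/main.tf  # data-stream-level TF templates
```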
> The part that varies is the TF templates, whether those come from the package level (`{package}/_dev/deploy/tf/...`) or the data stream level (`{package}/data_stream/{data stream}/_dev/deploy/tf/...`). The definition and maintenance of these templates is the responsibility of package developers.

I agree with the rest of your comment. Regarding the quoted paragraph: what is the best way of processing these TF templates (belonging to particular data streams)? Load them at runtime? Include them at build time (one image build per data stream)?
(I think we're on the same page, just confirming the implementation details :)
> Load them at runtime? Include them at build time (one image build per data stream)?
There is also a third option: include all of them at image build time (so you are not building one image per data stream) and then select the right data stream's templates at runtime.
At any rate, I don't know if there's an obvious answer to this one. I would suggest trying one of the options, probably the one you think is simplest to implement, see how well it performs and then iterate from there as necessary.
+1 to implement this as a generic declarative Terraform-based runner :+1:
Some comments in case they are helpful:
- In elastic/beats#17656 Blake extended `mage goIntegTest` for Metricbeat to be able to run tests in Kubernetes (with kind) apart from the usual docker compose. There it was also done in a generic way, one provider or the other was used depending on the available files. A similar approach could be followed here to continue supporting `docker-compose`, or if we want to support other providers in the future.
- `elastic-package` should always provide some base resources when some specific providers are used, so scenarios can be simpler. Same thing with kubernetes: a scenario could define some kubernetes resources, but `elastic-package` would provide the cluster and the credentials.

Thank you for sharing your thoughts, lots of tricky ideas ;) I like the idea of kops.
> In elastic/beats#17656 Blake extended `mage goIntegTest` for Metricbeat to be able to run tests in Kubernetes (with kind) apart from the usual docker compose. There it was also done in a generic way, one provider or the other was used depending on the available files. A similar approach could be followed here to continue supporting `docker-compose`, or if we want to support other providers in the future.
Honestly, I think we're not there yet. First, the Elastic-Agent needs to support autodiscovery and the Kubernetes runtime. Then we can think about potential integrations. Keep in mind that we'd like to examine integrations, not the entire end-to-end flow. I would leave the verification of the Elastic-Agent functionality in different runtimes to the Agent or e2e-tests.
@mtojek @ycombinator FYI, for k8s package testing I'm using some mock APIs so as to proceed until we reach a more permanent solution. You can find more at https://github.com/elastic/integrations/pull/569.
While working with these mocks I realise even more the need for running against an actual k8s cluster and, more specifically, having the Agent deployed on the cluster natively. Without this, many things we need, like k8s tokens, certs, etc., will not be valid.
> While working with these mocks I realise even more the need for running against an actual k8s cluster and, more specifically, having the Agent deployed on the cluster natively. Without this, many things we need, like k8s tokens, certs, etc., will not be valid.
This is super valuable information. @mtojek and I have informally discussed the idea that for some service deployers it might make sense to deploy the Agent "alongside" the service; your findings seem to be along these lines, so this is very valuable feedback. Thank you!
@kaiyan-sheng AWS integration can be tested now using the Terraform executor (sample here: https://github.com/elastic/integrations/tree/master/packages/aws/data_stream/ec2_metrics).
@narph this feature is written in a generic way. If you pass secrets for Azure and write some TF code, it's expected to work.
EDIT: we just need to enable secrets on the Jenkins side, but it shouldn't be a big issue (unless we don't have them generated at all).
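For context on the "pass secrets for Azure" part: the azurerm Terraform provider can read its credentials from the standard ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID environment variables, so a template only needs an empty provider block plus the resources under test; how exactly those secrets get exposed in Jenkins is the open part. A minimal, hypothetical sketch:

```hcl
# Hypothetical: no credentials in the template itself; the provider picks up
# ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID
# from the environment prepared by the test runner / CI.
provider "azurerm" {
  features {}
}
```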
Let me summarize it. We've delivered (and applied in Integrations):

- `kind` and potentially additional resources (e.g. custom application deployment).
Follow up to #64.
Currently the system test runner only supports the Docker Compose service deployer. That is, it can only test packages whose services can be spun up using Docker Compose. We should add more service deployers to enable system testing of packages such as:

- `system` (probably a no-op or minimal service deployer),
- `aws` (probably some way to pass connection parameters and credentials via environment variables and/or something that understands terraform files; a sketch follows below),
- `kubernetes`.
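To make the `aws` bullet above a bit more concrete (the sketch referenced in that item): one way a Terraform-aware service deployer could pass connection parameters is via TF_VAR_* environment variables, which Terraform maps onto declared variables. The variable name TEST_RUN_ID and the SQS queue below are purely illustrative assumptions, not part of any existing package:

```hcl
# Hypothetical sketch: the service deployer exports credentials
# (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) plus
# parameters such as TF_VAR_TEST_RUN_ID, then runs terraform apply/destroy
# on the package-provided templates.
variable "TEST_RUN_ID" {
  type    = string
  default = "detached"
}

provider "aws" {}

resource "aws_sqs_queue" "test" {
  name = "elastic-package-test-${var.TEST_RUN_ID}"
}
```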