fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Improve package building and testing. #3753

Closed niedbalski closed 3 months ago

niedbalski commented 3 years ago

Problem Description

The current workflow for building packages is mostly manual. We have some automated testing in place, namely this workflow [0]. Publication isn't automated, and we don't have a staging repository for testing installs and upgrades before packages reach the release bucket.

Proposed solution

  1. Create a workflow based on [0] that builds the packages for all the supported distributions and architectures.
  2. The workflow should publish the package artifacts for each tagged release to a staging repository in S3.
  3. Workflow 1) triggers a workflow that runs a series of verification tests against the staging repository for all the supported distributions and architectures. Sanity testing should include:
  4. If workflow 3) succeeds, an automated propagation step should move the staging packages into the releases repository.

[0] https://github.com/fluent/fluent-bit/blob/master/.github/workflows/build-release.yaml
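A minimal sketch of the gating described in the steps above, written as a toy in-memory model rather than a real workflow. The class name, the `verify` callback, and the target names in the usage note are all hypothetical; the only point is that nothing reaches the releases repository unless every staged target passes verification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ReleasePipeline:
    """Toy model of the proposed flow: build -> stage -> verify -> promote."""

    # Hypothetical check run per (target, package); returns True on success.
    verify: Callable[[str, str], bool]
    staging: Dict[str, str] = field(default_factory=dict)
    releases: Dict[str, str] = field(default_factory=dict)

    def stage(self, target: str, package: str) -> None:
        # Step 2: a built artifact is published to staging only.
        self.staging[target] = package

    def promote(self) -> bool:
        # Steps 3 and 4: promote only if every staged target verifies.
        failed = [t for t, p in self.staging.items() if not self.verify(t, p)]
        if failed:
            return False  # releases is left untouched
        self.releases.update(self.staging)
        return True
```

For example, staging two targets and failing verification on one of them leaves `releases` empty, which is the all-or-nothing behaviour step 4 implies.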

Known Limitations

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 5 days with no activity.

patrick-stephens commented 2 years ago

Got a limited POC going now that does all this in a single repo with an action: it pushes packages to S3 and images to GHCR. These are all staging; the next stage is to test and "bless", i.e. release.

patrick-stephens commented 2 years ago

Further discussion with @niedbalski has clarified a few things:

edsiper commented 2 years ago

@niedbalski @patrick-stephens

S3 should not be used for releases. Many users and customers have restricted access to S3 buckets and have whitelisted the fluentbit domains so they can mirror the repos locally. We should continue using the native repos.

niedbalski commented 2 years ago

@edsiper @patrick-stephens

S3 can handle custom domains, so the domain mapping shouldn't change: any existing whitelist for packages.fluentbit.io and apt.fluentbit.io should remain valid. In fact, we are aiming for the release bucket to keep exactly the same layout/structure.

Enabling S3 has many benefits for us, including CDN, replication, backup, simplified releases, etc.

patrick-stephens commented 2 years ago

The current plan is therefore to run a parallel workflow: we maintain the current process but also start producing the S3 release bucket so we can evaluate it. We also need to keep build times low, possibly by using a self-hosted runner.

niedbalski commented 2 years ago

@patrick-stephens

Here is my take for testing on top of staging:

  1. Images
  2. Packages

patrick-stephens commented 2 years ago

Agreed, I think for golden config I'll add a dummy input & stdout output to exercise the pipeline a bit. This is what I've done previously and then you can easily check for the expected output too. Eventually we can evolve this to do more if we want.

In fact, the default config might be fine - it's a shame that the server is not defaulted to running (I know people get tripped up on the helm chart healthchecks by this). It does CPU and stdout already.
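A rough sketch of what such a golden config check could look like: one helper renders a classic-format config with a dummy input and a stdout output, and another checks captured output for the expected record. The marker value and both function names are assumptions for illustration; actually launching fluent-bit and capturing its output is left out.

```python
GOLDEN_MARKER = "golden-config-check"  # hypothetical marker value


def golden_config(marker: str = GOLDEN_MARKER) -> str:
    """Render a classic-format config: dummy input -> stdout output."""
    return "\n".join([
        "[SERVICE]",
        "    Flush 1",
        "",
        "[INPUT]",
        "    Name  dummy",
        # The dummy input emits this JSON record repeatedly.
        f'    Dummy {{"message": "{marker}"}}',
        "",
        "[OUTPUT]",
        "    Name  stdout",
        "    Match *",
        "",
    ])


def output_contains_marker(captured: str, marker: str = GOLDEN_MARKER) -> bool:
    """Return True if any captured output line mentions the marker."""
    return any(marker in line for line in captured.splitlines())
```

The check only looks for the marker string, so it does not depend on the exact line format the stdout output uses.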

patrick-stephens commented 2 years ago

Staging build is almost there now, just resolving some GPG signing issues but should present an S3 bucket with all the repos set up correctly. Container images built, scanned (Trivy + Dockle) and signed (Cosign) before staging to ghcr.io.

Container testing as per the above is in place - verify each architecture image locally then use the Helm chart to verify in K8S deployment (whatever is the default in KIND when run). Package verification is in progress using kitchen-dokken: OS-based images for each target have the package installed and then we verify the service is running.

patrick-stephens commented 2 years ago

We will also look to trigger downstream integration and soak tests in staging to verify more things. @niedbalski I'll add workflow_call and workflow_dispatch to https://github.com/calyptia/fluent-bit-ci/blob/main/.github/workflows/main-gcp.yaml. We then need to set up the soak test to provide some level of automatic verification, while still requiring manual approval for release.

We should also incorporate the suggestions from here: https://github.com/fluent/fluent-bit/issues/4389

niedbalski commented 2 years ago

In regards to integration testing:

  1. The staging build workflow will kick off an external run on [0] using the new import semantics.
  2. The workflow [0] will kick off a new set of integration tests based on the staging images provided via a workflow parameter.

[0] https://github.com/calyptia/fluent-bit-ci/blob/main/.github/workflows/main-gcp.yaml#L7
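One possible way to kick off the external run with a parameter is the GitHub REST endpoint for workflow_dispatch events; note this is a swapped-in technique for illustration, not the workflow_call import mentioned above. The sketch below only builds the request and does not send it. The function name and the `staging_image` input name in the usage note are hypothetical, and the target workflow would need to declare a matching input.

```python
import json
import urllib.request


def build_dispatch_request(owner: str, repo: str, workflow_file: str,
                           ref: str, inputs: dict,
                           token: str) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for GitHub's REST API."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # "ref" selects the branch or tag to run on; "inputs" carries parameters.
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Calling it with `inputs={"staging_image": "<image reference>"}` and passing the result to `urllib.request.urlopen` would trigger the run, given a token with permission to dispatch workflows in the target repository.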

niedbalski commented 2 years ago

@patrick-stephens As a reference for the build/release to staging workflows.

For 4, that is covered by the private mirror due to the security concerns.

patrick-stephens commented 2 years ago

Need to add resilience and performance testing: https://github.com/fluent/fluent-bit/discussions/4390

patrick-stephens commented 2 years ago

We need to support package downgrade as well, i.e. official --> staging --> official, with everything still working afterwards. More distributions need to be tested too.
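The round trip above could be driven by something like the following sketch. The `install` and `service_ok` callbacks are hypothetical stand-ins for whatever installs the package from a given repository and checks that the service is running; they are injected so the sequence logic can be exercised without a real package manager.

```python
from typing import Callable, List, Sequence, Tuple


def verify_repo_round_trip(
    install: Callable[[str], None],
    service_ok: Callable[[], bool],
    sequence: Sequence[str] = ("official", "staging", "official"),
) -> List[Tuple[str, bool]]:
    """Install from each repo in order and record whether the service works."""
    results = []
    for repo in sequence:
        install(repo)  # upgrade or downgrade, depending on position
        results.append((repo, service_ok()))
    return results


def round_trip_passed(results: List[Tuple[str, bool]]) -> bool:
    """The round trip passes only if the service worked after every step."""
    return all(ok for _, ok in results)
```

Recording a result per step, rather than stopping at the first failure, shows whether it was the upgrade to staging or the downgrade back to official that broke things.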

patrick-stephens commented 2 years ago

Working on adding the release promotion job now:

https://github.com/fluent/fluent-bit/issues/4566

patrick-stephens commented 2 years ago

Packages (RPM + Deb) look OK now, working on the container release next.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

edsiper commented 3 months ago

is this ok to close?