cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Improve support for diagnostics of failed compilation: flag to preserve compilation source packages and logs #2481

Open gberche-orange opened 12 months ago

gberche-orange commented 12 months ago

Is your feature request related to a problem? Please describe.

As a bosh operator, in order to diagnose the root cause of a failed bosh release compilation, I need access to the compilation environment (such as compilation log files) after the compilation has failed, and I need to be able to retry the failing package compilation command, possibly with additional debugging flags.

Currently, after enabling the director.debug.keep_unreachable_vms property, I'm able to ssh into the compilation vm, but the bosh agent quickly removes the compilation data as observed in the following sample trace:

/var/vcap/bosh/log# less current 
[...]
2023-12-07_13:19:37.48530 [File System] 2023/12/07 13:19:37 DEBUG - Remove all /var/vcap/data/compile/galera

Describe the solution you'd like

A flag in the bosh director instructing the agent to skip cleaning the file system on compilation failure. This flag would potentially have the side effect of preventing reuse of the faulty compilation VM (as its file system might fill up, preventing new compilation jobs from succeeding).
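
To make the requested behavior concrete, here is a minimal sketch of the cleanup logic with such a flag. It is written in Python for illustration only and is not the bosh-agent's Go code; `compile_package` and `keep_on_failure` are hypothetical names, not existing agent or director options.

```python
import shutil
import subprocess
from pathlib import Path


def compile_package(compile_dir: Path, command: list[str],
                    keep_on_failure: bool = False) -> bool:
    """Run a packaging command inside compile_dir and return True on success.

    keep_on_failure is a hypothetical flag: when it is set and the command
    fails, compile_dir is left in place so an operator can inspect the
    sources and logs and retry the command by hand.
    """
    compile_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, cwd=compile_dir)
    succeeded = result.returncode == 0

    # Without the flag this mirrors the unconditional cleanup described in
    # this issue: the directory is removed whatever the outcome.
    if succeeded or not keep_on_failure:
        shutil.rmtree(compile_dir, ignore_errors=True)
    return succeeded
```

With the flag off, the directory is always removed; with it on, only successful compilations are cleaned up, which is why the VM should probably not be reused afterwards.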

Describe alternatives you've considered

The current workaround is to race with the bosh agent and make a file system copy of the /var/vcap/data/compile/<package> directory before the compilation job ends, and then work on the copy to perform compilation diagnostics and retries.

See the full diagnostic example in https://github.com/orange-cloudfoundry/paas-templates/issues/2209
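
For illustration, the racing copy could be scripted roughly as follows. This is a hedged sketch and not the exact procedure from the linked issue: `snapshot_compile_dir`, the polling interval and the destination path are assumptions, and the script would need to run on the compilation VM with enough privileges to read the compile directory.

```python
import shutil
import time
from pathlib import Path


def snapshot_compile_dir(source: Path, dest: Path,
                         timeout: float = 600.0, interval: float = 0.5) -> bool:
    """Repeatedly copy source into dest until source disappears or timeout.

    Returns True if at least part of source was copied into dest.
    """
    deadline = time.monotonic() + timeout
    copied = False
    while time.monotonic() < deadline:
        if source.is_dir():
            try:
                # Refresh the snapshot; later passes overwrite earlier files.
                shutil.copytree(source, dest, symlinks=True, dirs_exist_ok=True)
            except OSError:
                # Files can vanish mid-copy once the agent starts cleaning up;
                # keep whatever was captured so far.
                pass
            copied = copied or dest.is_dir()
        elif copied:
            # The agent removed the directory; the last snapshot is final.
            break
        time.sleep(interval)
    return copied
```

A call such as `snapshot_compile_dir(Path("/var/vcap/data/compile/galera"), Path("/tmp/galera-copy"))`, started before the packaging script fails, would leave a copy to work on; the `/tmp/galera-copy` destination is only an example.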

Additional context

The bosh agent currently seems to perform the cleanup unconditionally upon compilation completion. See the likely related sources below:

Clean up of the whole compilation directory https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L80-L85

after the packaging script completion https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L109-L113

rkoster commented 11 months ago

There is work underway to add the ability to compile bosh releases directly with the bosh-agent (https://github.com/cloudfoundry/bosh-agent/pull/315). This would allow for a docker-based local development workflow. Would that address your problem, or is this for debugging issues in production?

gberche-orange commented 11 months ago

Thanks @rkoster for the follow-up. Our use case is rather debugging issues when compiling 3rd party bosh releases, in particular during our automated pre-compilation pipelines, which we use to speed up bosh stemcell bumps.

/CC @o-orand @poblin-orange

We'll study in more depth whether replacing our pipeline, which uses bosh-director based release compilation, with new pipelines using the bosh agent could help with compilation reproducibility (more at https://github.com/orange-cloudfoundry/paas-templates/issues/2210 ) and ease diagnostics in case of errors.

At first sight, I think that bosh-agent based release compilation might make it harder to ensure reproducibility with the target IaaS stemcells used at runtime: we'd need the new pipelines running the bosh-agent to run on the target stemcell and IaaS infrastructure. We have observed in the past that a compilation failure is a useful signal of breaking changes in the infrastructure/stemcell. Such breaking changes might be harder to detect when running the compiled release.

The docker-based bosh-agent approach might therefore not suit our use case of debugging failing compilations: the overhead of recreating the build environment is likely larger than that of the workaround described at https://github.com/cloudfoundry/bosh/issues/2481#issue-2031174264