Open gberche-orange opened 12 months ago
There is work underway to add the ability to compile bosh releases directly with the bosh-agent (https://github.com/cloudfoundry/bosh-agent/pull/315). This would allow for a docker-based local development workflow. Would that address your problem, or is this for debugging issues in production?
Thanks @rkoster for the follow-up. Our use case is rather debugging failures when compiling third-party bosh releases, in particular in the automated pre-compilation pipelines we use to speed up bosh stemcell bumps.
/CC @o-orand @poblin-orange
We'll study in more depth whether replacing our current pipelines, which compile bosh releases through the bosh director, with new pipelines using the bosh agent could improve compilation reproducibility (more at https://github.com/orange-cloudfoundry/paas-templates/issues/2210 ) and ease diagnostics when errors occur.
At first sight, I think bosh-agent-based release compilation would make it harder to reproduce the target IaaS stemcells used at runtime: the new pipelines running the bosh agent would need to run on the target stemcell and IaaS infrastructure. We have observed in the past that a compilation failure is a useful signal of breaking changes in the infrastructure or stemcell. Such breaking changes might be harder to detect when running the compiled release.
The docker-based bosh-agent approach might therefore not suit our use case of debugging failing compilations: the overhead of recreating the build environment is likely larger than that of the workaround described at https://github.com/cloudfoundry/bosh/issues/2481#issue-2031174264
Is your feature request related to a problem? Please describe.
As a bosh operator, in order to diagnose the root cause of a bosh release compilation failure, I need access to the compilation environment (such as compilation log files) after the compilation has failed, and I need to be able to retry the failing package compilation command, possibly with additional debugging flags.
Currently, after enabling the director.debug.keep_unreachable_vms property, I'm able to ssh into the compilation vm, but the bosh agent quickly removes the compilation data as observed in the following sample trace:
Describe the solution you'd like
A flag in the bosh director to skip cleaning the file system on compilation failure. This flag would potentially have the side effect of preventing reuse of the faulty compilation VM (as its file system might fill up, preventing new compilation jobs from succeeding).
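To make the request concrete, here is a minimal model of the behavior I have in mind. It is written in Python purely for illustration (the agent itself is Go), and every name in it (`CompileOutcome`, `compile_with_optional_cleanup`, `keep_on_failure`) is hypothetical rather than an existing director or agent API.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CompileOutcome:
    succeeded: bool
    compile_dir_kept: bool  # True when the directory was left in place for diagnosis
    vm_reusable: bool       # False when leftovers make the VM unsafe to reuse


def compile_with_optional_cleanup(compile_dir, run_packaging, keep_on_failure=False):
    """Run a packaging step, then clean up unless it failed and the flag is set.

    `run_packaging` is a hypothetical callable that receives the compile
    directory and raises an exception on failure.
    """
    compile_dir = Path(compile_dir)
    try:
        run_packaging(compile_dir)
        succeeded = True
    except Exception as err:
        print(f"packaging failed: {err}")
        succeeded = False

    if not succeeded and keep_on_failure:
        # Requested behavior: keep the files and mark the VM as not reusable,
        # since the leftover data could fill up its file system.
        return CompileOutcome(succeeded=False, compile_dir_kept=True, vm_reusable=False)

    # Behavior without the flag: always remove the compile directory.
    shutil.rmtree(compile_dir, ignore_errors=True)
    return CompileOutcome(succeeded=succeeded, compile_dir_kept=False, vm_reusable=True)
```

Returning `vm_reusable=False` is how this sketch models the side effect mentioned above: whichever component schedules compilations would have to stop handing new packages to that VM.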
Describe alternatives you've considered
The current workaround is to race with the bosh agent and make a file system copy of the
/var/vcap/data/compile/\<package\>
directory before the compilation job ends, and then work on the copy to perform compilation diagnostics and retries. See the full diagnostic example at https://github.com/orange-cloudfoundry/paas-templates/issues/2209
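For readers who want to script this workaround, here is a rough Python sketch of both steps: mirroring the compile directory while it still exists, then re-running the packaging script from the copy. The function names and parameters are made up for this example. It also assumes that the packaging script is a file named `packaging` at the root of the compile directory and that it reads the `BOSH_COMPILE_TARGET`, `BOSH_INSTALL_TARGET` and `BOSH_PACKAGE_NAME` environment variables, which is my understanding of the documented packaging environment; adjust it if your release differs.

```python
import os
import shutil
import subprocess
import time
from pathlib import Path


def snapshot_compile_dir(package, dest_root, compile_root="/var/vcap/data/compile",
                         poll_interval=0.5, timeout=900.0):
    """Mirror <compile_root>/<package> into <dest_root>/<package> while it exists.

    Returns the copy's path, or None if the directory never appeared in time.
    """
    source = Path(compile_root) / package
    target = Path(dest_root) / package
    seen = False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if source.is_dir():
            seen = True
            try:
                # Overlay the latest state onto the copy (needs Python 3.8+).
                shutil.copytree(source, target, dirs_exist_ok=True,
                                ignore_dangling_symlinks=True)
            except OSError as err:
                # Files may vanish mid-copy once the agent starts cleaning up;
                # keep whatever was already copied.
                print(f"incomplete pass over {source}: {err}")
        elif seen:
            break  # the directory is gone, so the copy is as complete as it gets
        time.sleep(poll_interval)
    return target if seen and target.is_dir() else None


def retry_packaging(compile_copy, install_target, package_name, bash_flags=("-x",)):
    """Re-run the `packaging` script from a copied compile directory.

    Returns (exit_code, combined stdout and stderr).
    """
    compile_copy = Path(compile_copy).resolve()
    install_target = Path(install_target).resolve()
    install_target.mkdir(parents=True, exist_ok=True)
    env = dict(
        os.environ,
        BOSH_COMPILE_TARGET=str(compile_copy),
        BOSH_INSTALL_TARGET=str(install_target),
        BOSH_PACKAGE_NAME=package_name,
    )
    result = subprocess.run(
        ["bash", *bash_flags, "packaging"],
        cwd=compile_copy, env=env, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return result.returncode, result.stdout
```

A typical use would be to start `snapshot_compile_dir("<package>", "/var/vcap/data/compile-copies")` in an ssh session on the compilation VM before the failure occurs, then call `retry_packaging` on the returned path with whatever extra `bash` flags help. Repeated full copies are expensive for large source trees, and the last pass may be partial if the cleanup starts in the middle of it, so this remains a race rather than a fix.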
Additional context
The bosh agent currently seems to perform the cleanup unconditionally upon compilation completion. See the likely related sources below:
Clean up of the whole compilation directory https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L80-L85
after the packaging script completion https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L109-L113