broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

`dockerImageUsed` call metadata key not reliably generated #4001

Open kshakir opened 6 years ago

kshakir commented 6 years ago

Similar to #3998 (backendStatus), but for the metadata key dockerImageUsed.

The engine writes this call metadata key when a job succeeds. The key may be missing when cromwell is restarted during centaur tests: after an automated restart, the calls end up being call cached, and the key isn't written on that path.

As a call cache hit technically doesn't have a dockerImage, we should decide, as in #3998, whether the key dockerImageUsed should be written for cache hits.

https://github.com/broadinstitute/cromwell/blob/9bee537c5f6a9ff4e8597f75b6844c0eaee721cc/engine/src/main/scala/cromwell/engine/workflow/lifecycle/execution/job/EngineJobExecutionActor.scala#L279-L281

Example log of a failure during WIP of #3658: dockerImageUsed_missing.txt

strattan commented 4 years ago

We experience this issue in a production environment, too. It's a big problem because in our environment cromwell is part of an automated system that collects new data, runs analysis workflows, and accessions the results to a public archive. Part of the provenance metadata that goes along with workflow runs is the docker image id that was used during the run. When the value for that key is missing, as it sometimes is, it breaks the code that passes that important provenance information on to the next level of metadata.
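For anyone consuming the metadata downstream in the meantime, here is a minimal sketch of a tolerant reader. It is not part of cromwell: the function name `docker_provenance` and the record fields are hypothetical, and it assumes the metadata JSON has the usual `calls` map of attempt lists, with `callCaching.hit` and `runtimeAttributes.docker` present on each attempt. Note that the fallback value is the image as requested in the task, which is not necessarily the resolved image that `dockerImageUsed` would have recorded, so the record keeps track of where the value came from.

```python
from typing import Any, Dict, List


def docker_provenance(metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one provenance record per call attempt from a workflow metadata dict.

    Assumed layout: metadata["calls"] maps call names to lists of attempt dicts.
    """
    records = []
    for call_name, attempts in metadata.get("calls", {}).items():
        for attempt in attempts:
            image = attempt.get("dockerImageUsed")
            source = "dockerImageUsed" if image is not None else None
            if image is None:
                # Assumed fallback: the image requested in the runtime attributes.
                # This may be a tag rather than the image that actually ran.
                image = (attempt.get("runtimeAttributes") or {}).get("docker")
                if image is not None:
                    source = "runtimeAttributes.docker"
            # Assumed field: callCaching.hit marks attempts satisfied from the cache.
            cache_hit = bool((attempt.get("callCaching") or {}).get("hit", False))
            records.append(
                {
                    "call": call_name,
                    "image": image,
                    "source": source,
                    "cache_hit": cache_hit,
                }
            )
    return records
```

Records with `source` set to `None` have no image information at all and can be flagged for manual review instead of crashing the pipeline.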

aednichols commented 4 years ago

Does your Cromwell routinely restart in the manner described in the ticket description? If you're using it in production, that seems less likely.