galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0
Losing jobs on restart while postprocessing #354

Open natefoo opened 8 months ago

natefoo commented 8 months ago

I can't investigate this fully at the moment, but I suspect this is possible because, after a job has left the cluster, Pulsar:

  1. creates $staging_dir/$job_id/final_status with contents "complete",
  2. removes $persistence_dir/${manager}-active-jobs/$job_id,
  3. performs postprocessing (writing outputs back to Galaxy), and
  4. creates $staging_dir/$job_id/postprocessed.

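Because $persistence_dir/${manager}-active-jobs/$job_id is removed before postprocessing completes, Pulsar would presumably not attempt to retry postprocessing after a restart, so a job interrupted between steps 2 and 4 would never have its outputs written back to Galaxy.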

EDIT: this definitely happens.
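
The ordering described above can be modeled with a small simulation. This is only a sketch, not Pulsar's code: `finish_job`, `jobs_recovered_on_restart`, `stranded_jobs`, the manager name `mgr`, and the job ID `42` are all hypothetical, and the recovery rule (only jobs with an entry under the active-jobs directory are resumed) is an assumption taken from the suspicion stated in this issue.

```python
import tempfile
from pathlib import Path


def finish_job(staging_dir, persistence_dir, manager, job_id, crash_before_postprocess=False):
    """Model the four steps in the order listed above (hypothetical, not Pulsar code)."""
    job_dir = staging_dir / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    active_marker = persistence_dir / f"{manager}-active-jobs" / job_id

    # Step 1: record the final status.
    (job_dir / "final_status").write_text("complete")
    # Step 2: remove the job from the active-jobs directory.
    active_marker.unlink()
    if crash_before_postprocess:
        return  # simulated restart: steps 3 and 4 never run
    # Step 3: postprocessing (writing outputs back) would happen here.
    # Step 4: mark postprocessing as done.
    (job_dir / "postprocessed").touch()


def jobs_recovered_on_restart(persistence_dir, manager):
    """Assumed recovery rule: only jobs with an active-jobs entry are resumed."""
    active_dir = persistence_dir / f"{manager}-active-jobs"
    return sorted(p.name for p in active_dir.iterdir())


def stranded_jobs(staging_dir):
    """Jobs that have a final status but were never marked as postprocessed."""
    return sorted(
        d.name
        for d in staging_dir.iterdir()
        if (d / "final_status").exists() and not (d / "postprocessed").exists()
    )


def simulate():
    """Run one job, interrupt it between steps 2 and 3, then inspect the state."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging"
        persistence = Path(tmp) / "persistence"
        staging.mkdir()
        active = persistence / "mgr-active-jobs"
        active.mkdir(parents=True)
        (active / "42").touch()  # the job is active while it runs on the cluster

        finish_job(staging, persistence, "mgr", "42", crash_before_postprocess=True)
        return jobs_recovered_on_restart(persistence, "mgr"), stranded_jobs(staging)
```

Calling `simulate()` returns the pair of lists (jobs a restart would pick up, jobs left stranded). In this model the interrupted job appears only in the second list. Under the same assumed recovery rule, removing the active-jobs entry only after the postprocessed marker exists would keep the job visible to a restart; whether that matches Pulsar's real recovery logic is not established by this sketch.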