galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Clean the job directory when new jobs are received #284

Open natefoo opened 2 years ago

natefoo commented 2 years ago

I sometimes have to requeue jobs in Galaxy that have finished remotely but weren't finished properly in Galaxy. This is a problem if the job directory still exists on the Pulsar side and the job is sent to the same Pulsar as it was previously. Pulsar attempts to resume staging in files but fails:

Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: 2021-09-28 17:22:41,228 INFO  [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=37992802]] Failed to execute action[Staging input 'dataset_61602712.dat' via FileAction[path=/galaxy-repl/main/files/061/602/dataset_61602712.dat,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5c6ea3db5fc7ab0c5&job_key=c0ffee&path=/galaxy-repl/main/files/061/602/dataset_61602712.dat&file_type=input] to /jetstream/scratch0/main/jobs/37992802/inputs/dataset_61602712.dat], retrying in 6.0 seconds.
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: Traceback (most recent call last):
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: return fun(*args, **kwargs)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/staging/pre.py", line 19, in <lambda>
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/action_mapper.py", line 465, in write_to_path
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: get_file(self.url, path)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/transport/curl.py", line 93, in get_file
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: c.perform()
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: pycurl.error: (33, "HTTP server doesn't seem to support byte ranges. Cannot resume.")

This may also be a symptom of a more general problem: Pulsar does not know the file length and attempts to fetch past the end of the file. In other words, Pulsar should remove any existing job directory when a new setup message is received, and it should also not attempt to resume past the end of the file when staging in (a separate issue).
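A minimal sketch of the two proposed behaviors. Both helpers are hypothetical, not part of Pulsar's actual API: Pulsar's real setup and staging logic lives in its manager and action-mapper classes. The first clears a stale job directory before staging begins; the second decides a resume offset so a transfer never tries to resume at or past the expected file size (which is what triggers pycurl error 33 against servers that don't support byte ranges).

```python
import os
import shutil


def prepare_job_directory(job_directory):
    """Hypothetical helper: remove any leftover job directory from a
    previous run of the same job id when a new setup message arrives,
    then recreate it empty so staging starts from a clean slate."""
    if os.path.exists(job_directory):
        shutil.rmtree(job_directory)
    os.makedirs(job_directory)


def resume_offset(local_path, expected_size):
    """Hypothetical helper: decide where a stage-in transfer should
    resume. Returns 0 (restart from scratch) when there is no partial
    file, when the expected size is unknown, or when the partial file
    is already at least as large as the expected size -- resuming past
    the end of the file is exactly the failure shown in the log above."""
    if expected_size is None or not os.path.exists(local_path):
        return 0
    local_size = os.path.getsize(local_path)
    if local_size >= expected_size:
        return 0
    return local_size
```

With a guard like `resume_offset`, a requeued job with a fully (or over-) staged partial file would be re-fetched from byte 0 rather than issuing an unsatisfiable range request.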

gmauro commented 2 years ago

I have a cron job deleting successful/unsuccessful job directories, but I agree a more structured approach is needed.

natefoo commented 2 years ago

Yeah, I have a cron job running tmpwatch for this, which is needed regardless.
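For reference, the cron-based cleanup described above could look something like the following `find` one-liner (an equivalent of what tmpwatch does). The staging path and the 7-day threshold are examples only; adjust them to the Pulsar instance's configured staging directory and retention policy.

```shell
# Hypothetical cron entry: purge top-level job directories that have
# not been modified in over 7 days. Path and age are examples.
find /jetstream/scratch0/main/jobs -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
```

Note this only bounds how long stale directories linger; it does not fix the requeue race, since a job can be resubmitted well before any age threshold expires.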