galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Restarting Pulsar is not safe #340

Closed natefoo closed 11 months ago

natefoo commented 1 year ago

It breaks jobs that were in the process of staging in. On restart, Pulsar does try to resume staging in (that's good!), but the resume fails because it doesn't understand files that have already been fully transferred, e.g.:

2023-11-02 17:54:16,796 DEBUG [pulsar.managers.staging.pre][[manager=jetstream2]-[action=preprocess]-[job=53412028]] Staging jobdir 'tool_script.sh' via FileAction[path=/corral4/main/jobs/053/412/53412028/tool_script.sh,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F412%2F53412028%2Ftool_script.sh&file_type=jobdir] to /jetstream2/scratch/main/jobs/53412028/tool_script.sh
2023-11-02 17:54:16,821 INFO  [pulsar.client.transport.curl][[manager=jetstream2]-[action=preprocess]-[job=53412028]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F412%2F53412028%2Ftool_script.sh&file_type=jobdir will resume at 1789 bytes
Nov 02 17:54:16 jetstream2.galaxyproject.org pulsar[1233136]: 2023-11-02 17:54:16,826 INFO  [pulsar.managers.util.retry][[manager=jetstream2]-[action=preprocess]-[job=53408428]] Failed to execute action[Staging jobdir 'tool_script.sh' via FileAction[path=/corral4/main/jobs/053/408/53408428/tool_script.sh,action_type=remote_transfer,url=https://galaxy-web-03.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F408%2F53408428%2Ftool_script.sh&file_type=jobdir] to /jetstream2/scratch/main/jobs/53408428/tool_script.sh], retrying in 4.0 seconds.
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/pre.py", line 20, in <lambda>
    action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 479, in write_to_path
    get_file(self.url, path)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/curl.py", line 99, in get_file
    raise Exception(message)
Exception: Failed to get_file properly for url https://galaxy-web-03.galaxyproject.org/_job_files?job_id=beef&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F408%2F53408428%2Ftool_script.sh&file_type=jobdir, remote server returned status code of 416.

mvdbeek commented 1 year ago

it doesn't understand files that have already been fully transferred

Pulsar shouldn't have to know that. Doesn't the 416 response from Galaxy indicate that the job isn't active anymore?

natefoo commented 1 year ago

No, I don't think so. When we spoke about these errors before, I was confusing them with Pulsar trying to stage out data for jobs that Galaxy already considers terminal (e.g. due to user deletion). The 416 here, I believe, is because Pulsar sets the offset of completed stage-in files to EOF+1 and then requests that range from Galaxy (nginx x-accel-redirect), which returns 416 since it can't seek beyond the end of the file.
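
To illustrate the failure mode described above, here is a minimal sketch of how a resuming client could tell "this file is already complete" apart from a real error. It is not the code in Pulsar's curl transport, nor the change made in the linked fix; the function names (`range_header`, `unsatisfied_total`, `classify_response`) and the outcome labels are hypothetical. It assumes the server follows the HTTP convention of sending `Content-Range: bytes */<length>` with a 416 response, which a given Galaxy/nginx setup may or may not do.

```python
import re
from typing import Optional


def range_header(offset: int) -> Optional[str]:
    """Build a Range header value for resuming at `offset` bytes.

    Returns None for a fresh download (nothing on disk yet).
    """
    return f"bytes={offset}-" if offset > 0 else None


def unsatisfied_total(content_range: Optional[str]) -> Optional[int]:
    """Extract N from a 416 response's 'Content-Range: bytes */N' value."""
    match = re.fullmatch(r"bytes \*/(\d+)", (content_range or "").strip())
    return int(match.group(1)) if match else None


def classify_response(status: int, offset: int,
                      content_range: Optional[str] = None) -> str:
    """Map an HTTP status to a (hypothetical) staging outcome."""
    if status == 206:
        return "append"      # server honoured the range; append to the partial file
    if status == 200:
        return "restart"     # server ignored the range and sent the whole file
    if status == 416 and offset > 0:
        # The requested start is at or past the end of the remote file.
        # If the remote length equals what is already on disk, the earlier
        # transfer finished and there is nothing left to fetch.
        if unsatisfied_total(content_range) == offset:
            return "complete"
        # Length unknown or mismatched: assume the local copy is unusable
        # and download again from byte 0.
        return "restart"
    return "error"
```

With the numbers from the log above, a resume at 1789 bytes against a 1789-byte file would send `Range: bytes=1789-`, and `classify_response(416, 1789, "bytes */1789")` yields `"complete"` instead of raising. Falling back to `"restart"` when the length is missing is a deliberately conservative choice: it costs a re-download but avoids treating a truncated or changed file as finished.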

mvdbeek commented 1 year ago

Thanks, I was wondering where that is coming from.

mvdbeek commented 11 months ago

Fixed in https://github.com/galaxyproject/pulsar/pull/348