I had previously thought this was always Pulsar trying to staging out the outputs of jobs that the user cancelled (thus since the job was terminal, Galaxy would refuse the request). This seems to not always be the case. For example, this job:
galaxy_main=> select * from job_state_history where job_id = 53718577 order by id;
id | create_time | job_id | state | info
-----------+----------------------------+----------+---------+------
196605912 | 2023-11-16 10:56:26.502274 | 53718577 | new |
196606003 | 2023-11-16 10:57:20.063809 | 53718577 | queued |
196606007 | 2023-11-16 10:57:21.391202 | 53718577 | queued |
196606066 | 2023-11-16 10:57:47.248204 | 53718577 | running |
196607658 | 2023-11-16 11:11:08.48592 | 53718577 | error |
(5 rows)
This job preprocesses fine, submits to the cluster, is running at 10:57:47, finishes some time just before 11:03:01, when it starts trying to POST the output:
2023-11-16 11:03:27,388 INFO [pulsar.managers.util.retry][[manager=jetstream2]-[action=postprocess]-[job=53718577]] Failed to execute staging out file /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat via FileAction[path=/corral4/main/objects/5/5/5/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fobjects%2F5%2F5%2F5%2Fdataset_55578e9b-3
cce-4afa-aee2-124d282668d3.dat&file_type=output], retrying in 2.0 seconds.
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
return fun(*args, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 482, in write_from_path
post_file(self.url, pulsar_path)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/curl.py", line 77, in post_file
raise Exception(message)
Exception: Failed to post_file properly for url https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fobjects%2F5%2F5%2F5%2Fdataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat&file_type=output, remote server returned status code of 403.
The file exists:
[rocky@jetstream2 ~]$ ls -lh /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat
-rw-r--r--. 1 g2main G-803372 1.3G Nov 16 11:02 /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat
This retries and fails until giving up at 11:11:02 and moving on to other outputs, which POST fine:
2023-11-16 11:11:02,505 INFO [pulsar.client.staging.down][[manager=jetstream2]-[action=postprocess]-[job=53718577]] collecting output COMMAND_VERSION with action FileAction[path=/corral4/main/jobs/053/718/53718577/outputs/COMMAND_VERSION,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F718%2F53718577%2Foutputs%2FCOMMAND_VERSION&file_type]
The only thing logged on Galaxy's side are the requests from uvicorn:
I had previously thought this was always Pulsar trying to staging out the outputs of jobs that the user cancelled (thus since the job was terminal, Galaxy would refuse the request). This seems to not always be the case. For example, this job:
This job preprocesses fine, submits to the cluster, is running at 10:57:47, finishes some time just before 11:03:01, when it starts trying to POST the output:
The file exists:
This retries and fails until giving up at 11:11:02 and moving on to other outputs, which POST fine:
The only thing logged on Galaxy's side are the requests from uvicorn:
So there is no indication of what the actual problem is.