galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0
37 stars 50 forks source link

Postprocessing POST returns 403 with nothing logged other than the request #342

Open natefoo opened 1 year ago

natefoo commented 1 year ago

I had previously thought this was always Pulsar trying to staging out the outputs of jobs that the user cancelled (thus since the job was terminal, Galaxy would refuse the request). This seems to not always be the case. For example, this job:

galaxy_main=> select * from job_state_history where job_id = 53718577 order by id;
    id     |        create_time         |  job_id  |  state  | info
-----------+----------------------------+----------+---------+------
 196605912 | 2023-11-16 10:56:26.502274 | 53718577 | new     |
 196606003 | 2023-11-16 10:57:20.063809 | 53718577 | queued  |
 196606007 | 2023-11-16 10:57:21.391202 | 53718577 | queued  |
 196606066 | 2023-11-16 10:57:47.248204 | 53718577 | running |
 196607658 | 2023-11-16 11:11:08.48592  | 53718577 | error   |
(5 rows)

This job preprocesses fine, submits to the cluster, is running at 10:57:47, finishes some time just before 11:03:01, when it starts trying to POST the output:

2023-11-16 11:03:27,388 INFO  [pulsar.managers.util.retry][[manager=jetstream2]-[action=postprocess]-[job=53718577]] Failed to execute staging out file /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat via FileAction[path=/corral4/main/objects/5/5/5/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fobjects%2F5%2F5%2F5%2Fdataset_55578e9b-3
cce-4afa-aee2-124d282668d3.dat&file_type=output], retrying in 2.0 seconds.
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
    self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 482, in write_from_path
    post_file(self.url, pulsar_path)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/curl.py", line 77, in post_file
    raise Exception(message)
Exception: Failed to post_file properly for url https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fobjects%2F5%2F5%2F5%2Fdataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat&file_type=output, remote server returned status code of 403.

The file exists:

[rocky@jetstream2 ~]$ ls -lh /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat
-rw-r--r--. 1 g2main G-803372 1.3G Nov 16 11:02 /jetstream2/scratch/main/jobs/53718577/outputs/dataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat

This retries and fails until giving up at 11:11:02 and moving on to other outputs, which POST fine:

2023-11-16 11:11:02,505 INFO  [pulsar.client.staging.down][[manager=jetstream2]-[action=postprocess]-[job=53718577]] collecting output COMMAND_VERSION with action FileAction[path=/corral4/main/jobs/053/718/53718577/outputs/COMMAND_VERSION,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fjobs%2F053%2F718%2F53718577%2Foutputs%2FCOMMAND_VERSION&file_type]

The only thing logged on Galaxy's side are the requests from uvicorn:

uvicorn.access INFO 2023-11-16 05:03:27,370 [pN:main.1,p:22463,tN:MainThread] 149.165.159.100:0 - "POST /api/jobs/bbd44e69cb8906b5f0413ba0976e0384/files?job_id=bbd44e69cb8906b5f0413ba0976e0384&job_key=c0ffee&path=%2Fcorral4%2Fmain%2Fobjects%2F5%2F5%2F5%2Fdataset_55578e9b-3cce-4afa-aee2-124d282668d3.dat&file_type=output HTTP/1.0" 403

So there is no indication of what the actual problem is.