galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Jobs take a long time to complete if all expected files are not created #327


martinwolst commented 1 year ago

I run a Pulsar setup where the Galaxy server and the Pulsar end cannot see the same data storage locations, so all inputs and outputs must be transferred between them.

When I run shed-tools test for:

install_tool_dependencies: True
install_repository_dependencies: True
install_resolver_dependencies: True

tools:
- name: picard
  owner: devteam
  revisions:
  - 585027e65f3b
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

The test 'toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_artifact_metrics/2.18.2.1-1' fails due to an exception. I think this failure is intended, but I'm not sure and haven't been able to find the source for this specific test.

Because the tool fails at runtime, it does not create the full set of output files that Pulsar expects to post back to Galaxy. So once the tool has failed, Pulsar gets stuck in its file-post retry loop:

[pulsar.managers.queued_drmaa][[manager=_default_]-[action=preprocess]-[job=6480]] Submitted DRMAA job with Pulsar job id 6480 and external id 225017
Jun 27 11:20:05 REDACTED pulsar[13989]: 2023-06-27 11:20:05,225 INFO  [pulsar.client.staging.down][[manager=_default_]-[action=postprocess]-[job=6480]] collecting output OutPut.pre_adapter_detail_metrics with action FileAction[path=/REDACTED/4/f/4/dataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat,action_type=remote_transfer,url=https://REDACTED/api/jobs/REDACTED/files?job_key=REDACTED&path=%2FREDACTED%2F4%2Ff%2F4%2Fdataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat&file_type=output_workdir]
Jun 27 11:20:05 REDACTED pulsar[13989]: 2023-06-27 11:20:05,226 INFO  [pulsar.managers.util.retry][[manager=_default_]-[action=postprocess]-[job=6480]] Failed to execute staging out file /REDACTED/pulsar/files/staging/6480/working/OutPut.pre_adapter_detail_metrics via FileAction[path=/REDACTED/4/f/4/dataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat,action_type=remote_transfer,url=https://REDACTED/api/jobs/REDACTED/files?job_key=REDACTED&path=%2FREDACTED%2F4%2Ff%2F4%2Fdataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat&file_type=output_workdir], retrying in 8.0 seconds.
Jun 27 11:20:05 REDACTED pulsar[13989]: Traceback (most recent call last):
Jun 27 11:20:05 REDACTED pulsar[13989]: File "/REDACTED/pulsar/venv/lib/python3.6/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
Jun 27 11:20:05 REDACTED pulsar[13989]: return fun(*args, **kwargs)
Jun 27 11:20:05 REDACTED pulsar[13989]: File "/REDACTED/pulsar/venv/lib/python3.6/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
Jun 27 11:20:05 REDACTED pulsar[13989]: self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
Jun 27 11:20:05 REDACTED pulsar[13989]: File "/REDACTED/pulsar/venv/lib/python3.6/site-packages/pulsar/client/action_mapper.py", line 482, in write_from_path
Jun 27 11:20:05 REDACTED pulsar[13989]: post_file(self.url, pulsar_path)
Jun 27 11:20:05 REDACTED pulsar[13989]: File "/REDACTED/pulsar/venv/lib/python3.6/site-packages/pulsar/client/transport/curl.py", line 70, in post_file
Jun 27 11:20:05 REDACTED pulsar[13989]: raise Exception(message)
Jun 27 11:20:05 REDACTED pulsar[13989]: Exception: Attempt to post file /REDACTED/pulsar/files/staging/6480/working/OutPut.pre_adapter_detail_metrics to URL https://REDACTED/api/jobs/REDACTED/files?job_key=REDACTED&path=%2FREDACTED%2F4%2Ff%2F4%2Fdataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat&file_type=output_workdir, but file does not exist.
Jun 27 11:20:13 REDACTED pulsar[13989]: 2023-06-27 11:20:13,236 INFO  [pulsar.managers.util.retry][[manager=_default_]-[action=postprocess]-[job=6480]] Failed to execute staging out file /REDACTED/pulsar/files/staging/6480/working/OutPut.pre_adapter_detail_metrics via FileAction[path=/REDACTED/4/f/4/dataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat,action_type=remote_transfer,url=https://REDACTED/api/jobs/REDACTED/files?job_key=REDACTED&path=%2FREDACTED%2F4%2Ff%2F4%2Fdataset_4f4b8b9c-65c2-4fdb-86f9-7caf74070f6b.dat&file_type=output_workdir], retrying in 16.0 seconds.

and so on for ~15 minutes per missing file under my retry settings, which I don't want to change because they cover for genuine network errors.
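For context, the backoff visible in the log (8 s, then 16 s, ...) compounds quickly. A rough back-of-the-envelope sketch of how a doubling retry interval adds up to the ~15 minutes per missing file described above; the start value, factor, and retry count here are assumptions inferred from the log, not Pulsar's actual configuration defaults:

    # Sketch: cumulative sleep time for an exponentially growing retry interval.
    # The defaults below (8 s start, doubling, 7 retries) are inferred from the
    # log lines above, not taken from Pulsar's configuration.

    def total_wait(start=8.0, factor=2.0, retries=7):
        """Sum the sleep time across `retries` attempts with exponential backoff."""
        total, interval = 0.0, start
        for _ in range(retries):
            total += interval
            interval *= factor
        return total

    if __name__ == "__main__":
        # 8 + 16 + 32 + 64 + 128 + 256 + 512 = 1016 s, i.e. roughly 15-17 minutes
        # burned per output file that was never created.
        print(total_wait())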

Impacts
This defeats the 'fail fast' principle of testing and wastes admin time. I feel this is the main downside of the behaviour, because testing relies on quick feedback.

In production it means that users spend time waiting for jobs they think are still running but that have actually failed. This has a moderate impact, most noticeable during training sessions, where jobs are expected to finish quickly.

Suggested solution
If Pulsar fails to post a file back because the expected file does not exist, as opposed to the connection dropping, it should not retry. It should just return the files that do exist (stderr and stdout always seem to exist in that situation) and ignore the missing ones.
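A minimal sketch of the idea (the helper below is hypothetical and not Pulsar's actual staging API; only action.write_from_path and action_executor.execute are taken from the traceback above): distinguish "file was never produced" from transient transport errors before handing the post to the retry executor, so missing outputs are logged once and skipped while genuine network failures still get retried.

    import os
    import logging

    log = logging.getLogger(__name__)

    def stage_out_file(pulsar_path, action, action_executor, description):
        """Hypothetical staging-out helper: skip retries for outputs that
        simply do not exist, and retry only genuinely transient failures."""
        if not os.path.exists(pulsar_path):
            # The output was never produced (e.g. the tool failed), so no
            # number of retries will make it appear - report it and move on.
            log.warning("Expected output %s does not exist; skipping staging out.", pulsar_path)
            return False

        # The file exists, so transport errors (dropped connections, timeouts)
        # are still worth retrying via the existing retry executor.
        action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
        return True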