galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Job setup failure is not reported back to Galaxy #129

Closed: natefoo closed this issue 6 years ago

natefoo commented 7 years ago

Pulsar attempted to create the job directory on a full disk and logged the error below, after which the job stayed permanently "queued" in Galaxy. This was in AMQP mode, in case that matters:

2017-02-28 09:58:43,696 ERROR [pulsar.messaging.bind_amqp][consume-setup-amqp://main_lwr:********@galaxy03.tacc.utexas.edu:5671//main_lwr?ssl=1] Failed to setup job 15163417 obtained via message queue.
Traceback (most recent call last):
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/messaging/bind_amqp.py", line 116, in __process_setup_message
    manager_endpoint_util.submit_job(manager, body)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/manager_endpoint_util.py", line 83, in submit_job
    use_metadata
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/manager_endpoint_util.py", line 108, in setup_job
    job_id = manager.setup_job(job_id, tool_id, tool_version)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/stateful.py", line 60, in setup_job
    job_id = self._proxied_manager.setup_job(*args, **kwargs)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/base/external.py", line 31, in setup_job
    return self._setup_job_for_job_id(job_id, tool_id, tool_version)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/base/directory.py", line 48, in _setup_job_for_job_id
    self._setup_job_directory(job_id)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/base/__init__.py", line 145, in _setup_job_directory
    job_directory.setup()
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/base/__init__.py", line 278, in setup
    self._directory_maker.make(self.job_directory)
  File "/srv/pulsar/main/pulsar/venv/src/pulsar/pulsar/managers/base/__init__.py", line 397, in make
    os.mkdir(*makedir_args)
OSError: [Errno 28] No space left on device: '/jetstream/scratch0/main/jobs/15163417'

jmchilton commented 6 years ago

https://github.com/galaxyproject/pulsar/commit/783713c16058e269c8ab8f752c4ea3115dd6d7ad might have fixed this - working on a test case now to see if I can verify.

jmchilton commented 6 years ago

Ah - never mind - 783713c might help make sure job staging problems get propagated correctly, but not setup problems. Hmm...

natefoo commented 6 years ago

Confirmed: you now get the ubiquitous "Remote job server indicated a problem running or monitoring this job." message, which is a significant improvement over before.
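
For readers following the thread, here is a rough sketch of the behavior being asked for: when job setup raises (for example `OSError` with `ENOSPC` on a full disk), the handler reports a failure status instead of only logging the traceback and leaving the job "queued". This is not Pulsar's actual code or the change in 783713c; `process_setup_message`, `setup_job_directory`, `publish_status`, and the status strings are hypothetical names chosen for illustration.

```python
import errno
import os
import tempfile


def setup_job_directory(path):
    """Create the job directory; raises OSError (e.g. ENOSPC) on failure."""
    os.mkdir(path)


def process_setup_message(job_id, jobs_root, publish_status,
                          setup=setup_job_directory):
    """Hypothetical setup handler that always reports an outcome.

    publish_status stands in for whatever sends a status update back to
    Galaxy (e.g. over the message queue); it is an assumption, not a
    Pulsar API.
    """
    job_directory = os.path.join(jobs_root, str(job_id))
    try:
        setup(job_directory)
    except Exception as exc:
        # Report the failure so the job does not sit in "queued" forever.
        publish_status({"job_id": job_id, "status": "failed",
                        "message": str(exc)})
        return False
    publish_status({"job_id": job_id, "status": "ready"})
    return True


def full_disk(path):
    """Simulate the failure from the traceback: mkdir on a full device."""
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC), path)


# Failure path: the simulated ENOSPC is turned into a "failed" status.
messages = []
ok = process_setup_message(15163417, "/jobs", messages.append, setup=full_disk)

# Success path: a real directory is created under a temporary root.
success_messages = []
with tempfile.TemporaryDirectory() as root:
    created = process_setup_message(1, root, success_messages.append)
```

The point of the design is only that the `except` branch publishes something: whether the status travels over AMQP or HTTP, the caller learns the job will never run.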