galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Job recovery failure causes Pulsar start failure if launch_config file does not exist #300

Closed by natefoo 1 year ago

natefoo commented 1 year ago

On startup:

Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/stateful.py", line 263, in recover_active_jobs
    recover_method(job_id)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/base/external.py", line 66, in _recover_active_job
    raise Exception("Could not determine external ID for job_id [%s]" % job_id)
Exception: Could not determine external ID for job_id [45531995]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/main.py", line 161, in _app
    log=log,
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/main.py", line 132, in load_pulsar_app
    pulsar_app = pulsar.core.PulsarApp(**config)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/core.py", line 58, in __init__
    self.__recover_jobs()
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/core.py", line 106, in __recover_jobs
    manager.recover_active_jobs()
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/stateful.py", line 266, in recover_active_jobs
    self.__handle_recovery_problem(job_id)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/stateful.py", line 271, in __handle_recovery_problem
    self.__state_change_callback(status.LOST, job_id)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/messaging/bind_amqp.py", line 69, in bind_on_status_change
    payload = manager_endpoint_util.full_status(manager, new_status, job_id)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/manager_endpoint_util.py", line 24, in full_status
    full_status = __job_complete_dict(job_status, manager, job_id)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/manager_endpoint_util.py", line 56, in __job_complete_dict
    realized_dynamic_file_sources=realized_dynamic_file_sources(job_directory)
  File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/staging/post.py", line 45, in realized_dynamic_file_sources
    dynamic_file_sources = launch_config.get("dynamic_file_sources")
AttributeError: 'NoneType' object has no attribute 'get'

In this case, the job is listed in active-jobs but has no job directory. The failure occurred after a crash today, but the job was created 6 days ago, and its job directory was probably cleaned up recently by tmpwatch. The job must have failed to complete, but unfortunately I can no longer tell why.
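
For illustration, here is a minimal sketch of one way the final frame of the traceback could tolerate a missing launch config. It is not Pulsar's actual code. It assumes the launch config is a JSON file inside the job directory; the file name `launch_config`, the constant `LAUNCH_CONFIG_NAME`, and the helper `load_launch_config` are hypothetical. Only the function name `realized_dynamic_file_sources` and the `"dynamic_file_sources"` key come from the traceback.

```python
import json
import os

# Hypothetical file name, used here only for illustration.
LAUNCH_CONFIG_NAME = "launch_config"


def load_launch_config(job_directory):
    """Return the parsed launch config, or None if it no longer exists.

    open() raises FileNotFoundError both when the file is missing and when
    the whole job directory has been removed (e.g. by tmpwatch).
    """
    path = os.path.join(job_directory, LAUNCH_CONFIG_NAME)
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


def realized_dynamic_file_sources(job_directory):
    """Return the recorded dynamic file sources, or [] if none are available."""
    launch_config = load_launch_config(job_directory)
    if launch_config is None:
        # Nothing was recorded (or it was cleaned up), so report no sources
        # instead of calling .get() on None.
        return []
    return launch_config.get("dynamic_file_sources") or []
```

With a guard like this, reporting a job as lost would no longer raise `AttributeError` when the job directory is gone, so one stale entry in active-jobs would not by itself abort startup.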