galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Don't immediately resume ``_monitor_active_jobs`` on exception #368

Closed by mvdbeek 1 month ago

mvdbeek commented 1 month ago

Got a bunch of these on rockfish, and I don't think we're helping ourselves by retrying ``os.listdir`` roughly every 5 ms while the system is already out of file handles (``Errno 23``):

2024-06-11 12:42:09,485 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
2024-06-11 12:42:09,489 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
2024-06-11 12:42:09,494 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
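
For illustration, the idea in the title (pause before resuming the monitor step after a failure) could look roughly like the sketch below. This is not the patch in this PR and not Pulsar's actual code: ``MonitorLoop``, ``next_delay``, ``interval`` and ``max_backoff`` are hypothetical names, and the exponential backoff is just one possible policy. A fixed sleep after each exception would also stop the 5 ms retry storm.

```python
import logging
import threading

log = logging.getLogger(__name__)


class MonitorLoop:
    """Hypothetical stand-in for a monitor thread's run loop (not Pulsar's class)."""

    def __init__(self, monitor_step, interval=1.0, max_backoff=30.0):
        self._monitor_step = monitor_step  # callable doing one monitoring pass
        self._interval = interval          # assumed normal delay between passes (seconds)
        self._max_backoff = max_backoff    # assumed upper bound on the delay (seconds)
        self._stop = threading.Event()

    def stop(self):
        self._stop.set()

    def next_delay(self, consecutive_failures):
        # No failures: normal cadence. Otherwise double the delay per
        # consecutive failure, capped at max_backoff. The exponent is
        # clamped so the power never grows without bound.
        if consecutive_failures == 0:
            return self._interval
        exponent = min(consecutive_failures, 10)
        return min(self._interval * 2 ** exponent, self._max_backoff)

    def run(self):
        failures = 0
        while not self._stop.is_set():
            try:
                self._monitor_step()
                failures = 0
            except Exception:
                failures += 1
                log.exception("Failure in monitor step.")
            # Event.wait acts as a sleep that stop() can interrupt.
            self._stop.wait(self.next_delay(failures))
```

With the assumed defaults, a persistent ``OSError`` would be retried after 2 s, 4 s, 8 s, 16 s and then every 30 s, instead of several hundred times per second, which also keeps the log readable while the underlying file-handle exhaustion is sorted out.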