Open cms21 opened 1 year ago
Bump this in priority
Another user has encountered this on Polaris. The proposed solution that has been discussed was to parse the message that qstat returns for these jobs. It looks like this:
(2022-09-08/multirl) csimpson@polaris-login-02:~> qstat -f -x 456714
qstat: Unknown Job Id 456714.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
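As a rough illustration of what parsing that message could look like, here is a minimal sketch. The pattern is based only on the qstat output quoted above, and the helper name `find_unknown_job_id` is hypothetical, not part of Balsam. The wording of the message may differ between PBS versions.

```python
import re
from typing import Optional

# Pattern derived from the qstat message quoted above (assumption: the wording
# is stable for this PBS version; other versions may phrase it differently).
UNKNOWN_JOB_RE = re.compile(r"qstat: Unknown Job Id (?P<job_id>\S+)")


def find_unknown_job_id(qstat_output: str) -> Optional[str]:
    """Return the job id that qstat reported as unknown, or None if absent."""
    match = UNKNOWN_JOB_RE.search(qstat_output)
    return match.group("job_id") if match else None
```

Returning the matched id rather than a bare boolean makes it possible to confirm that the message refers to the job that was actually queried.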
However, the solution I gave to the user was to hack site/service/scheduler.py, which could also serve as a fix. The proposed change would be to replace https://github.com/argonne-lcf/balsam/blob/bfbf83355106c5c1a970c40f0317d2f1eeba6de2/balsam/site/service/scheduler.py#L157 with this:
try:
    job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
except Exception:
    # Log and skip this job instead of letting the exception crash the service.
    logger.exception(f"Job {job.scheduler_id} not found by scheduler")
    continue
PR #345 fixes part of this issue. When Balsam queries PBS with qstat, it will check if Unknown Job Id is part of the returned message in the case of a non-zero return code. If this happens, the state is changed to submit_failed.
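The behavior described here can be sketched roughly as follows. This is not the code from the PR: `SchedulerNonZeroReturnCode` is a local stand-in for the exception named in the traceback, and `poll_job_state`, `run_qstat`, and the `"known"` return value are hypothetical names used only for illustration.

```python
from typing import Callable


class SchedulerNonZeroReturnCode(Exception):
    """Local stand-in for the Balsam exception of the same name (not imported)."""


def poll_job_state(run_qstat: Callable[[str], str], scheduler_id: str) -> str:
    """Query the scheduler and map an 'Unknown Job Id' failure to submit_failed.

    run_qstat is a hypothetical callable that returns qstat's stdout and raises
    SchedulerNonZeroReturnCode (carrying the output) on a non-zero return code.
    """
    try:
        run_qstat(scheduler_id)
    except SchedulerNonZeroReturnCode as exc:
        if "Unknown Job Id" in str(exc):
            return "submit_failed"
        # Any other scheduler failure is unexpected here, so re-raise it.
        raise
    # Placeholder: the real code would go on to parse the job's actual state.
    return "known"
```

Only the specific "Unknown Job Id" failure is translated into a state change; other non-zero return codes still propagate, so unrelated scheduler problems are not silently hidden.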
This will not handle the situation where a Balsam site has been inactive for longer than 2 weeks and was unable to get information on the finished batch job before PBS purged the record. In this case further development is needed, and this PR will erroneously change the job's state to submit_failed. However, a user can fix the state of the batch job by hand. It's unclear how common this second case is, but it should be addressed.
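One possible direction for that second case, sketched under stated assumptions, is to look at how long the site has gone without a successful scheduler poll before deciding what a missing job means. Nothing below exists in Balsam: `classify_missing_job`, `PURGE_WINDOW`, and the `"needs_review"` label are hypothetical, and the two-week window is taken only from the figure mentioned above.

```python
from datetime import datetime, timedelta

# Assumed PBS record retention, based on the "2 weeks" mentioned above.
PURGE_WINDOW = timedelta(weeks=2)


def classify_missing_job(last_successful_poll: datetime, now: datetime) -> str:
    """Guess why qstat no longer knows a job, based on how long the site was idle.

    If the site has not polled the scheduler for longer than the purge window,
    the job may have finished and been purged, so flag it for manual review
    rather than marking it as a failed submission.
    """
    if now - last_successful_poll > PURGE_WINDOW:
        return "needs_review"
    return "submit_failed"
```

This is only a heuristic: it cannot prove that a job ran, it just avoids asserting submit_failed when the record could plausibly have been purged.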
This is an issue seen on Polaris with PBS Pro. Jobs that are submitted to the prod queue are routed to the small, medium, and large queues. If something about that routing fails, the job disappears from PBS's history, even though the original qsub command succeeded. Balsam therefore assumes the batch job is queued and tries to look it up with qstat, but qstat fails. This causes an uncaught exception that crashes the site. Sample error below.
2023-02-13 04:31:20.411 | 167662 | ERROR | balsam:120] Uncaught Exception <class 'balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode'>: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
    "timestamp":1676262680,
    "pbs_version":"2022.1.1.20220926110806",
    "pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}
Traceback (most recent call last):
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/util/process.py", line 17, in run
    self._run()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/service_base.py", line 23, in _run
    self.run_cycle()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/scheduler.py", line 154, in run_cycle
    job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 163, in parse_logs
    log_data = cls._parse_logs(scheduler_id, job_script_path)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/pbs_sched.py", line 300, in _parse_logs
    stdout = scheduler_subproc(args)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 37, in scheduler_subproc
    raise SchedulerNonZeroReturnCode(p.stdout)
balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
    "timestamp":1676262680,
    "pbs_version":"2022.1.1.20220926110806",
    "pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}