An exception occurred in `SamplingPoolPbs._qstat_pbs_job`: Exception: qstat: Unknown Job Id 5445226.meta-pbs.metacentrum.cz
The call `qstat -x *self._pbs_ids` was to blame.
This error should never occur, because `self._pbs_ids` contains only job ids obtained from `qsub` runs.
I suspect it occurs because `qstat` has some kind of 'forgetfulness' about terminated jobs.
It is a very rare phenomenon, observed only during long runs (e.g. 8 hours per job) of many (e.g. 2500) simulations. Here it happened after about a day and a half of running MLMC.
Proposed workaround: remove the job ids of long-finished jobs from `self._pbs_ids`. Do this periodically; every 24 hours should be enough.
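
A minimal sketch of how the workaround could look, not the actual `SamplingPoolPbs` code. The `PbsJobTracker` class, its method names, and the `retention_seconds` parameter are all hypothetical. The sketch assumes the pool can tell when a job has finished, and it reads "long-finished" as "finished more than 24 hours ago":

```python
import time


class PbsJobTracker:
    """Hypothetical helper: tracks PBS job ids and forgets long-finished ones."""

    def __init__(self, retention_seconds=24 * 3600, clock=time.time):
        # Assumption: a job finished longer ago than this counts as "long-finished".
        self._retention = retention_seconds
        self._clock = clock
        # job id -> finish timestamp, or None while the job is still active
        self._finished_at = {}

    def add(self, job_id):
        """Register a job id returned by qsub."""
        self._finished_at.setdefault(job_id, None)

    def mark_finished(self, job_id):
        """Record the first time a known job was seen as finished."""
        if job_id in self._finished_at and self._finished_at[job_id] is None:
            self._finished_at[job_id] = self._clock()

    def prune(self):
        """Drop jobs that finished at least `retention_seconds` ago; return them."""
        cutoff = self._clock() - self._retention
        stale = [
            job_id
            for job_id, finished in self._finished_at.items()
            if finished is not None and finished <= cutoff
        ]
        for job_id in stale:
            del self._finished_at[job_id]
        return stale

    def ids_to_query(self):
        """Job ids that are still safe to pass to `qstat -x`."""
        return list(self._finished_at)
```

In this sketch, `prune()` would be called periodically (for example once per polling cycle or once a day), and only the ids from `ids_to_query()` would be passed to `qstat -x`. Injecting `clock` keeps the logic testable without waiting for real time to pass.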