GeoMop / MLMC

Unknown Job Id #158

Closed martinspetlik closed 3 years ago

martinspetlik commented 3 years ago

An exception occurred in SamplingPoolPbs._qstat_pbs_job: Exception: qstat: Unknown Job Id 5445226.meta-pbs.metacentrum.cz

The qstat -x *self._pbs_ids call was to blame. This error should never occur, because self._pbs_ids contains only job ids obtained from qsub runs.

I think the error occurs because qstat has some kind of 'forgetfulness': after a while it no longer knows about terminated jobs.

It is a very rare phenomenon, observed only during long runs (e.g. 8 hours per job) of many (e.g. 2500) simulations. Here it happened after about a day and a half of running MLMC.

Proposed workaround: remove the job ids of long-finished jobs from self._pbs_ids. Do this periodically; every 24 hours should be enough.
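A minimal sketch of how the pruning could look, assuming job ids are tracked together with the time each job was first seen as finished. The class `PbsJobIdRegistry` and all of its methods are hypothetical names made up for illustration, not part of the MLMC code base, and the 24-hour default is only the value suggested above.

```python
import time


class PbsJobIdRegistry:
    """Hypothetical helper: tracks PBS job ids and forgets long-finished ones."""

    def __init__(self, max_finished_age=24 * 3600, clock=time.monotonic):
        # Assumption: 24 hours (in seconds) is short enough that qstat
        # still knows every id we keep.
        self._max_finished_age = max_finished_age
        self._clock = clock
        self._active = set()      # ids of jobs still queued or running
        self._finished_at = {}    # id -> time the job was first seen finished

    def add(self, job_id):
        """Register an id returned by qsub."""
        self._active.add(job_id)

    def mark_finished(self, job_id):
        """Record that a job has terminated; ids not currently active are ignored."""
        if job_id in self._active:
            self._active.remove(job_id)
            self._finished_at[job_id] = self._clock()

    def prune(self):
        """Drop ids of jobs finished at least max_finished_age seconds ago."""
        now = self._clock()
        expired = [
            job_id
            for job_id, finished in self._finished_at.items()
            if now - finished >= self._max_finished_age
        ]
        for job_id in expired:
            del self._finished_at[job_id]
        return expired

    def ids_for_qstat(self):
        """Ids that are still safe to pass to `qstat -x`."""
        return sorted(self._active) + sorted(self._finished_at)
```

With something like this, `prune()` could be called before each `qstat -x` invocation (or once per period), so that only active and recently finished ids are ever passed to qstat.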