arminbiere / runlim


need better group and session id handling #8

Open arminbiere opened 5 months ago

arminbiere commented 5 months ago

For some runs on a busy cluster it seems that, on rare occasions, process identifiers wrap around between the group or session pid and the child pid, so that the child pid is smaller than either the session or the group identifier:

../6s196.err:runlim error: group pid 4190248 larger than child pid 4251
../6s1.err:runlim error: session pid 4180036 larger than child pid 10322
../beembrptwo1b2.err:runlim error: session pid 4180065 larger than child pid 10321
../beembrptwo4b1.err:runlim error: group pid 4180039 larger than child pid 4369

This breaks some logic in runlim that determines whether a process is in the same process group, which in turn has been used since the last major update of runlim to kill zombie processes more reliably. So this is tricky business to get right anyhow. I document it here but have no idea how to fix it yet.
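As an illustration of the point above, here is a minimal sketch (not runlim's code, and in Python rather than C) of a membership test that compares identifiers only for equality, so it does not care whether the child pid is numerically smaller than the group or session pid. It assumes the Linux `/proc/<pid>/stat` layout, where the fields after the parenthesized command name are state, ppid, pgrp and session. The function names `parse_stat` and `belongs_to` are hypothetical.

```python
def parse_stat(line):
    """Return (pid, ppid, pgrp, session) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split at the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid = int(head.split("(", 1)[0])
    fields = tail.split()
    # fields[0] is the state letter, followed by ppid, pgrp, session.
    return pid, int(fields[1]), int(fields[2]), int(fields[3])


def belongs_to(line, group, session):
    """True if the process is in the given group or session.

    Only equality is used; no ordering between pids is assumed,
    so a wrapped-around (smaller) child pid is handled the same way.
    """
    _, _, pgrp, sid = parse_stat(line)
    return pgrp == group or sid == session
```

With the numbers from the first error message, a child with pid 4251 whose stat line reports pgrp 4190248 would still be recognized as a member of group 4190248, even though its pid is smaller.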

BenKaufmann commented 4 months ago

Just as an FYI (I haven't looked into the code yet): we want to switch to runlim but unfortunately run into this issue quite regularly on our cluster.

arminbiere commented 4 months ago

Thanks for the heads-up. I have a partial fix in 41779654a0cfc74aa1c1a360305e732ef036293d (on the master branch), which I have used over the last few weeks for single-process workloads, where you can also specify --single. With this option runlim samples only the (assumed) single child process. If you have more than one process, however, '--single' does not make sense. On the other hand, --single is cheaper in terms of sampling effort, which was the original reason for having it: it does not require traversing all the processes in the /proc file system. So if you can assume that there is only one child process, it is better to use it anyway.
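To make the cost difference concrete, here is a rough sketch (again Python, not runlim's implementation) contrasting the two sampling strategies. It assumes memory is read from the `VmRSS:` line of `/proc/<pid>/status`; which files runlim actually reads is not stated in this thread. The names `read_rss_kb`, `sample_single` and `sample_all`, as well as the `proc_root` and `wanted` parameters, are hypothetical.

```python
import os


def read_rss_kb(proc_root, pid):
    """Return VmRSS in kB from <proc_root>/<pid>/status, or 0 if unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "status")) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        # The process may have exited between listing and reading.
        pass
    return 0


def sample_single(proc_root, child):
    """Single-process mode: one file read, no directory traversal."""
    return read_rss_kb(proc_root, child)


def sample_all(proc_root, wanted):
    """Multi-process mode: visit every numeric entry under proc_root.

    `wanted` is a caller-supplied predicate deciding whether a pid
    belongs to the monitored run (for example a group/session check).
    """
    total = 0
    for name in os.listdir(proc_root):
        if name.isdigit() and wanted(int(name)):
            total += read_rss_kb(proc_root, int(name))
    return total
```

The single-process variant touches one file per sample, while the traversal touches one directory entry per process on the machine, which matters on a busy cluster node.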