NordicHPC / sonar

Tool to profile usage of HPC resources by regularly probing processes using ps.
GNU General Public License v3.0

Get rid of `--batchless` option #182

Open lars-t-hansen opened 2 weeks ago

lars-t-hansen commented 2 weeks ago

On slurm systems, a node is sometimes taken out of the slurm pool and turned into an interactive node; this seems to happen with some regularity on Fox. The admins are unlikely to remember to change the options to sonar in this case, or to change them back when the node goes back into the pool: non-slurm nodes need the --batchless option, while slurm nodes run with --rollup. This turns into a mess on the backend, where we get inappropriate job IDs for all the jobs from an interactive node that sonar monitored without --batchless (https://github.com/NAICNO/Jobanalyzer/issues/534).

Probably we should try to get rid of --batchless and just infer it.

On a slurm system, we will be able to find the slurm job ID in /proc/pid/cgroup (we should be a little more fastidious about what we match when we search that file). If the node is interactive, those data should not be there, and in that case, we should be able to fall back on using the process group, as we do for non-slurm systems.
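As a rough illustration of what a more fastidious match could look like, here is a minimal Python sketch. It is not sonar's code. The function name `slurm_job_id_from_cgroup` is hypothetical, and the path layouts it accepts (a `job_<digits>` path component inside a cgroup path that also mentions `slurm`) are assumptions about how slurm names its cgroups, so they would need checking against real nodes. The only part taken as given is the documented `hierarchy-ID:controller-list:cgroup-path` line format of /proc/pid/cgroup.

```python
import re

# Assumption: slurm cgroup paths contain a component of the form "job_<digits>".
# Requiring "/" or end-of-string after the digits rejects things like "job_12x".
_SLURM_JOB_RE = re.compile(r"/job_(\d+)(?:/|$)")


def slurm_job_id_from_cgroup(text: str) -> int:
    """Return the slurm job ID found in the contents of /proc/<pid>/cgroup, or 0."""
    for line in text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        path = parts[2]
        # Assumption: a slurm-managed cgroup path mentions "slurm" somewhere.
        if "slurm" not in path:
            continue
        m = _SLURM_JOB_RE.search(path)
        if m:
            return int(m.group(1))
    # No slurm job ID found: the process is not (visibly) part of a slurm job.
    return 0
```

Returning 0 for "no slurm ID" mirrors the convention described in this issue, so the caller can decide whether to fall back on the process group.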

One real complication is that a lot of processes on a slurm system will not have a slurm ID, because they are not slurm jobs, just normal processes. In that case, we currently use 0 for the job ID, as we must; no PID will be safe against confusion with the slurm job ID. This is unlike on batchless systems where we always use the pgrp except in corner cases (where we use 0). On a given system we will therefore need to compute slurm and pgrp job IDs for all the processes, and if we find a nonzero slurm ID for any process then we use the slurm IDs (and zeroes), and otherwise we use the pgrp IDs.
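The selection rule described above can be sketched as follows. This is only an illustration in Python under stated assumptions: the `Proc` record and the `assign_job_ids` function are hypothetical names, the slurm ID is assumed to have been extracted already (0 meaning "none found"), and the pgrp corner cases that map to 0 are not modelled.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Proc(NamedTuple):
    """Hypothetical per-process record; both candidate job IDs are computed up front."""
    pid: int
    pgrp: int      # process group ID, the job ID used on batchless systems
    slurm_id: int  # slurm job ID from the cgroup data, or 0 if none was found


def assign_job_ids(procs: Iterable[Proc]) -> Tuple[Dict[int, int], bool]:
    """Map each PID to a job ID and report whether slurm IDs were used.

    If any process has a nonzero slurm ID, the node is treated as a slurm node:
    slurm IDs are used, and processes without one get 0. Otherwise the node is
    treated as interactive and every process gets its pgrp.
    """
    procs = list(procs)
    use_slurm = any(p.slurm_id != 0 for p in procs)
    job_ids = {p.pid: (p.slurm_id if use_slurm else p.pgrp) for p in procs}
    return job_ids, use_slurm


# Slurm node: one system process without a slurm ID, two processes in job 4242.
slurm_node = [Proc(1, 1, 0), Proc(200, 200, 4242), Proc(201, 200, 4242)]
# Interactive node: no process has a slurm ID, so pgrp is used throughout.
interactive_node = [Proc(1, 1, 0), Proc(300, 300, 0), Proc(301, 300, 0)]
```

The returned flag is there because later decisions, such as whether a rollup makes sense, depend on whether the IDs are real batch job IDs.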

This will require a bit of surgery: the "jobmanager" abstraction will probably disappear, and the data structures and printing will be embellished. We should be sure to leave copious comments.

Now about --rollup. If present, it should only have effect if we have Slurm job IDs (or other batch system IDs I guess). This would require a bit more surgery still.

lars-t-hansen commented 1 week ago

Another couple of complications.

First, on a node on a slurm system that sometimes runs slurm, and sometimes not, the slurm IDs will usually not be reused (by policy), but the non-slurm IDs will, and the non-slurm IDs may sometimes match valid slurm IDs from the same node at a different point in time - so it will be as if the IDs were reused after all.

Second, and worse, a non-slurm node using the pgrp for the job ID may accidentally match the slurm job ID from a slurm node at a time when both jobs are running. This isn't really a problem on the node side but it could be a problem on the back-end, which will almost certainly see the two jobs as part of the same job.

None of these problems are new if we infer --batchless from context; we already have them. In particular, I think the problem of two non-slurm nodes using the same pgrp ID for two unconnected jobs at the same time has been noted before.