NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Continued issues with CPU > 100% (batchless only) #77

Open lars-t-hansen opened 10 months ago

lars-t-hansen commented 10 months ago

I noticed a couple of spikes well above 100% CPU in the plots for ml3 around 9/16 and 9/18. Digging in, these come from jobs that comprise clusters of many different processes, in this case for compilation:

$ ./sonalyze jobs --host ml3 -f4d --data-path data -u-
jobm     user      duration  host  cpu-avg  cpu-peak  mem-avg  mem-peak  gpu-avg  gpu-peak  gpumem-avg  gpumem-peak  cmd                                               
781733   karths    0d 0h15m  ml3   14       22        1        1         0        0         0           0            conda                                             
986189   sherinsu  0d 1h55m  ml3   849      6428      1        4         0        0         0           0            python3,xg++,cc1plus,sh,make                      
762349>  farihaho  0d20h10m  ml3   1        1         1        1         0        0         0           0            jupyter-lab                                       
831801>  nehad     0d20h 5m  ml3   1        1         1        1         0        0         0           0            jupyter-lab                                       
2363539  sherinsu  0d 2h35m  ml3   4506     40094     5        49        0        0         0           0            sqlitebin-pv5.1,python3,make,cc1plus,ninja,cmake  

The machine in question has 56 cores, so the CPU peak for the last job doesn't make a lot of sense (nor the one higher up).

As a hypothesis: there may be many processes in that job, each with a single sample - short-lived compilation processes that the sampler sees only once. In that case, that single sample's CPU utilization is the true value reported by the OS, even if the process is otherwise idle. These processes are all merged into the same job even though each has only one sample, because when we filter by sample count we filter on the merged stream of samples.
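A minimal sketch of the arithmetic behind this hypothesis (the process names, counts, and merge rule here are illustrative assumptions, not sonalyze's actual code):

```python
# Each short-lived compiler process contributes exactly one sample at
# the same sampling instant, reporting the CPU% the OS computed for it
# (cumulative CPU time over its short lifetime, so it can be high).
samples_at_t = [("cc1plus", 98.0) for _ in range(60)]  # 60 one-sample processes

# Merging the processes into one job sums the per-process CPU% at each
# timestamp, so the job-level figure can exceed what the machine can
# physically sustain at any single instant.
job_cpu_at_t = sum(cpu for _, cpu in samples_at_t)  # 5880.0

cores = 56
physical_ceiling = cores * 100  # 5600 on the 56-core machine
```

With 60 single-sample processes at 98% each, the merged "peak" is 5880, above the 5600 ceiling, even though the processes never ran concurrently at that utilization.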

(It's not necessarily wrong to filter that way, so this needs to be thought about carefully.)

lars-t-hansen commented 9 months ago

Case in point:

Screenshot from 2023-10-10 12-21-56

In this case a user was seriously overloading the machine, but this plot is clearly off in a couple of ways:

lars-t-hansen commented 5 months ago

This remains annoying, throws off a number of graphs, and looks incompetent. We can't simply pin every relative measure at 100%, because for virtual memory 100% is not actually the ceiling (relative virtual memory is not a true percentage). But a hard limit may still be the right thing for some relative measures.
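A sketch of the hard-limit idea, with the important carve-out for virtual memory; the field names and the choice of which measures to clamp are assumptions for illustration, not the actual sonalyze design:

```python
# True percentages, where 100 is a hard physical ceiling.
CLAMPABLE = {"cpu_pct", "rss_pct"}
# Pseudo-percentages: relative virtual memory can legitimately exceed 100.
UNCLAMPED = {"vmem_pct"}

def clamp_relative(field: str, value: float) -> float:
    """Pin true relative measures at 100%; leave pseudo-percentages alone."""
    if field in CLAMPABLE:
        return min(value, 100.0)
    return value
```

Under this rule a sampling artifact like cpu_pct = 716.0 would be presented as 100.0, while vmem_pct = 250.0 would pass through unchanged.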

lars-t-hansen commented 5 months ago

Also see https://github.com/NordicHPC/sonar/issues/89.

lars-t-hansen commented 4 months ago

Make sure that the profile plot also deals with this problem. Ideally the fix would be in sonalyze, so that downstream code doesn't have to think about it, but I'm sure there are issues with that too.

lars-t-hansen commented 4 months ago

I'm thinking of a simple fix:

lars-t-hansen commented 4 months ago

Interesting: for profile there is already a max setting that implements clamping, as part of the aggregation args. The dashboard does not use it when it generates the profile. Nor could it, really, since the config file is not available to the dashboard.

Nor is the config file available to profile at present.

lars-t-hansen commented 4 months ago

Would it make sense for clamp to be a format option instead of a separate command line argument? The closest we come to something like that is the current nodefaults format option. The advantage is that it would make plain that this is a printing option that does not affect how data are processed and aggregated, only how they are presented.

lars-t-hansen commented 4 months ago

Also observed: resident memory > 100%, probably another sampling phenomenon (ml6, 13 March). The system is extremely heavily loaded, and even though RSS of course never went above 100%, the way we accumulate and average the data may lead to these artifacts.
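One way such an accumulation artifact can arise (the mechanics below are an assumed illustration, not sonalyze's actual aggregation code): if per-process peaks are taken separately and then combined, processes that peak at different instants can sum to a job-level peak that never actually occurred.

```python
# rss_pct samples for two processes across three sampling instants.
proc_a = [60.0, 10.0, 10.0]  # peaks at the first instant
proc_b = [10.0, 10.0, 60.0]  # peaks at the last instant

# Instantaneous job RSS never exceeds 100% of physical memory:
instant = [a + b for a, b in zip(proc_a, proc_b)]  # [70.0, 20.0, 70.0]

# But summing each process's peak taken independently yields 120%,
# an artifact of the aggregation rather than real memory use.
artifact_peak = max(proc_a) + max(proc_b)  # 120.0
```

The true peak (70%) stays under the ceiling; only the combined-peaks figure breaks it.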

lars-t-hansen commented 3 months ago

There is a lengthy discussion on https://github.com/NordicHPC/sonar/issues/89 about what's going on here, and it essentially comes down to this:

The sonar fixes are straightforward and are pending in https://github.com/NordicHPC/sonar/pull/169. The sonalyze fix may not be a lot of code, and the cost of the adjustment will likely be moderate, but the logic is going to be quite tricky. It may be that we'll let this fix wait until the Go version is working; it might be easier to express in Go than in Rust. A design is forthcoming.

lars-t-hansen commented 1 month ago

Status and observations:

Given that this is not a problem for batch systems, I'll remove the pri:high label and unblock M2.5: batchless systems are increasingly an anomaly, and the focus should be on systems with a batch queue.