NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Continued issues with CPU > 100% (batchless only) #77

Open lars-t-hansen opened 10 months ago

lars-t-hansen commented 10 months ago

I noticed a couple of spikes well above 100% CPU in the plots for ml3 around 9/16 and 9/18. Digging in, these come from jobs that comprise clusters of many different processes, in this case for compilation:

$ ./sonalyze jobs --host ml3 -f4d --data-path data -u-
jobm     user      duration  host  cpu-avg  cpu-peak  mem-avg  mem-peak  gpu-avg  gpu-peak  gpumem-avg  gpumem-peak  cmd                                               
781733   karths    0d 0h15m  ml3   14       22        1        1         0        0         0           0            conda                                             
986189   sherinsu  0d 1h55m  ml3   849      6428      1        4         0        0         0           0            python3,xg++,cc1plus,sh,make                      
762349>  farihaho  0d20h10m  ml3   1        1         1        1         0        0         0           0            jupyter-lab                                       
831801>  nehad     0d20h 5m  ml3   1        1         1        1         0        0         0           0            jupyter-lab                                       
2363539  sherinsu  0d 2h35m  ml3   4506     40094     5        49        0        0         0           0            sqlitebin-pv5.1,python3,make,cc1plus,ninja,cmake  

The machine in question has 56 cores, so the CPU peak for the last job doesn't make a lot of sense (nor the one higher up).

As a hypothesis: there may be many processes in that job, each with a single sample - short-lived compilation processes that the sampler sees only once. In that case, that single sample's CPU utilization is the true value reported by the OS, even if the process is otherwise idle. These processes are all merged into the same job even though each has only one sample, because when we filter by sample count we filter on the merged stream of samples.
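A minimal sketch of the arithmetic behind this hypothesis (the process names, counts, and merge rule here are illustrative assumptions, not sonalyze's actual code):

```python
# Each short-lived compiler process contributes exactly one sample at
# the same sampling instant, reporting the CPU% the OS computed for it
# (cumulative CPU time over its short lifetime, so it can be high).
samples_at_t = [("cc1plus", 98.0) for _ in range(60)]  # 60 one-sample processes

# Merging the processes into one job sums the per-process CPU% at each
# timestamp, so the job-level figure can exceed what the machine can
# physically sustain at any single instant.
job_cpu_at_t = sum(cpu for _, cpu in samples_at_t)  # 5880.0

cores = 56
physical_ceiling = cores * 100  # 5600 on the 56-core machine
```

With 60 single-sample processes at 98% each, the merged "peak" is 5880, above the 5600 ceiling, even though the processes never ran concurrently at that utilization.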

(It's not necessarily wrong to filter that way, so this needs to be thought about carefully.)

lars-t-hansen commented 9 months ago

Case in point:

Screenshot from 2023-10-10 12-21-56

In this case a user was seriously overloading the machine, but this plot is clearly off in a couple of ways:

lars-t-hansen commented 5 months ago

This remains annoying, throws off a number of graphs, and looks incompetent. We can't simply pin every relative measure at 100%, because for virtual memory 100% is not actually the ceiling (relative virtual memory is not a true percentage). But a hard limit may still be the right thing for some relative measures.
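A sketch of the hard-limit idea, with the important carve-out for virtual memory; the field names and the choice of which measures to clamp are assumptions for illustration, not the actual sonalyze design:

```python
# True percentages, where 100 is a hard physical ceiling.
CLAMPABLE = {"cpu_pct", "rss_pct"}
# Pseudo-percentages: relative virtual memory can legitimately exceed 100.
UNCLAMPED = {"vmem_pct"}

def clamp_relative(field: str, value: float) -> float:
    """Pin true relative measures at 100%; leave pseudo-percentages alone."""
    if field in CLAMPABLE:
        return min(value, 100.0)
    return value
```

Under this rule a sampling artifact like cpu_pct = 716.0 would be presented as 100.0, while vmem_pct = 250.0 would pass through unchanged.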

lars-t-hansen commented 5 months ago

Also see https://github.com/NordicHPC/sonar/issues/89.

lars-t-hansen commented 4 months ago

Make sure that the profile plot also deals with this problem. Ideally the fix would be in sonalyze, so that downstream code doesn't have to think about it, but I'm sure there are issues with that too.

lars-t-hansen commented 4 months ago

I'm thinking of a simple fix:

lars-t-hansen commented 4 months ago

Interesting: for profile there is already a max setting that implements clamping, as part of the aggregation args. The dashboard does not use it when it generates the profile. Nor could it, really, since the config file is not available to the dashboard.

Nor is the config file available to profile at present.

lars-t-hansen commented 4 months ago

Would it make sense for clamp to be a format option instead of a separate command line argument? The closest we come to something like that is the current nodefaults format option. The advantage is that it would make plain that this is a printing option that does not affect how data are processed and aggregated, only how they are presented.

lars-t-hansen commented 4 months ago

Also observed: resident memory > 100%, probably another sampling phenomenon (ml6, 13 March). The system is extremely heavily loaded, and even though RSS of course never went above 100%, the way we accumulate and average the data may lead to these artifacts.
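One way such an accumulation artifact can arise (the mechanics below are an assumed illustration, not sonalyze's actual aggregation code): if per-process peaks are taken separately and then combined, processes that peak at different instants can sum to a job-level peak that never actually occurred.

```python
# rss_pct samples for two processes across three sampling instants.
proc_a = [60.0, 10.0, 10.0]  # peaks at the first instant
proc_b = [10.0, 10.0, 60.0]  # peaks at the last instant

# Instantaneous job RSS never exceeds 100% of physical memory:
instant = [a + b for a, b in zip(proc_a, proc_b)]  # [70.0, 20.0, 70.0]

# But summing each process's peak taken independently yields 120%,
# an artifact of the aggregation rather than real memory use.
artifact_peak = max(proc_a) + max(proc_b)  # 120.0
```

The true peak (70%) stays under the ceiling; only the combined-peaks figure breaks it.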

lars-t-hansen commented 3 months ago

There is a lengthy discussion on https://github.com/NordicHPC/sonar/issues/89 about what's going on here, and it essentially comes down to this:

The sonar fixes are straightforward and are pending in https://github.com/NordicHPC/sonar/pull/169. The sonalyze fix may not be a lot of code, and the cost of the adjustment will likely be moderate, but the logic is going to be quite tricky. It may be that we'll let this fix wait until the Go version is working; it might be easier to express in Go than in Rust. A design is forthcoming.

lars-t-hansen commented 1 month ago

Status and observations:

Given that this is not a problem for batch systems, I'll remove the pri:high label and unblock M2.5: batchless systems are increasingly an anomaly, and the focus should be on systems with a batch queue.