Open lars-t-hansen opened 10 months ago
Case in point:
In this case there was a user who was overloading the machine seriously, but clearly this plot is off in a couple of ways:
This remains annoying, throws off a bunch of graphs, and looks incompetent. We can't just pin every relative measure at 100%, because for virtual memory that is not actually the ceiling: relative virtual memory is not a true percentage. But it may still be that a hard limit for some relative measures is the right thing.
Make sure that the profile plot also deals with this problem. Ideally the fix is in sonalyze, so that no downstream code has to think about it, but I'm sure there are issues with that too.
I'm thinking of a simple fix: a `--clamp` switch for sonalyze that introduces clamping behavior as a stopgap. It would apply to `jobs` and `load` and probably `profile`; we might not do it for `parse` (aka "export").

Interesting: for `profile` there is already `max` to implement clamping, part of the aggregation args. This is not used by the dashboard when it generates the profile. Nor could it be, really, since the config file is not available to the dashboard. Nor is the config file available to `profile` at present.
Would it make sense for `clamp` to be a format option instead of a separate command line argument? The closest we come to something like that is the current `nodefaults` format option. The advantage is that it would make it plain that this is a printing option that does not affect how data are processed and aggregated, only how they are presented.
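To make the idea concrete, here is a minimal sketch of what a print-time clamp could look like. Everything here is illustrative: the `clamp` option, the field names, and the notion of which fields count as "true percentages" are assumptions, not sonalyze's actual code. The point is only that clamping would happen at presentation time, leave aggregation untouched, and exclude relative virtual memory because it has no 100% ceiling.

```go
// Illustrative sketch only: the field names, the "clamp" option, and these
// data structures are hypothetical, not sonalyze's actual API. The point is
// that clamping happens at print time and never touches aggregation.
package main

import "fmt"

// clampable marks fields that are true percentages (0-100) and so may be
// clamped for presentation; relative virtual memory is deliberately excluded
// because it is not a true percentage.
var clampable = map[string]bool{
	"rcpu": true,
	"rmem": true,
	"rgpu": true,
	// "rvmem" is absent: virtual memory has no 100% ceiling
}

// formatValue renders one field value, optionally pinning it at 100.
func formatValue(field string, v float64, clamp bool) string {
	if clamp && clampable[field] && v > 100 {
		v = 100
	}
	return fmt.Sprintf("%.0f", v)
}

func main() {
	// With the clamp option the out-of-range artifact is pinned at 100;
	// without it the raw aggregated value is shown unchanged.
	fmt.Println(formatValue("rcpu", 137.5, true))  // "100"
	fmt.Println(formatValue("rcpu", 137.5, false)) // "138"
}
```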
Also observed: resident memory > 100%, probably another sampling phenomenon (ml6, 13 March). The system is extremely heavily loaded, and even though rss of course never went above 100%, the way we accumulate and average the data may lead to these artifacts.
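Purely as a hypothetical illustration of how such an artifact could arise (an assumption about the mechanism, not a diagnosis of sonalyze's actual accumulation logic): if per-process values are averaged over each process's own lifetime and then summed, two processes that never coexist can add up to more than 100%.

```go
// Hypothetical illustration (not a description of sonalyze's code): two
// processes that never coexist, each using 60% of RAM while alive. Averaging
// each process over its own samples and then summing the averages yields
// 120%, even though instantaneous usage never exceeded 60%.
package main

import "fmt"

// avg returns the mean of the samples a process actually produced.
func avg(samples []float64) float64 {
	s := 0.0
	for _, v := range samples {
		s += v
	}
	return s / float64(len(samples))
}

func main() {
	procA := []float64{0.60}       // one sample, then the process exits
	procB := []float64{0.60, 0.60} // starts later, two samples

	// Sum of per-process averages (artifact-prone): 60% + 60% = 120%.
	fmt.Printf("sum of per-process averages: %.0f%%\n", (avg(procA)+avg(procB))*100)

	// Time-aligned view: at no instant did total rss exceed 60%.
	fmt.Printf("true instantaneous peak: %.0f%%\n", 60.0)
}
```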
There is a lengthy discussion on https://github.com/NordicHPC/sonar/issues/89 about what's going on here, and it essentially comes down to this:
- with `--batchless`, the job id needs to just be the Linux process group id
- `--batchless` and `--rollup` can currently be used together, but this causes some data loss, so this combination must be prevented

The sonar fixes are straightforward and are pending on https://github.com/NordicHPC/sonar/pull/169. The sonalyze fix may not be a lot of code, and the cost of the adjustment will likely be moderate, but the logic is going to be quite tricky. It may be that this is a fix we'll let be until the Go version is working; it might be easier to express in Go than in Rust. A design is forthcoming.
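For illustration only (sonar itself is Rust, and the actual fix is in the PR above), the job-id rule under `--batchless` amounts to nothing more than reading the process group id. A hedged Go sketch of the idea:

```go
// Illustrative sketch only, not sonar's code: under --batchless a process's
// "job id" is simply its Linux process group id, so every process in the
// same pgid collapses into one job downstream.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// batchlessJobID returns the synthesized job id for a process.
func batchlessJobID(pid int) (uint32, error) {
	pgid, err := syscall.Getpgid(pid)
	if err != nil {
		return 0, err
	}
	return uint32(pgid), nil
}

func main() {
	id, err := batchlessJobID(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, "getpgid:", err)
		os.Exit(1)
	}
	fmt.Println("job id:", id)
}
```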
Status and observations:
Given that it's not a problem for batch systems, I'll remove the pri:high label and unblock M2.5, because batchless systems are increasingly an anomaly. The focus should be on systems with a batch queue.
I noticed a couple of spikes well above 100% CPU in the plots for ml3 around 9/16 and 9/18. Digging in, these come from clusters of jobs that probably contain a lot of different processes, in this case for compilation:
The machine in question has 56 cores, so the CPU peak for the last job doesn't make a lot of sense (nor the one higher up).
As a hypothesis, there may be many processes in that job but they'll have one sample each: short-lived compilation processes that are seen by the sampler. In that case, this being the only sample, the value picked up for CPU utilization is the true value from the OS, even if the job is otherwise idle. These processes are all merged into the same job even when there's only one sample, because when filtering by sample count we filter on the merged stream of samples.
(It's not necessarily wrong to filter that way, so this needs to be thought about carefully.)
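To make the trade-off concrete, here is a hedged sketch of the two filtering policies; the types and the minimum-sample threshold are assumptions for illustration, not sonalyze's actual structures. Filtering on the merged stream keeps a job whose many short-lived processes each contributed a single, possibly misleading CPU sample; filtering per process before merging would drop those samples.

```go
// Illustration of the two filtering policies; the types and threshold are
// assumptions for the sake of the example, not sonalyze's actual code.
package main

import "fmt"

type proc struct {
	samples int // number of samples observed for this process
}

const minSamples = 2 // hypothetical "ignore one-off samples" threshold

// keepJobMerged filters on the merged stream: the job survives as long as
// the job as a whole has enough samples, even if every process has only one.
func keepJobMerged(procs []proc) bool {
	total := 0
	for _, p := range procs {
		total += p.samples
	}
	return total >= minSamples
}

// keepJobPerProcess filters each process first: one-sample processes (e.g.
// short-lived compiler invocations) are dropped before they can contribute
// a spurious CPU spike to the job.
func keepJobPerProcess(procs []proc) bool {
	kept := 0
	for _, p := range procs {
		if p.samples >= minSamples {
			kept++
		}
	}
	return kept > 0
}

func main() {
	// A "compilation job": many processes, one sample each.
	job := []proc{{1}, {1}, {1}, {1}}
	fmt.Println("merged-stream filter keeps it:", keepJobMerged(job))     // true
	fmt.Println("per-process filter keeps it:  ", keepJobPerProcess(job)) // false
}
```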