NAICNO / Jobanalyzer


NRIS leadership meeting prototype report #541

Open · lars-t-hansen opened this issue 3 months ago

lars-t-hansen commented 3 months ago

(Also see #522 and all the reports in the adhoc-reports directory.)

Scope: Produce a script, run manually as needed, that generates text and graphics (plausibly) suitable for the thrice-annual NRIS leadership meeting.

Deadline: September 2024 meeting (a little tight).

Contents: We have some issues because we currently receive no data from the NRIS systems, and from Fox we get no Slurm data. (We can hope to have both by mid-August, but I'm not holding my breath.) So a report for the September meeting will not show off anything near our full (eventual) capability.

All that said, we possibly want

@sabryr, please add concrete ideas.

Sabryr commented 2 months ago

The date for the meeting is September 4, 2024, and a slot request has been sent to the meeting organizers.

Sabryr commented 2 months ago

Some discussions are here https://gitlab.sigma2.no/naic/wp2/identify-most-resource-intensive-users/-/issues/6.

Sabryr commented 2 months ago

If this makes sense, we can come up with some categories for the prototype.

lars-t-hansen commented 2 months ago

Parameters:

Reports, in descending priority order roughly:

"underutilization" is a tricky concept. One thing is what they used vs what they booked: for a CPU for example, what they "booked" is elapsed time x number of cores. What they used is the sum of CPU time across the cores; that's one view. But a different view is that if they used 100% of the cores at 100% for some period of time, and much less the rest of the time, then the system is still underutilized on average but not at peak, and really what's going on is that there are possibly multiple job steps that could have different resource limits. That last bit is important but hard for users to relate to, probably. So a useful qualifier on the report would be information about the % of time that the job had poor utilization, say. A job that underutilizes the system 100% of the time is perhaps worse than one that underutilizes it 50% of the time?

Sabryr commented 2 months ago

Your comments came in on my side after I typed, and I think you have a better approach to the issue. Let's present "Reports, in descending priority order roughly:" as you outlined. After presenting we can get feedback and improve iteratively.

lars-t-hansen commented 1 month ago

There's a lot of noise in the data, so let's focus a little further. Observe that, by and large, jobs that run quickly (even with large reservations) or reserve few resources are of no consequence for right-sizing (unless there are very many of them). Also, for obvious reasons (testing, development, profiling, crashes), short and/or small jobs are going to be plentiful no matter what we do; you can't right-size these because size is beside the point. It's the large production jobs that run for a long time with too-large reservations that are the real problem.

Based on Slurm data from Fox, here is a heat map of regular (non-array) jobs that ran for at least 2h real time and reserved at least 20GB RAM and 32 CPUs. The cells count how many of the selected jobs fall into each category. Color indicates how "hot" things are; darker colors are hotter. Utilization of the CPU reservation runs to the right along the X axis, utilization of the memory reservation down along the Y axis; each box represents a 5% quantum of used-vs-reserved, i.e., in the top left box the program used 1-5% of its reserved CPU and 1-5% of its reserved RAM (at peak).

[Heat map: fox-2h-20G-32cpu-3d]
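For reference, a sketch of how the filtering and binning behind such a heat map might look; the field names are illustrative assumptions, and this is not the actual adhoc-reports code:

```python
import numpy as np
import matplotlib.pyplot as plt

def heat_map(jobs, out="heatmap.png"):
    # Keep regular jobs that ran >= 2h and reserved >= 20GB RAM and >= 32 CPUs.
    jobs = [j for j in jobs
            if j["elapsed_h"] >= 2 and j["res_gb"] >= 20 and j["res_cpus"] >= 32]
    grid = np.zeros((20, 20))  # rows = RAM bins, cols = CPU bins, 5% per bin
    for j in jobs:
        col = min(int(j["peak_cpu_pct"] // 5), 19)  # 0 -> 0-5%, ..., 19 -> 95-100%
        row = min(int(j["peak_ram_pct"] // 5), 19)
        grid[row, col] += 1
    plt.imshow(grid, cmap="Reds")  # darker cell = more jobs = "hotter"
    plt.xlabel("peak CPU used, % of reserved (5% bins)")
    plt.ylabel("peak RAM used, % of reserved (5% bins)")
    plt.colorbar(label="job count")
    plt.savefig(out)
```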

I've extracted only the last 3 days of data here, so this is probably not representative, but we see that the upper left quadrant is heavily overrepresented: out of 23 jobs (I'm looking at the background data), 16 are using at most 25% CPU and 35% memory. The remaining 7 are better balanced but could clearly be adjusted (one group underutilizes memory, the other underutilizes CPU).

There are probably statistical measures that capture the same thing as the heat map, although the heat maps are useful for visually exploring the parameter space. For a "right-sized" system we'd more or less expect that larger jobs will tend to be closer to the right bottom corner of the map.

Obvious questions here have to do with how large jobs have to be before they are large enough for this type of analysis, and whether the relative sizes of the jobs that pass that filter matter - or maybe, is a "large" job a job that is within some factor of the largest job? Clearly the cutoff may be relative to the size of machine, or the island, or each node, or whatever.

lars-t-hansen commented 1 month ago

I think that for the leadership meeting, the requirements are for

In particular, the leadership meeting - once every 4 months - will not be interested in details of individual jobs, why they are unbalanced, etc - that is advanced user support. The point for the leadership meeting is to make decisions about dedicating more resources to user support, analysis, etc.

The data I'm getting from Fox now are strongly multimodal and hard to present simply in a chart, but suppose we can come up with one representative metric per axis (I don't think the mean works well, but for simplicity let's say we use the mean). Then, for a set of basic selection criteria on the size of the job, we get a set of job records that fall under those criteria (for example, at least 20GB RAM and 32 cores requested; or maybe there is min/max selection to get nonoverlapping sets). There can be several such sets of criteria, but if we have too many we'll get lost in the data. Then, for a sliding time window across the list of jobs, we can plot the lines for all the metrics in the same chart, as sketched below. This will give an impression of evolution over time - are we improving?
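A sketch of the sliding-window idea, producing one trend line per criteria set; again the field names are hypothetical, and the mean stands in for the representative metric:

```python
from datetime import timedelta

def trend_line(jobs, window=timedelta(days=30), step=timedelta(days=7)):
    # jobs: records matching one set of selection criteria, each with an
    # "end" timestamp and a "peak_cpu_pct" used-vs-reserved figure.
    jobs = sorted(jobs, key=lambda j: j["end"])
    t, points = jobs[0]["end"] + window, []
    while t <= jobs[-1]["end"]:
        in_win = [j for j in jobs if t - window <= j["end"] <= t]
        if in_win:
            mean = sum(j["peak_cpu_pct"] for j in in_win) / len(in_win)
            points.append((t, mean))  # one point on this criteria set's line
        t += step
    return points  # plot all criteria sets' lines in the same chart
```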

The list of underutilizing and saturating users would be relative to each set of selection criteria (and would likely have additional selection criteria) and would probably be presented along with the graph.

The heat map may not have enormous relevance for the leadership meeting - it gives an impression of how things have been over the time period it represents, and displays the multimodality of the data, but it has no history (unless we animate it) and is not easy to subject to further analysis. So maybe it's worthwhile presenting it, to give an impression of the situation, but it can't be the main deliverable.

lars-t-hansen commented 1 month ago

I guess we have produced code, slides, and a report now. It's not completely spot-on for what's outlined above, but it's pretty good as things go; in particular, it speaks to my comment https://github.com/NAICNO/Jobanalyzer/issues/541#issuecomment-2277753193 reasonably well and should be at a sensible level for the leadership meeting. And it's easy to see how to turn it into a standard report that produces, say, HTML + images, Markdown, or similar.

There's still an argument to be made that there is lower-level information available that could be given to AUS-type people to help users improve their jobs, and the longer writeup has some of that detail, if needed.

Going to leave this ticket open so that we can maybe produce a canned artifact that produces a report once we decide how that should look, but will remove the pri:high label.

lars-t-hansen commented 1 month ago

The adhoc-reports code that's currently on larstha-459-sacct should not land with that branch but should move to this one.

lars-t-hansen commented 1 month ago

Blocked on:

lars-t-hansen commented 1 month ago

I'm going to dump various notes about analyses here.


(From Sabry)


The heat map can be made interactive and could possibly be used to "explore" the data, but I'm not sure what the utility is yet.


The heavy-gpu-users report (#522) is the type of thing that will allow us to find users that should be moved to bigger systems, maybe.