NAICNO / Jobanalyzer


NRIS leadership meeting prototype report #541

Open · lars-t-hansen opened this issue 3 months ago

lars-t-hansen commented 3 months ago

(Also see #522 and all the reports in the adhoc-reports directory.)

Scope: Produce a script, run manually as needed, that generates text and graphics (plausibly) suitable for the thrice-annual NRIS leadership meeting.

Deadline: September 2024 meeting (a little tight).

Contents: We have some issues because we currently receive no data from the NRIS systems, and from Fox we get no Slurm data. (We can hope to have both by mid-August, but I'm not holding my breath.) So a report for the September meeting will not show off anything near our full (eventual) capability.

All that said, we possibly want

@sabryr, please add concrete ideas.

Sabryr commented 2 months ago

The date for the meeting is September 4, 2024, and a slot request has been sent to the meeting organizers.

Sabryr commented 2 months ago

Some discussions are here https://gitlab.sigma2.no/naic/wp2/identify-most-resource-intensive-users/-/issues/6.

Sabryr commented 2 months ago

If this makes sense, we can come up with some categories for the prototype.

lars-t-hansen commented 2 months ago

Parameters:

Reports, in descending priority order roughly:

"underutilization" is a tricky concept. One thing is what they used vs what they booked: for a CPU for example, what they "booked" is elapsed time x number of cores. What they used is the sum of CPU time across the cores; that's one view. But a different view is that if they used 100% of the cores at 100% for some period of time, and much less the rest of the time, then the system is still underutilized on average but not at peak, and really what's going on is that there are possibly multiple job steps that could have different resource limits. That last bit is important but hard for users to relate to, probably. So a useful qualifier on the report would be information about the % of time that the job had poor utilization, say. A job that underutilizes the system 100% of the time is perhaps worse than one that underutilizes it 50% of the time?

Sabryr commented 2 months ago

Your comments came in on my side after I typed, and I think you have a better approach to the issue. Let's present "Reports, in descending priority order roughly:" as you outlined. After presenting we can get feedback and improve iteratively.

lars-t-hansen commented 1 month ago

There's a lot of noise in the data, so let's focus a little further. Observe that, by and large, jobs that run quickly (even with large reservations) or reserve few resources are of no consequence for right-sizing (unless there are very many of them). Also, for obvious reasons (testing, development, profiling, crashes), short and/or small jobs are going to be plentiful no matter what we do; you can't right-size these because size is beside the point. It's the large production jobs that run for a long time with too-large reservations that are the real problem.

Based on Slurm data from Fox, here is a heat map of regular (non-array) jobs that ran for at least 2h real time and reserved at least 20GB RAM and 32 CPUs. The cells count how many of the selected jobs fall into each category. Color indicates how "hot" things are; darker colors are hotter. Utilization of the CPU reservation runs to the right along the X axis, utilization of the memory reservation down along the Y axis; each box represents a 5% quantum of used-vs-reserved, i.e., in the top left box the program used 1-5% of its reserved CPU and 1-5% of its reserved RAM (at peak).

[Heat map: fox-2h-20G-32cpu-3d]
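For reference, a sketch of how the filtering and binning behind such a heat map might look; the field names are illustrative assumptions, and this is not the actual adhoc-reports code:

```python
import numpy as np
import matplotlib.pyplot as plt

def heat_map(jobs, out="heatmap.png"):
    # Keep regular jobs that ran >= 2h and reserved >= 20GB RAM and >= 32 CPUs.
    jobs = [j for j in jobs
            if j["elapsed_h"] >= 2 and j["res_gb"] >= 20 and j["res_cpus"] >= 32]
    grid = np.zeros((20, 20))  # rows = RAM bins, cols = CPU bins, 5% per bin
    for j in jobs:
        col = min(int(j["peak_cpu_pct"] // 5), 19)  # 0 -> 0-5%, ..., 19 -> 95-100%
        row = min(int(j["peak_ram_pct"] // 5), 19)
        grid[row, col] += 1
    plt.imshow(grid, cmap="Reds")  # darker cell = more jobs = "hotter"
    plt.xlabel("peak CPU used, % of reserved (5% bins)")
    plt.ylabel("peak RAM used, % of reserved (5% bins)")
    plt.colorbar(label="job count")
    plt.savefig(out)
```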

I've extracted only the last 3 days of data here, so this is probably not representative, but we see that the upper left quadrant is heavily overrepresented: out of 23 jobs (I'm looking at the background data), 16 are using at most 25% CPU and 35% memory. The remaining 7 are better balanced but could clearly be adjusted (one group underutilizes memory, the other underutilizes CPU).

There are probably statistical measures that capture the same thing as the heat map, although the heat maps are useful for visually exploring the parameter space. For a "right-sized" system we'd more or less expect that larger jobs will tend to be closer to the right bottom corner of the map.

Obvious questions here have to do with how large jobs have to be before they are large enough for this type of analysis, and whether the relative sizes of the jobs that pass that filter matter - or maybe, is a "large" job a job that is within some factor of the largest job? Clearly the cutoff may be relative to the size of machine, or the island, or each node, or whatever.

lars-t-hansen commented 1 month ago

I think that for the leadership meeting, the requirements are for

In particular, the leadership meeting - once every 4 months - will not be interested in details of individual jobs, why they are unbalanced, etc - that is advanced user support. The point for the leadership meeting is to make decisions about dedicating more resources to user support, analysis, etc.

The data I'm getting from Fox now are strongly multimodal and hard to present simply in a chart, but suppose we can come up with one representative metric per axis (I don't think the mean works well, but for simplicity let's say we use the mean). Then, for a set of basic selection criteria on the size of the job, we get a set of job records that fall under those criteria (for example, at least 20GB RAM and 32 cores requested; or maybe there is min/max selection to get nonoverlapping sets). There can be several such sets of criteria, but if we have too many we'll get lost in the data. Then, for a sliding time window across the list of jobs, we can plot the lines for all the metrics in the same chart, as sketched below. This will give an impression of evolution over time - are we improving?
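A sketch of the sliding-window idea, producing one trend line per criteria set; again the field names are hypothetical, and the mean stands in for the representative metric:

```python
from datetime import timedelta

def trend_line(jobs, window=timedelta(days=30), step=timedelta(days=7)):
    # jobs: records matching one set of selection criteria, each with an
    # "end" timestamp and a "peak_cpu_pct" used-vs-reserved figure.
    jobs = sorted(jobs, key=lambda j: j["end"])
    t, points = jobs[0]["end"] + window, []
    while t <= jobs[-1]["end"]:
        in_win = [j for j in jobs if t - window <= j["end"] <= t]
        if in_win:
            mean = sum(j["peak_cpu_pct"] for j in in_win) / len(in_win)
            points.append((t, mean))  # one point on this criteria set's line
        t += step
    return points  # plot all criteria sets' lines in the same chart
```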

The list of underutilizing and saturating users would be relative to each set of selection criteria (and would likely have additional selection criteria) and would probably be presented along with the graph.

The heat map may not have enormous relevance for the leadership meeting - it gives an impression of how things have been over the time period it represents, and displays the multimodality of the data, but it has no history (unless we animate it) and is not easy to subject to further analysis. So maybe it's worthwhile presenting it, to give an impression of the situation, but it can't be the main deliverable.

lars-t-hansen commented 1 month ago

I guess we have produced code, slides, and a report now. It's not completely spot-on for what's outlined above, but it's pretty good as things go; in particular, it speaks to my comment https://github.com/NAICNO/Jobanalyzer/issues/541#issuecomment-2277753193 reasonably well and should be at a sensible level for the leadership meeting. And it's easy to see how to turn it into a standard report that produces, say, HTML + images, Markdown, or similar.

There's still an argument to be made that there is lower-level information available that could be given to AUS-type people to help users improve their jobs, and the longer writeup has some of that detail, if needed.

Going to leave this ticket open so that we can maybe produce a canned artifact that produces a report once we decide how that should look, but will remove the pri:high label.

lars-t-hansen commented 1 month ago

The adhoc-reports code that's currently on larstha-459-sacct should not land with that branch but should move to this one.

lars-t-hansen commented 1 month ago

Blocked on:

lars-t-hansen commented 1 month ago

I'm going to dump various notes about analyses here.


(From Sabry)


The heat map can be made interactive and could possibly be used to "explore" the data, but I'm not sure what the utility is yet.


The heavy-gpu-users report (#522) is the type of thing that will allow us to find users that should be moved to bigger systems, maybe.