NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Slurm integration #66

Open lars-t-hansen opened 1 year ago

lars-t-hansen commented 1 year ago

Mostly some notes for now, not sure yet where this fits in or how much is needed / desirable. Slurm logs have data that we can't obtain with Sonar even in principle (such as waiting times), and sometimes data that are better (such as precise run time and resource use), and some of it may be useful to augment Sonar logs.

A conversation on Mattermost:

(bart) Does anyone know if we've some stats/graphs on the actual load/wait time of gpu jobs on Fox e.g. grafana? Guess we can fetch something from the slurmdb, but perhaps there is something already available?

(ole) squeue | grep accel | wc -l

(bjørn helge) sacct, or direct sql to the mariadb will give wait and run times of jobs, current and historic. For current and pending, squeue is a good option, as Ole suggested.

Sabryr commented 12 months ago

@lars-t-hansen When it comes to HPC, the current solution looks like this: https://fitsm.pages.sigma2.no/isrm/meeting-minutes/2023/09/2023-09-13-MEETING-REC-ISRM-ukevakt.html

I will start a mail thread with Andreas Bach for you to get more details on where the data comes from.

lars-t-hansen commented 10 months ago

Another aspect of Slurm integration is to compute policy violations: serious discrepancies between requested resources (cpu, memory, gpu) and actual use while the application is active. To do this we must compare either sonar data or slurm usage statistics to the slurm requests. It will be important not to duplicate existing functionality in our pipeline: if slurm tools already provide this, then at most we want to expose the existing slurm data in our dashboard and add value where we can.
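For the slurm-versus-slurm comparison, a minimal sketch (assuming we go through sacct rather than direct SQL) is to pull the request and usage fields side by side and compare them downstream; the fields named below are standard sacct fields that also show up in the sacct output later in this thread:

```bash
# Hedged sketch: requested vs. consumed resources for a finished job.
# JOBID is a placeholder.
sacct -j JOBID -p -n -o JobIDRaw,State,Elapsed,ReqTRES,AllocTRES,TRESUsageInTot
# A policy check would compare, e.g., cpu/mem/gres/gpu in ReqTRES against the
# corresponding entries in TRESUsageInTot for the job's steps.
```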

A complementary view is that the stream of data from sonar is heterogeneous, see eg https://github.com/NordicHPC/sonar/issues/105, and that sonar can obtain the data from the batch system and forward it to our pipeline. But it's not obvious that this is the best method for cross-node data from slurm.

lars-t-hansen commented 10 months ago

One item mentioned in the requirements is "queue length", part of the main dashboard.

lars-t-hansen commented 8 months ago

I'm going to unhook this from the Fox milestone since it's unclear precisely what we want or need here.

lars-t-hansen commented 5 months ago

Looking at the jobgraph sources, it runs sacct -j jobno --json to get information for a (completed) job. We may wish to do something similar. This could lead to a system where, on a cluster, a periodic job contacts naic-monitor to get a list of jobs finished in the interval since the last time we got data and then runs sacct to get information about those jobs, which is then exfiltrated. We could make this resilient and we could probably avoid local state on the cluster: on naic-monitor there would be a database of jobs whose information we've obtained and those that are still pending. It's still annoying that the job on the cluster has to perform a GET but it's not the end of the world and there won't be much data. I think there would probably be only one job like this on the cluster, running on only one node (though some kind of redundancy could be nice). Given the light load that this incurs we could run it as often as every 5 minutes in order to have timely data.

The main issue is probably whether a random user such as sonar-runner is allowed to run sacct for jobs that don't belong to it. On Fox, running as myself, I can run sacct on jobs belonging to other users, but it's possible I'm in some group that allows that.
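A minimal sketch of that periodic job, assuming the permission question works out and assuming naic-monitor exposes something like a "jobs finished since last contact" endpoint (the URLs and endpoint names here are invented, not an existing API):

```bash
#!/bin/bash
# Hypothetical periodic collector, run on one cluster node (e.g. every 5 minutes).
set -euo pipefail

MONITOR=https://naic-monitor.example.no        # hypothetical server
CLUSTER=fox.educloud.no

# 1. Ask the server which finished jobs it has not yet seen (the server keeps the state).
jobs=$(curl -sf "$MONITOR/api/pending-slurm-jobs?cluster=$CLUSTER")
[ -z "$jobs" ] && exit 0

# 2. One sacct invocation for the whole comma-separated list.
data=$(sacct -j "$jobs" --json)

# 3. Exfiltrate the result; the server then records these jobs as done.
printf '%s' "$data" | curl -sf -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$MONITOR/api/slurm-job-data?cluster=$CLUSTER"
```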

lars-t-hansen commented 5 months ago

Is this more than curl to obtain a job list + sacct for each job in the list + curl to exfiltrate? The JSON appears to be totally self-contained and the outputs could just be concatenated. The JSON output is fairly large. An alternative to --json is -lp, i.e., long format but parseable; this prints a header line followed by data lines (one per job or step record); to omit the header, also add -n. We could print a header for the first job and not for subsequent ones. The output is poor man's CSV, with | terminating fields:

[ec-larstha@login-3 ~]$ sacct -j 281153 -lp
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
281153|281153|SAM|accel||||||||||||||||||4|00:00:12|FAILED|1:0||Unknown|Unknown|Unknown|64G|0|||||||||billing=19,cpu=4,gres/gpu=1,mem=64G,node=1|billing=19,cpu=4,gres/gpu:a100=1,gres/gpu=1,mem=64G,node=1||||||||||||||
281153.batch|281153.batch|batch||235432K|gpu-2|0|235432K|0|gpu-2|0|0|0|gpu-2|0|0|00:00:02|gpu-2|0|00:00:02|1|4|00:00:12|FAILED|1:0|734.14M|0|0|0||0|0|gpu-2|0|0|0|gpu-2|0|0||cpu=4,gres/gpu:a100=1,gres/gpu=1,mem=64G,node=1|cpu=00:00:02,energy=0,fs/disk=0,mem=0,pages=0,vmem=235432K|cpu=00:00:02,energy=0,fs/disk=0,mem=0,pages=0,vmem=235432K|cpu=gpu-2,energy=gpu-2,fs/disk=gpu-2,mem=gpu-2,pages=gpu-2,vmem=gpu-2|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:02,energy=0,fs/disk=0,mem=0,pages=0,vmem=235432K|cpu=gpu-2,energy=gpu-2,fs/disk=gpu-2,mem=gpu-2,pages=gpu-2,vmem=gpu-2|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:02,energy=0,fs/disk=0,mem=0,pages=0,vmem=235432K|energy=0,fs/disk=0|energy=gpu-2,fs/disk=gpu-2|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
281153.extern|281153.extern|extern||7316K|gpu-2|0|7316K|0|gpu-2|0|0|0|gpu-2|0|0|00:00:00|gpu-2|0|00:00:00|1|4|00:00:12|COMPLETED|0:0|2.20G|0|0|0||0|0.01M|gpu-2|0|0.01M|0|gpu-2|0|0||billing=19,cpu=4,gres/gpu:a100=1,gres/gpu=1,mem=64G,node=1|cpu=00:00:00,energy=0,fs/disk=5329,mem=0,pages=0,vmem=7316K|cpu=00:00:00,energy=0,fs/disk=5329,mem=0,pages=0,vmem=7316K|cpu=gpu-2,energy=gpu-2,fs/disk=gpu-2,mem=gpu-2,pages=gpu-2,vmem=gpu-2|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=5329,mem=0,pages=0,vmem=7316K|cpu=gpu-2,energy=gpu-2,fs/disk=gpu-2,mem=gpu-2,pages=gpu-2,vmem=gpu-2|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=5329,mem=0,pages=0,vmem=7316K|energy=0,fs/disk=0|energy=gpu-2,fs/disk=gpu-2|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|

It's possible the JSON has more data than that, e.g. the user name, the working directory, and other things. This opens the question of filtering: what data do we need? There is an sacct option (-o/--format) to select specific fields.
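A minimal sketch of the header-once idea, assuming we loop over a list of completed job IDs (this would be subsumed by passing all the IDs in a single invocation, as noted in the next comment):

```bash
# Hedged sketch: emit one header line, then data-only (-n) records for the rest.
# The job IDs are placeholders.
JOBS="281153 281201 281207"
first=1
for job in $JOBS; do
    if [ "$first" -eq 1 ]; then
        sacct -j "$job" -lp          # header + data for the first job
        first=0
    else
        sacct -j "$job" -lp -n       # data only for subsequent jobs
    fi
done > sacct-batch.psv               # '|'-separated output, ready to exfiltrate
```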

lars-t-hansen commented 5 months ago

Multiple job numbers can be given to sacct at the same time, so hopefully this is a single sacct invocation even for a long list of jobs.
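Under that assumption, the per-job loop sketched above collapses to a single call, something like the following (jobs.txt holding one job ID per line is an assumption, and the field list is illustrative):

```bash
# Hedged sketch: one sacct call for many jobs, selecting only the fields we want.
ids=$(paste -sd, jobs.txt)           # "281153,281201,281207,..."
sacct -j "$ids" -p \
      -o JobIDRaw,User,JobName,State,Elapsed,ExitCode,ReqTRES,AllocTRES,TRESUsageInTot
```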

bast commented 5 months ago

My plan for the revival of jobgraph was to get Slurm data locally on the cluster from sacct "directly" and to fetch the Sonar data from where they are stored.

But I can see the advantage of having the Slurm data close to the rest of the data for other analyses.

lars-t-hansen commented 2 months ago

Coming back to this now. We discussed this at the meeting in Åre. A contender for "how to get data" is to hook into the Slurm start and stop scripts and extract per-job information: at start time, what the job parameters look like; at end time, what the usage was (as seen from slurm). Possibly, while the job is running we can post data about its state, though that is what sonar is for so I'm not sure it's all that interesting.
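As a very rough sketch of the stop-script side, assuming we hang this off an epilog-style hook where SLURM_JOB_ID is in the environment, and assuming a hypothetical receiving endpoint on naic-monitor:

```bash
#!/bin/bash
# Hypothetical end-of-job hook (epilog-style); not an existing Jobanalyzer component.
# SLURM_JOB_ID is provided by Slurm in the prolog/epilog environment.
# Note: accounting data may still be settling at this point, so a later sacct pass
# might be needed anyway.
record=$(sacct -j "$SLURM_JOB_ID" -p -n \
         -o JobIDRaw,User,Account,JobName,State,ExitCode,Elapsed,ReqTRES,AllocTRES,TRESUsageInTot)
printf '%s\n' "$record" | curl -sf --data-binary @- \
    -H 'Content-Type: text/plain' \
    "https://naic-monitor.example.no/api/slurm-job?cluster=fox.educloud.no"   # hypothetical URL
```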

According to both Øystein and Andreas, really bad things happen if slurm IDs are reused, so we can probably assume (if we must) that that won't happen.

Transmission and storage: I think in keeping with the spirit of Sonar, the compute node or the slurm master node (whichever is most appropriate) just sends off (by curl) whatever data it has in some convenient structured format. On the server side, we ingest this into text files which we can later process. These will be separate from the sonar sample and sysinfo data. If job IDs are never reused then a shallow hierarchy under data/(cluster) that is keyed by e.g. job-id divided by 100 may be OK, i.e., data/fox.educloud.no/slurm-job/7303/730312.txt would have all data for that job, and we'd limit ourselves to 100 files per directory. (The value of "100" and the ".txt" suffix are TBD.) Obviously we could do multiple levels, say we divide by 1000 repeatedly: .../slurm-job/0/730/730312.txt, .../slurm-job/1/1936/1936732.txt.
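A minimal sketch of the server-side layout under those assumptions (the constants 100/1000 and the .txt suffix are placeholders, as noted):

```bash
# Hedged sketch: compute the sharded storage path for an incoming job record.
jobid=730312
cluster=fox.educloud.no

# One level, job-id divided by 100 => at most 100 files per directory:
dir="data/$cluster/slurm-job/$((jobid / 100))"        # data/fox.educloud.no/slurm-job/7303
mkdir -p "$dir"
cat >> "$dir/$jobid.txt"                              # append the record read from stdin

# Two levels, dividing by 1000 twice:
#   data/$cluster/slurm-job/$((jobid / 1000000))/$((jobid / 1000))/$jobid.txt
#   e.g. .../slurm-job/0/730/730312.txt
```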

Collection frequency: Assuming start/end job, for the time being.

Data we need, data we can have: Lots of notes from Åre. In addition to job resource request, usage, and exit code, we may want project number but also the modules loaded when the job started (presumably part of the job script?).

Parsing and extraction: I'm of two minds about whether to hook this into sonalyze jobs or add a sonalyze slurm command to extract slurm information for a given job. Maybe both. It would make sense for there to be fields emitted with sonalyze jobs that speak of the reservation size, exit code, project number, and details of total resource usage that slurm can know but sonar cannot. It might make less sense for eg the list of modules active at the start of the job to be emitted by that command, though on the other hand, why not? They are attributes of specific jobs, and the fact that slurm data are handled behind the scenes is hardly an argument for making things weird.

There would likely be some filtering by these new data fields: --failed (complementing the --running and --completed that we already have) for jobs with nonzero exit codes; --array / --stepped / --het for those types of jobs; --project to find all jobs for a project; --uses modulename to select jobs that used that module.
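Purely as hypothetical command lines (none of these options exist yet, and all flag spellings and values are placeholders), the usage might look like:

```bash
# Hypothetical sonalyze invocations sketching the proposed filters.
sonalyze jobs --failed --project PROJECT ...       # failed jobs for one project
sonalyze jobs --uses MODULENAME ...                # jobs that loaded a given module
sonalyze slurm --job JOBID ...                     # raw slurm record(s) for one job
```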

lars-t-hansen commented 2 months ago

We might want a sonalyze modules verb (or a sonalyze slurm --modules option) that rips through slurm data without worrying about sample data and extracts info about modules that were in use when batch jobs started. Ditto sonalyze slurm --projects could summarize data for individual projects. These would be almost trivial to write. We should only do this if the results are (a) useful and (b) not easily obtained with other slurm commands, though given that we are building up a historical database, it might be useful even if we duplicate functionality.

lars-t-hansen commented 2 months ago

There is now a branch for this, larstha-66-slurm, with evolving sketches.

lars-t-hansen commented 2 months ago

Jobs that completed between 9am and now with statistics, for example:

sacct -ap -o JobIDRaw,User,JobName,State,CPUTime,Elapsed,ExitCode,ReqCPUs,ReqMem,ReqNodes,AllocNodes,AllocCPUs,Start,End -S 09:00:00 -E now -s BF,CA,CD,DL,F,NF,OOM,PR,RV,TO

This is 25KB of data on Fox, but it gzips to < 2KB, so transmission overhead should not be an issue if this is done every hour, for the last hour's data.

We can similarly get a list of currently-running jobs. Really we're not interested in that; we're just interested in the jobs that started during the last hour (we don't care about ongoing stats). It would be possible to filter on the Start field.

Somewhat bizarrely, with -s R, sacct also lists COMPLETED jobs. They too can be filtered out before transmission.
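A sketch of the hourly push under these assumptions (the window arithmetic, GNU date, and the receiving URL are all placeholders):

```bash
#!/bin/bash
# Hypothetical hourly collection job: completed/failed jobs from the last hour,
# gzipped and POSTed to a (hypothetical) naic-monitor endpoint.
set -euo pipefail

START=$(date -d '1 hour ago' '+%Y-%m-%dT%H:%M:%S')   # GNU date

sacct -ap \
      -o JobIDRaw,User,JobName,State,CPUTime,Elapsed,ExitCode,ReqCPUs,ReqMem,ReqNodes,AllocNodes,AllocCPUs,Start,End \
      -S "$START" -E now \
      -s BF,CA,CD,DL,F,NF,OOM,PR,RV,TO \
  | gzip -c \
  | curl -sf --data-binary @- -H 'Content-Encoding: gzip' \
         "https://naic-monitor.example.no/api/sacct-data?cluster=fox.educloud.no"
```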

lars-t-hansen commented 2 months ago

Prototype code to postprocess and filter sacct data and format it as CSV for transport is in https://github.uio.no/larstha/uio-sandbox/tree/master/slurm.

Moved to the branch mentioned above: https://github.com/lars-t-hansen

lars-t-hansen commented 2 months ago

Not actually sure the storage format in https://github.com/NAICNO/Jobanalyzer/issues/66#issuecomment-2186299150 is the most sensible. sonalyze is fundamentally timeline based, and every operation needs to state the time window it is operating within. It is only when we talk about specific job IDs that the time window could in principle be ignored (and on the ML nodes it can't be, because job IDs can be reused). It does add friction if, when one wants to examine the records for a job, one has to know the job's time window within which to search, but I'm not sure this is as big a deal as it might look. To get to the point where we have a job ID to talk about, we often have to have gone through a time-based query to generate candidate job IDs.

It may simply be that slurm data received from the clusters should be timestamped (as the sample and sysinfo data are) and should be stored in the database under their timestamps, in the same way. Any indexing we need should be layered on top of that.

lars-t-hansen commented 1 month ago

Following the structure of the sonar data, we'd have $cluster/year/month/day/sacct-log.csv, for example, which is just appended to as data come in. The timestamp is in the record (v=...,time=...,data1=...,...). We'd have new ingest functionality under sonalyze add -sacct-info, and maybe also sonalyze sacct-info, but really, why bother with that?
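A minimal sketch of that ingest-side layout, assuming the record carries its own timestamp in the time=... field (the record syntax is as sketched above; the version number and the job=/user=/state= field names are placeholders):

```bash
# Hedged sketch: append an incoming sacct record under its own timestamp.
cluster=fox.educloud.no
record='v=0.1.0,time=2024-08-12T10:23:00Z,job=730312,user=ec-larstha,state=COMPLETED'

ts=${record#*time=}; ts=${ts%%,*}                    # pull the time= field out of the record
dir="data/$cluster/$(date -u -d "$ts" '+%Y/%m/%d')"  # e.g. data/fox.educloud.no/2024/08/12
mkdir -p "$dir"
printf '%s\n' "$record" >> "$dir/sacct-log.csv"
```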