NAICNO / Jobanalyzer

Easy to use resource usage report
MIT License

Breakdowns by host and process #47

Closed lars-t-hansen closed 1 year ago

lars-t-hansen commented 1 year ago

Some of the behavior in #45 (now fixed) was mysterious because multiple processes are rolled together into a single job. It is sometimes desirable to break a multi-process job down by process, in the same way that a multi-host job can be broken down by host. Only then can we easily see where to focus next, for example by excluding process names or by focusing on specific process IDs or process names.

lars-t-hansen commented 1 year ago

Consider a switch --breakdown= which (for jobs, anyway) takes the keyword arguments host and command. In the absence of this switch, jobs are rolled up across commands and hosts, as now, and output like this would be typical:

$ sonalyze jobs --job=281495 -f8d -b -- ../tmp/fox-logs/*.out
jobm     user        duration  host                                 cpu-avg  cpu-peak  mem-avg  mem-peak  gpu-avg  gpu-peak  gpumem-avg  gpumem-peak  cmd                                           
281495<  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  1216      1400       1        1         0        0         0           0            srun,slurm_script,sonarloop.sh,tsp_mpi,sonar  

This is a job that ran five commands on 14 hosts. But suppose we break it down by command (this is simulated):

$ sonalyze jobs --job=281495 -f8d -b --breakdown=cmd --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job       host                                 cpu-avg  cpu-peak   cmd                                           
281495    c1-[5,8-10,12-14,20-21,23-24,26-28]  1216     1400       srun,slurm_script,sonarloop.sh,tsp_mpi,sonar  
* 281495  c1-[5,8-10,12-14,20-21,23-24,26-28]  1213     1395       tsp_mpi  
* 281495  c1-[5,8-10,12-14,20-21,23-24,26-28]     1        1       srun
* 281495  c1-[5,8-10,12-14,20-21,23-24,26-28]     1        1       slurm_script  
* 281495  c1-[5,8-10,12-14,20-21,23-24,26-28]     1        1       sonarloop.sh  
* 281495  c1-[5,8-10,12-14,20-21,23-24,26-28]     1        1       sonar  

or by host:

$ sonalyze jobs --job=281495 -f8d -b --breakdown=host --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job       host                                 cpu-avg  cpu-peak   cmd                                           
281495    c1-[5,8-10,12-14,20-21,23-24,26-28]  1216     1400       srun,slurm_script,sonarloop.sh,tsp_mpi,sonar  
* 281495  c1-5                                 100      100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495  c1-8                                 99       100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495  c1-9                                 99       100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* ...

or by host first and command second (the reverse order is also an option):

$ sonalyze jobs --job=281495 -f8d -b --breakdown=host,cmd --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job       host                                 cpu-avg  cpu-peak   cmd                                           
281495    c1-[5,8-10,12-14,20-21,23-24,26-28]  1216     1400       srun,slurm_script,sonarloop.sh,tsp_mpi,sonar  
* 281495  c1-5                                 100      100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** 281495 c1-5                                 99       100        tsp_mpi
** 281495 c1-5                                 1        1          srun
** ...
* 281495  c1-8                                 99       100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** ...
* 281495  c1-9                                 99       100        srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** ...
* ...

(TODO: On the one hand there is --command, on the other hand the format specifier cmd, leading to confusion above. We should pick one and stick with it.)
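
The nested breakdown proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how sonalyze implements it: the sample tuples, the `FIELD` mapping, and the `breakdown` function are all hypothetical names, the numbers are made up, and summing per-process CPU averages to get a group's figure is an assumption.

```python
from collections import defaultdict

# Hypothetical per-process samples: (host, command, cpu_avg).
# The numbers are illustrative, not taken from real logs.
SAMPLES = [
    ("c1-5", "tsp_mpi", 99),
    ("c1-5", "srun", 1),
    ("c1-8", "tsp_mpi", 99),
    ("c1-9", "tsp_mpi", 99),
]

# Position of each breakdown keyword within a sample tuple.
FIELD = {"host": 0, "cmd": 1}


def breakdown(samples, keys, level=0):
    """Return (level, hosts, cmds, cpu_total) rows.

    The first row rolls up all samples; if any keys remain, it is followed
    by one nested group per distinct value of keys[0].
    """
    hosts = sorted({s[0] for s in samples})
    cmds = sorted({s[1] for s in samples})
    # Assumption: a group's CPU figure is the sum over its processes.
    rows = [(level, hosts, cmds, sum(s[2] for s in samples))]
    if keys:
        groups = defaultdict(list)
        for s in samples:
            groups[s[FIELD[keys[0]]]].append(s)
        for value in sorted(groups):
            rows.extend(breakdown(groups[value], keys[1:], level + 1))
    return rows


rows = breakdown(SAMPLES, ["host", "cmd"])
for level, hosts, cmds, cpu in rows:
    # One star per nesting level, mirroring the simulated output above.
    print("*" * level, ",".join(hosts), cpu, ",".join(cmds))
```

Passing `["cmd", "host"]` instead gives the command-first nesting, and an empty key list gives only the rolled-up row.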

lars-t-hansen commented 1 year ago

This is working on my branch w-breakdown; it just needs better sorting. Specifically, consider this breakdown by command and host (which is going to be most important for analyzing load balancing):

$ target/debug/sonalyze jobs -f4w -u ec-larstha -b --breakdown=command,host --fmt=level,std,cpu,mem,cmd -- tmp/sonarlog-*.out
level  jobm     user        duration  host                                 cpu-avg  cpu-peak  mem-avg  mem-peak  cmd                                           
       281495<  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  1219     1400      2        2         sonarloop.sh,srun,sonar,slurm_script,tsp_mpi  
*      281495<  ec-larstha  0d 0h 2m  c1-5                                 3        21        1        1         slurm_script                                  
**     281495<  ec-larstha  0d 0h 2m  c1-5                                 3        21        1        1         slurm_script                                  
*      281495<  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-10                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-12                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-13                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-14                                0        0         1        1         sonar                                         
**     281495!  ec-larstha  0d 0h 2m  c1-20                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-21                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-23                                0        0         1        1         sonar                                         
**     281495!  ec-larstha  0d 0h 2m  c1-24                                0        0         1        1         sonar                                         
**     281495!  ec-larstha  0d 0h 2m  c1-26                                0        0         1        1         sonar                                         
**     281495!  ec-larstha  0d 0h 2m  c1-27                                0        0         1        1         sonar                                         
**     281495!  ec-larstha  0d 0h 2m  c1-28                                0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-5                                 0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-8                                 0        0         1        1         sonar                                         
**     281495<  ec-larstha  0d 0h 2m  c1-9                                 0        0         1        1         sonar                                         
*      281495<  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-10                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-12                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-13                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-14                                0        0         1        1         sonarloop.sh                                  
**     281495!  ec-larstha  0d 0h 2m  c1-20                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-21                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-23                                0        0         1        1         sonarloop.sh                                  
**     281495!  ec-larstha  0d 0h 2m  c1-24                                0        0         1        1         sonarloop.sh                                  
**     281495!  ec-larstha  0d 0h 2m  c1-26                                0        0         1        1         sonarloop.sh                                  
**     281495!  ec-larstha  0d 0h 2m  c1-27                                0        0         1        1         sonarloop.sh                                  
**     281495!  ec-larstha  0d 0h 2m  c1-28                                0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-5                                 0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-8                                 0        0         1        1         sonarloop.sh                                  
**     281495<  ec-larstha  0d 0h 2m  c1-9                                 0        0         1        1         sonarloop.sh                                  
*      281495<  ec-larstha  0d 0h 2m  c1-5                                 1        1         1        1         srun                                          
**     281495<  ec-larstha  0d 0h 2m  c1-5                                 1        1         1        1         srun                                          
*      281495<  ec-larstha  0d 0h 2m  c1-[5,8-10,12-14,20-21,23-24,26-28]  1216     1400      2        2         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-10                                89       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-12                                87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-13                                87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-14                                87       100       1        1         tsp_mpi                                       
**     281495!  ec-larstha  0d 0h 2m  c1-20                                87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-21                                87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-23                                87       100       1        1         tsp_mpi                                       
**     281495!  ec-larstha  0d 0h 2m  c1-24                                90       100       1        1         tsp_mpi                                       
**     281495!  ec-larstha  0d 0h 2m  c1-26                                86       100       1        1         tsp_mpi                                       
**     281495!  ec-larstha  0d 0h 2m  c1-27                                87       100       1        1         tsp_mpi                                       
**     281495!  ec-larstha  0d 0h 2m  c1-28                                87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-5                                 87       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-8                                 88       100       1        1         tsp_mpi                                       
**     281495<  ec-larstha  0d 0h 2m  c1-9                                 87       100       1        1         tsp_mpi                                       

Here we probably want the commands sorted in decreasing order of (average? peak?) CPU consumption, at least as a default, and then, within each command, the hosts sorted by decreasing (average? peak?) CPU usage. Actually, any predictable sort is better than none. Underneath that, hosts should be sorted by the normal host sorting order (currently lexicographic).
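
One possible ordering, sketched in Python purely as an illustration (this is not the sorting that ended up in sonalyze, and the row layout and function names are hypothetical): command groups by decreasing total average CPU, then hosts within a command by decreasing average CPU, with the host name as the final tie-breaker. Using the average rather than the peak is an arbitrary choice here. The default tie-breaker is plain lexicographic order, as described above; `natural_key` is an optional alternative that compares the numeric parts of host names as numbers.

```python
import re

# Hypothetical breakdown rows: (command, host, cpu_avg, cpu_peak).
ROWS = [
    ("sonar", "c1-10", 0, 0),
    ("sonar", "c1-5", 0, 0),
    ("tsp_mpi", "c1-10", 89, 100),
    ("tsp_mpi", "c1-5", 87, 100),
    ("tsp_mpi", "c1-24", 90, 100),
    ("srun", "c1-5", 1, 1),
]


def natural_key(host):
    """Split 'c1-10' into text and number parts so that 5 sorts before 10.

    Each part becomes a tuple, so numbers and text never get compared
    directly with each other.
    """
    parts = [p for p in re.split(r"(\d+)", host) if p]
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts]


def sort_breakdown(rows, host_key=str):
    """Sort rows by command group (heaviest first), then by host."""
    # Total average CPU per command decides the order of the command groups.
    totals = {}
    for cmd, _host, avg, _peak in rows:
        totals[cmd] = totals.get(cmd, 0) + avg
    return sorted(
        rows,
        # Command name keeps groups with equal totals together and ordered.
        key=lambda r: (-totals[r[0]], r[0], -r[2], host_key(r[1])),
    )


for row in sort_breakdown(ROWS):
    print(row)
```

With the lexicographic default, the two idle `sonar` rows come out as c1-10 then c1-5, which matches the order seen in the output above; `sort_breakdown(ROWS, host_key=natural_key)` puts c1-5 first.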

lars-t-hansen commented 1 year ago

Fixed along with #24, just now.