Consider a switch --breakdown= which (for jobs, anyway) has the possible keyword arguments host and command. In the absence of this switch, jobs are rolled up across commands and hosts, as now, and output like this would be typical:
$ sonalyze jobs --job=281495 -f8d -b -- ../tmp/fox-logs/*.out
jobm user duration host cpu-avg cpu-peak mem-avg mem-peak gpu-avg gpu-peak gpumem-avg gpumem-peak cmd
281495< ec-larstha 0d 0h 2m c1-[5,8-10,12-14,20-21,23-24,26-28] 1216 1400 1 1 0 0 0 0 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
This is a job that ran five commands on 14 hosts. But suppose we break down by command (this is simulated):
$ sonalyze jobs --job=281495 -f8d -b --breakdown=cmd --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job host cpu-avg cpu-peak cmd
281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1216 1400 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1213 1395 tsp_mpi
* 281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1 1 srun
* 281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1 1 slurm_script
* 281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1 1 sonarloop.sh
* 281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1 1 sonar
or by host:
$ sonalyze jobs --job=281495 -f8d -b --breakdown=host --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job host cpu-avg cpu-peak cmd
281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1216 1400 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495 c1-5 100 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495 c1-8 99 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495 c1-9 99 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* ...
or by host first and command second (the reverse order is also an option):
$ sonalyze jobs --job=281495 -f8d -b --breakdown=host,cmd --fmt=header,job,host,cpu,cmd -- ../tmp/fox-logs/*.out
job host cpu-avg cpu-peak cmd
281495 c1-[5,8-10,12-14,20-21,23-24,26-28] 1216 1400 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
* 281495 c1-5 100 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** 281495 c1-5 99 100 tsp_mpi
** 281495 c1-5 1 1 srun
** ...
* 281495 c1-8 99 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** ...
* 281495 c1-9 99 100 srun,slurm_script,sonarloop.sh,tsp_mpi,sonar
** ...
* ...
(TODO: On the one hand there is --command, on the other hand the format specifier cmd, leading to confusion above. We should pick one and stick with it.)
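To make the implied grouping concrete, here is a minimal sketch of how a nested breakdown over a list of keys could work. The names (Sample, Key, breakdown) are invented for illustration and are not sonalyze's actual internals; the point is simply one roll-up per breakdown key, with one more * of prefix per nesting level.

```rust
// Minimal sketch of nested --breakdown aggregation. Hypothetical types, not
// sonalyze's real data model: each sample carries the host and command it
// came from, a breakdown key selects one of those fields, and recursing over
// the key list yields one sub-row per distinct value at each level.
use std::collections::BTreeMap;

#[derive(Clone)]
struct Sample {
    host: String,
    command: String,
    cpu_pct: f64, // CPU utilization of this sample, in percent
}

#[derive(Clone, Copy)]
enum Key {
    Host,
    Command,
}

fn key_of(s: &Sample, k: Key) -> &str {
    match k {
        Key::Host => &s.host,
        Key::Command => &s.command,
    }
}

// Average and peak CPU over a set of samples (the rolled-up figures).
fn roll_up(samples: &[Sample]) -> (f64, f64) {
    let avg = samples.iter().map(|s| s.cpu_pct).sum::<f64>() / samples.len() as f64;
    let peak = samples.iter().map(|s| s.cpu_pct).fold(0.0, f64::max);
    (avg, peak)
}

// Print the roll-up for `samples`, then recurse on the remaining breakdown keys.
fn breakdown(samples: &[Sample], keys: &[Key], level: usize, label: &str) {
    let (avg, peak) = roll_up(samples);
    println!("{:<3} {:<10} {:>8.0} {:>8.0}", "*".repeat(level), label, avg, peak);
    if let Some((&k, rest)) = keys.split_first() {
        let mut groups: BTreeMap<String, Vec<Sample>> = BTreeMap::new();
        for s in samples {
            groups.entry(key_of(s, k).to_string()).or_default().push(s.clone());
        }
        for (value, group) in &groups {
            breakdown(group, rest, level + 1, value);
        }
    }
}

fn main() {
    let samples = vec![
        Sample { host: "c1-5".into(), command: "tsp_mpi".into(), cpu_pct: 87.0 },
        Sample { host: "c1-5".into(), command: "srun".into(), cpu_pct: 1.0 },
        Sample { host: "c1-8".into(), command: "tsp_mpi".into(), cpu_pct: 88.0 },
    ];
    // --breakdown=host,cmd: job roll-up first, then per host, then per command.
    breakdown(&samples, &[Key::Host, Key::Command], 0, "281495");
}
```

Against those three hand-written samples this prints a job roll-up row, then a row per host, then a row per command under each host, mirroring the --breakdown=host,cmd example above (the real tool would of course also aggregate duration, memory, GPU figures, and so on).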
This is working on my branch w-breakdown; it just needs some better sorting. Specifically, consider this breakdown by command and host (which is going to be most important for analyzing load balancing):
$ target/debug/sonalyze jobs -f4w -u ec-larstha -b --breakdown=command,host --fmt=level,std,cpu,mem,cmd -- tmp/sonarlog-*.out
level jobm user duration host cpu-avg cpu-peak mem-avg mem-peak cmd
281495< ec-larstha 0d 0h 2m c1-[5,8-10,12-14,20-21,23-24,26-28] 1219 1400 2 2 sonarloop.sh,srun,sonar,slurm_script,tsp_mpi
* 281495< ec-larstha 0d 0h 2m c1-5 3 21 1 1 slurm_script
** 281495< ec-larstha 0d 0h 2m c1-5 3 21 1 1 slurm_script
* 281495< ec-larstha 0d 0h 2m c1-[5,8-10,12-14,20-21,23-24,26-28] 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-10 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-12 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-13 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-14 0 0 1 1 sonar
** 281495! ec-larstha 0d 0h 2m c1-20 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-21 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-23 0 0 1 1 sonar
** 281495! ec-larstha 0d 0h 2m c1-24 0 0 1 1 sonar
** 281495! ec-larstha 0d 0h 2m c1-26 0 0 1 1 sonar
** 281495! ec-larstha 0d 0h 2m c1-27 0 0 1 1 sonar
** 281495! ec-larstha 0d 0h 2m c1-28 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-5 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-8 0 0 1 1 sonar
** 281495< ec-larstha 0d 0h 2m c1-9 0 0 1 1 sonar
* 281495< ec-larstha 0d 0h 2m c1-[5,8-10,12-14,20-21,23-24,26-28] 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-10 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-12 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-13 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-14 0 0 1 1 sonarloop.sh
** 281495! ec-larstha 0d 0h 2m c1-20 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-21 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-23 0 0 1 1 sonarloop.sh
** 281495! ec-larstha 0d 0h 2m c1-24 0 0 1 1 sonarloop.sh
** 281495! ec-larstha 0d 0h 2m c1-26 0 0 1 1 sonarloop.sh
** 281495! ec-larstha 0d 0h 2m c1-27 0 0 1 1 sonarloop.sh
** 281495! ec-larstha 0d 0h 2m c1-28 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-5 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-8 0 0 1 1 sonarloop.sh
** 281495< ec-larstha 0d 0h 2m c1-9 0 0 1 1 sonarloop.sh
* 281495< ec-larstha 0d 0h 2m c1-5 1 1 1 1 srun
** 281495< ec-larstha 0d 0h 2m c1-5 1 1 1 1 srun
* 281495< ec-larstha 0d 0h 2m c1-[5,8-10,12-14,20-21,23-24,26-28] 1216 1400 2 2 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-10 89 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-12 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-13 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-14 87 100 1 1 tsp_mpi
** 281495! ec-larstha 0d 0h 2m c1-20 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-21 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-23 87 100 1 1 tsp_mpi
** 281495! ec-larstha 0d 0h 2m c1-24 90 100 1 1 tsp_mpi
** 281495! ec-larstha 0d 0h 2m c1-26 86 100 1 1 tsp_mpi
** 281495! ec-larstha 0d 0h 2m c1-27 87 100 1 1 tsp_mpi
** 281495! ec-larstha 0d 0h 2m c1-28 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-5 87 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-8 88 100 1 1 tsp_mpi
** 281495< ec-larstha 0d 0h 2m c1-9 87 100 1 1 tsp_mpi
Here we probably want the commands sorted in decreasing order of (average? peak?) CPU consumption, at least as a default, and then, under each command, the hosts probably sorted by decreasing (average? peak?) CPU usage. Actually, any predictable sort is better than none. Underneath that, hosts should be sorted by the normal host sorting order (currently lexicographic).
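As a concrete default, something along these lines could work (a sketch with hypothetical types, not the actual sonalyze code): command-level rows sorted by decreasing average CPU, host-level rows by the host name order.

```rust
// Sketch of a possible default ordering for breakdown sub-rows; SubRow and
// sort_level are invented names, not sonalyze's real types.
struct SubRow {
    host: String,
    command: String,
    cpu_avg: f64,
    cpu_peak: f64,
}

// Sort the sub-rows produced for one breakdown level.
fn sort_level(rows: &mut [SubRow], by_command: bool) {
    if by_command {
        // Busiest commands first; fall back to the name for a stable, predictable order.
        rows.sort_by(|a, b| {
            b.cpu_avg
                .partial_cmp(&a.cpu_avg)
                .unwrap_or(std::cmp::Ordering::Equal)
                .then_with(|| a.command.cmp(&b.command))
        });
    } else {
        // "Normal" host order; plain lexicographic here, as in the current code.
        rows.sort_by(|a, b| a.host.cmp(&b.host));
    }
}

fn main() {
    let mut rows = vec![
        SubRow { host: "c1-5".into(), command: "srun".into(), cpu_avg: 1.0, cpu_peak: 1.0 },
        SubRow { host: "c1-5".into(), command: "tsp_mpi".into(), cpu_avg: 87.0, cpu_peak: 100.0 },
    ];
    sort_level(&mut rows, true); // tsp_mpi now sorts before srun
    for r in &rows {
        println!("{} {} {:.0} {:.0}", r.host, r.command, r.cpu_avg, r.cpu_peak);
    }
}
```

Whether the key should be cpu-avg or cpu-peak is an open question; either would satisfy the "predictable sort" requirement.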
Fixed along with #24, just now.
Some of the behavior in #45 (now fixed) is mysterious because multiple processes are rolled together into a job. Sometimes it is desirable to break a multi-process job down by process, in the same way that a multi-host job can be broken down by host: only then can we easily see where to focus next (by excluding process names, focusing on specific process IDs or process names, or similar).
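If this were added, it could presumably reuse the same breakdown machinery with the process ID (or process name) as one more grouping key. A rough sketch, again with invented names rather than sonalyze's real data model:

```rust
// Rough sketch: break one job's samples down per process by grouping on pid,
// analogous to the per-host breakdown above. Sample and its fields are
// invented for illustration.
use std::collections::BTreeMap;

struct Sample {
    pid: u32,
    command: String,
    cpu_pct: f64,
}

fn main() {
    let samples = vec![
        Sample { pid: 4001, command: "tsp_mpi".into(), cpu_pct: 87.0 },
        Sample { pid: 4002, command: "tsp_mpi".into(), cpu_pct: 91.0 },
        Sample { pid: 3990, command: "srun".into(), cpu_pct: 1.0 },
    ];

    // One sub-row per process: the command plus the CPU samples for that pid.
    let mut per_process: BTreeMap<u32, (String, Vec<f64>)> = BTreeMap::new();
    for s in &samples {
        let entry = per_process
            .entry(s.pid)
            .or_insert_with(|| (s.command.clone(), Vec::new()));
        entry.1.push(s.cpu_pct);
    }

    for (pid, (command, cpu)) in &per_process {
        let avg = cpu.iter().sum::<f64>() / cpu.len() as f64;
        let peak = cpu.iter().cloned().fold(0.0, f64::max);
        println!("* {:>6} {:<12} {:>6.0} {:>6.0}", pid, command, avg, peak);
    }
}
```

Filtering would then compose naturally with this: exclude or select by process name before grouping, or by pid afterwards.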