Jobstats is a free and open-source job monitoring platform designed for CPU and GPU clusters that use the Slurm workload manager.
The jobstats command provides users with a job efficiency report:
$ jobstats 39798795
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 39798795
NetID/Account: aturing/math
Job Name: sys_logic_ordinals
State: COMPLETED
Nodes: 2
CPU Cores: 48
CPU Memory: 256GB (5.3GB per CPU-core)
GPUs: 4
QOS/Partition: della-gpu/gpu
Cluster: della
Start Time: Fri Mar 4, 2022 at 1:56 AM
Run Time: 18:41:56
Time Limit: 4-00:00:00
Overall Utilization
================================================================================
CPU utilization [||||| 10%]
CPU memory usage [||| 6%]
GPU utilization [|||||||||||||||||||||||||||||||||| 68%]
GPU memory usage [||||||||||||||||||||||||||||||||| 66%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%
CPU memory usage per node - used/allocated
della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)
GPU utilization per node
della-i14g2 (GPU 0): 65.7%
della-i14g2 (GPU 1): 64.5%
della-i14g3 (GPU 0): 72.9%
della-i14g3 (GPU 1): 67.5%
GPU memory usage per node - maximum used/total
della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)
Notes
================================================================================
* This job only used 6% of the 256GB of total allocated CPU memory. For
future jobs, please allocate less memory by using a Slurm directive such
as --mem-per-cpu=1G or --mem=10G. This will reduce your queue times and
make the resources available to other users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/memory
* This job only needed 19% of the requested time which was 4-00:00:00. For
future jobs, please request less time by modifying the --time Slurm
directive. This will lower your queue times and allow the Slurm job
scheduler to work more effectively for all users. For more info:
https://researchcomputing.princeton.edu/support/knowledge-base/slurm
* For additional job metrics including metrics plotted against time:
https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)
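The percentages in the report above can be checked by hand. The following is a minimal sketch, not the Jobstats implementation: it assumes durations always appear in the `[D-]HH:MM:SS` form shown in the report, and the helper names `slurm_duration_to_seconds` and `percent` are hypothetical. It illustrates that the per-node denominator (18-16:46:24) equals the run time multiplied by the 24 allocated cores, and that the 19% in the Notes section is the run time divided by the time limit.

```python
import re


def slurm_duration_to_seconds(text: str) -> int:
    """Convert a duration of the form [D-]HH:MM:SS to seconds.

    Assumption: only the format seen in the sample report is handled.
    """
    match = re.fullmatch(r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def percent(used: str, available: str) -> float:
    """Return used/available as a percentage."""
    return 100.0 * slurm_duration_to_seconds(used) / slurm_duration_to_seconds(available)


# Values copied from the sample report.
run_time = "18:41:56"
cores_per_node = 24

# Core-time available on one node = run time x allocated cores on that node.
available = slurm_duration_to_seconds(run_time) * cores_per_node
print(available == slurm_duration_to_seconds("18-16:46:24"))

# CPU efficiency per node (CPU time used / core-time available).
print(round(percent("1-21:41:20", "18-16:46:24"), 1))  # della-i14g2
print(round(percent("1-18:48:55", "18-16:46:24"), 1))  # della-i14g3

# Fraction of the requested time limit that the job actually needed.
print(round(percent(run_time, "4-00:00:00")))
```

Run against the sample values, this prints `True`, `10.2`, `9.5` and `19`, matching the figures in the report.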
To get started, read "What is Jobstats?" in the documentation.