lars-t-hansen opened this issue 1 year ago
Possibly this gives rise to a 'user' verb to analyze a user's behavior across systems, like we have a 'jobs' verb to examine jobs across users and a 'load' verb to examine systems across jobs.
This is also related to a use case UiT has, "rank the work of user X by load", and/or "rank system load by user".
I see this type of (ab)use fairly frequently now.
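For illustration only, such a verb might be invoked along the lines below; the 'user' verb does not exist in sonalyze today, and the flags and columns shown are assumptions modeled on the existing 'jobs' verb:

$ ./sonalyze user -u somebody --fmt=awk,host,cputime/sec,gputime/sec   # hypothetical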
Until then, something like this maybe:
$ ./sonalyze jobs -u- --no-gpu --fmt=awk,user,cmd,cputime/sec,cpu,mem,host | \
    awk '{ time[$1] += $3 } END { for (i in time) { print i, time[i] } }' | \
    sort -k 2nr
einarvid 2464587
annammc 1101219
tsauren 609000
daniehh 473049
jonaslsa 258318
niklase 224721
itf-ml-sw 140310
hermanno 87822
bendimol 34440
ksshawro 29109
magber 15624
krimhau 9951
mateuwa 8235
joachipo 7434
torsttho 6888
karths 6027
balintl 5715
haninm 3444
poyenyt 3444
ghadia 1722
ahmetyi 861
alsjur 861
adamjak 828
yanzho 432
sigurdkh 216
pubuduss 54
Better:
$ ./sonalyze jobs -u- --no-gpu --fmt=awk,user,cmd,cputime/sec,cpu,mem,host | awk -f blame.awk | sort -k 4nr
einarvid : python3 2432859
annammc : scripts.train 1096914
tsauren : linux_60.x86_64 603000
daniehh : python3 472188
jonaslsa : python_<defunct> 258300
niklase : python 224721
itf-ml-sw : bootstrap,cargo,cc1plus,gmake,python3,rust-installer,rustc 140283
hermanno : Linus_FoodColle 83517
ksshawro : falconsense,java,meryl,perl,sh_<defunct> 35280
bendimol : gcs_server,python,ray::CPUActor,ray::IDLE,raylet 34440
magber : python 15624
joachipo : python3 9888
krimhau : conda 9504
mateuwa : jupyter-lab 7293
torsttho : python3.9 6027
balintl : watch 4041
haninm : MATLAB 3444
karths : python 3444
poyenyt : wandb-service(2 3444
einarvid : mongod 3339
annammc : scripts.preproc 2583
hermanno : kited 2583
karths : jupyter-lab 2583
hermanno : wandb-service(2 1722
balintl : top 1662
ksshawro : conda 1218
where the script blame.awk is:
# sum cputime (field 3) for each "user : command" pair (fields 1 and 2)
{ procs[$1 " : " $2] += $3 }
END {
    for (j in procs)
        print j, procs[j]
}
I guess this is basically some type of policy: if a user's jobs that use no GPU use more than x% of a system's available CPU time over some period, then it's in violation of the policy. Possibly memory is equally important.
To be able to report this we need to know the jobs that go into each of the violations, too.
The "period" could be some sliding window, and we could set x at 10%: if a clutch of non-GPU jobs together uses more than 10% of the available CPU in some window, then the user is in violation. 10% is a fairly high bar; it would catch the problem mentioned above, but somebody who is just dinking around on a couple of cores at a time gets a pass, which is probably right. The use case doesn't have a story for this yet, but we can call it a "vampire", I guess. I'll update the use case.
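A minimal sketch of that check, assuming input lines of the form "user cputime_sec" restricted to one window (for instance, the output of the inline awk summation in the first pipeline above); WINDOW (window length in seconds) and CORES (CPU cores in the system) are illustrative values, not anything defined in this thread:

awk -v WINDOW=604800 -v CORES=64 '
    { cpu[$1] += $2 }                      # sum non-GPU cputime per user
    END {
        avail = WINDOW * CORES             # CPU-seconds available in the window
        for (u in cpu)
            if (cpu[u] > 0.10 * avail)     # the 10% policy threshold
                printf("%s: %d of %d cpu-seconds (%.1f%%)\n",
                       u, cpu[u], avail, 100 * cpu[u] / avail)
    }'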
Another success story (similar to https://github.com/NAICNO/Jobanalyzer/issues/55#issuecomment-1822939366):
$ target/release/sonalyze jobs \
    --auth-file ~/.ssh/sonalyzed-auth.txt \
    --cluster ml \
    --remote http://158.39.48.160:8087 \
    -u- \
    --fmt=awk,user,gputime/sec \
    -f16w \
    --host ml8 \
    --some-gpu | awk -f blame.awk | sort -k 2nr
where blame.awk is:
# sum the second field (here gputime) per user (first field)
{ procs[$1] += $2 }
END {
    for (j in procs)
        print j, procs[j]
}
yields a report of GPU time by user for the last 16 weeks on ml8. This data in turn led to the decision to move ml8 into Fox to ensure more equitable use of the expensive A100 hardware.
This needs to be packaged up somehow, but for a summary-across-jobs view it does not need to be part of sonalyze (though clearly it could be, given the very limited postprocessing needed). More likely we stick it into naicreport, and then naicreport will need a remoting capability soon.
Use case captured as scripts/user-total-load.sh, since this was easier than pushing it into naicreport at the moment. Also added scripts/worst-violators.sh to list the worst abusers of the system (most cputime in jobs that used no GPU).
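For reference, a plausible reconstruction of the core of worst-violators.sh, pieced together from the commands earlier in this thread; the script in the repository is authoritative and may differ:

#!/bin/bash
# total cputime per user across jobs that used no GPU, worst first
./sonalyze jobs -u- --no-gpu --fmt=awk,user,cputime/sec |
    awk '{ cpu[$1] += $2 } END { for (u in cpu) print u, cpu[u] }' |
    sort -k 2nr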
The problem with the characterization above is that there are legitimate peaks of high CPU usage, such as compilation, that would be caught by it. Something subtler might be needed. On the other hand, the existing cpuhog analysis probably triggers on a big compilation job too.
Here's something new:
I need to look at this in some detail, but here we have a large number of jobs, each running for a few hours on one core, not using the GPU, and overlapping significantly. See, for example, all the jobs started at 15:05. If these were all one job, it would fall under the "cpuhog" rule, i.e., using a lot of CPU and no GPU. But because they are separate jobs, they would not be seen by that search (the cpuhog search requires that at least 10% of the CPUs be used).
Once cpuhog policies are up and running, there may be deliberate attempts at subverting them, and those may also look like this. It would be nice to have some way of detecting it either way.
(It would also be interesting to find out why these are all different jobs, but it could be legitimate. glide_backend appears to be some biotech code, see https://www.schrodinger.com/.)
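One possible detector, sketched under assumptions not established in this thread: judge a user's jobs together by summing the cores of all of that user's concurrently running non-GPU jobs at regular sample points, so that a clutch of small overlapping jobs is measured as if it were one job. Input lines are assumed to be "user start end cores" with epoch-second timestamps; STEP and CORES are illustrative values.

awk -v STEP=300 -v CORES=64 '
    {
        # record each job: owner, start/end (epoch seconds), cores used
        n = NR; u[n] = $1; s[n] = $2; e[n] = $3; c[n] = $4
        if (n == 1 || $2 < tmin) tmin = $2
        if (n == 1 || $3 > tmax) tmax = $3
    }
    END {
        # at every sample point, sum cores of jobs running per user
        for (t = tmin; t <= tmax; t += STEP) {
            split("", busy)                # clear per-sample totals
            for (i = 1; i <= n; i++)
                if (s[i] <= t && t < e[i])
                    busy[u[i]] += c[i]
            for (user in busy)
                if (busy[user] >= 0.10 * CORES && !(user in flagged)) {
                    flagged[user] = 1      # report each user once
                    print user, "crosses the 10% cpuhog bar at time", t
                }
        }
    }'

Unlike the per-job cpuhog check, this would flag the 15:05 clutch of one-core jobs, because cores are summed per user rather than per job.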