lars-t-hansen opened this issue 1 year ago
Possibly this gives rise to a 'user' verb to analyze a user's behavior across systems, like we have a 'jobs' verb to examine jobs across users and a 'load' verb to examine systems across jobs.
This is also related to a use case UiT has, "rank the work of user X by load", and/or "rank system load by user".
I see this type of (ab)use fairly frequently now.
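For illustration only, such a verb might be invoked along the lines below; the 'user' verb does not exist in sonalyze today, and the flags and columns shown are assumptions modeled on the existing 'jobs' verb:

$ ./sonalyze user -u somebody --fmt=awk,host,cputime/sec,gputime/sec   # hypothetical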
Until then, something like this maybe:
$ ./sonalyze jobs -u- --no-gpu --fmt=awk,user,cmd,cputime/sec,cpu,mem,host | \
    awk '{ time[$1] += $3 } END { for (i in time) { print i, time[i] } }' | \
    sort -k 2nr
einarvid 2464587
annammc 1101219
tsauren 609000
daniehh 473049
jonaslsa 258318
niklase 224721
itf-ml-sw 140310
hermanno 87822
bendimol 34440
ksshawro 29109
magber 15624
krimhau 9951
mateuwa 8235
joachipo 7434
torsttho 6888
karths 6027
balintl 5715
haninm 3444
poyenyt 3444
ghadia 1722
ahmetyi 861
alsjur 861
adamjak 828
yanzho 432
sigurdkh 216
pubuduss 54
Better:
$ ./sonalyze jobs -u- --no-gpu --fmt=awk,user,cmd,cputime/sec,cpu,mem,host | awk -f blame.awk | sort -k 4nr
einarvid : python3 2432859
annammc : scripts.train 1096914
tsauren : linux_60.x86_64 603000
daniehh : python3 472188
jonaslsa : python_<defunct> 258300
niklase : python 224721
itf-ml-sw : bootstrap,cargo,cc1plus,gmake,python3,rust-installer,rustc 140283
hermanno : Linus_FoodColle 83517
ksshawro : falconsense,java,meryl,perl,sh_<defunct> 35280
bendimol : gcs_server,python,ray::CPUActor,ray::IDLE,raylet 34440
magber : python 15624
joachipo : python3 9888
krimhau : conda 9504
mateuwa : jupyter-lab 7293
torsttho : python3.9 6027
balintl : watch 4041
haninm : MATLAB 3444
karths : python 3444
poyenyt : wandb-service(2 3444
einarvid : mongod 3339
annammc : scripts.preproc 2583
hermanno : kited 2583
karths : jupyter-lab 2583
hermanno : wandb-service(2 1722
balintl : top 1662
ksshawro : conda 1218
where the script blame.awk is:
# sum cputime (field 3) for each "user : command" pair (fields 1 and 2)
{ procs[$1 " : " $2] += $3 }
END {
    for (j in procs)
        print j, procs[j]
}
I guess this is basically some type of policy: if a user's jobs that use no GPU use more than x% of a system's available CPU time over some period, then it's in violation of the policy. Possibly memory is equally important.
To be able to report this we need to know the jobs that go into each of the violations, too.
The "period" could be some sliding window, and we could set x at 10%: if a clutch of non-GPU jobs together uses more than 10% of the available CPU in some window, then the user is in violation. 10% is a fairly high bar; it would catch the problem mentioned above, but somebody who is just dinking around on a couple of cores at a time gets a pass, which is probably right. The use case doesn't have a story for this yet, but we can call it a "vampire", I guess. I'll update the use case.
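A minimal sketch of that check, assuming input lines of the form "user cputime_sec" restricted to one window (for instance, the output of the inline awk summation in the first pipeline above); WINDOW (window length in seconds) and CORES (CPU cores in the system) are illustrative values, not anything defined in this thread:

awk -v WINDOW=604800 -v CORES=64 '
    { cpu[$1] += $2 }                      # sum non-GPU cputime per user
    END {
        avail = WINDOW * CORES             # CPU-seconds available in the window
        for (u in cpu)
            if (cpu[u] > 0.10 * avail)     # the 10% policy threshold
                printf("%s: %d of %d cpu-seconds (%.1f%%)\n",
                       u, cpu[u], avail, 100 * cpu[u] / avail)
    }'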
Another success story (similar to https://github.com/NAICNO/Jobanalyzer/issues/55#issuecomment-1822939366):
$ target/release/sonalyze jobs \
    --auth-file ~/.ssh/sonalyzed-auth.txt \
    --cluster ml \
    --remote http://158.39.48.160:8087 \
    -u- \
    --fmt=awk,user,gputime/sec \
    -f16w \
    --host ml8 \
    --some-gpu | awk -f blame.awk | sort -k 2nr
where blame.awk is:
# sum the second field (here gputime) per user (first field)
{ procs[$1] += $2 }
END {
    for (j in procs)
        print j, procs[j]
}
yields a report of GPU time by user for the last 16 weeks on ml8. This data in turn led to the decision to move ml8 into Fox to ensure more equitable use of the expensive A100 hardware.
This needs to be packaged up somehow, but for a summary-across-jobs view it does not need to be part of sonalyze (though clearly it could be, given the very limited postprocessing needed). More likely we stick it into naicreport, and then naicreport will need a remoting capability soon.
Use case captured as scripts/user-total-load.sh, since this was easier than pushing it into naicreport at the moment. Also added scripts/worst-violators.sh to list the worst abusers of the system (most cputime in jobs that used no GPU).
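For reference, a plausible reconstruction of the core of worst-violators.sh, pieced together from the commands earlier in this thread; the script in the repository is authoritative and may differ:

#!/bin/bash
# total cputime per user across jobs that used no GPU, worst first
./sonalyze jobs -u- --no-gpu --fmt=awk,user,cputime/sec |
    awk '{ cpu[$1] += $2 } END { for (u in cpu) print u, cpu[u] }' |
    sort -k 2nr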
The problem with the characterization above is that there are legitimate peaks of high CPU usage, such as compilation, that would be caught by it. Something subtler might be needed. On the other hand, the existing cpuhog analysis probably triggers on a big compilation job too.
Here's something new:
I need to look at this in some detail, but here we have a large number of jobs, each running for a few hours on one core, not using the GPU, and overlapping significantly. See, for example, all the jobs started at 15:05. If these were all one job, it would fall under the "cpuhog" rule, i.e., using a lot of CPU and no GPU. But because they are separate jobs, they would not be seen by that search (the cpuhog search requires that at least 10% of the CPUs be used).
Once cpuhog policies are up and running, there may be deliberate attempts at subverting them, and those may also look like this. It would be nice to have some way of detecting it either way.
(It would also be interesting to find out why these are all different jobs, but it could be legitimate. glide_backend appears to be some biotech code, see https://www.schrodinger.com/.)
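One possible detector, sketched under assumptions not established in this thread: judge a user's jobs together by summing the cores of all of that user's concurrently running non-GPU jobs at regular sample points, so that a clutch of small overlapping jobs is measured as if it were one job. Input lines are assumed to be "user start end cores" with epoch-second timestamps; STEP and CORES are illustrative values.

awk -v STEP=300 -v CORES=64 '
    {
        # record each job: owner, start/end (epoch seconds), cores used
        n = NR; u[n] = $1; s[n] = $2; e[n] = $3; c[n] = $4
        if (n == 1 || $2 < tmin) tmin = $2
        if (n == 1 || $3 > tmax) tmax = $3
    }
    END {
        # at every sample point, sum cores of jobs running per user
        for (t = tmin; t <= tmax; t += STEP) {
            split("", busy)                # clear per-sample totals
            for (i = 1; i <= n; i++)
                if (s[i] <= t && t < e[i])
                    busy[u[i]] += c[i]
            for (user in busy)
                if (busy[user] >= 0.10 * CORES && !(user in flagged)) {
                    flagged[user] = 1      # report each user once
                    print user, "crosses the 10% cpuhog bar at time", t
                }
        }
    }'

Unlike the per-job cpuhog check, this would flag the 15:05 clutch of one-core jobs, because cores are summed per user rather than per job.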