cBio / cbio-cluster

MSKCC cBio cluster documentation

Average queue wait time is now over 10 hours! #405

Open jchodera opened 8 years ago

jchodera commented 8 years ago

This is getting to be pretty long:

```
[chodera@mskcc-ln1 ~/scripts]$ showstats

moab active for   14:00:12:17  stats initialized on Tue Mar  8 12:16:45 2016

Eligible/Idle Jobs:              2102/2102   (100.000%)
Active Jobs:                      673
Successful/Completed Jobs:     235236/235236 (100.000%)
Avg/Max QTime (Hours):          10.82/351.68
Avg/Max XFactor:                 0.13/704.36

Dedicated/Total ProcHours:      1.34M/3.97M  (33.660%)

Current Active/Total Procs:      1706/3344   (51.017%)

Avg WallClock Accuracy:          10.335%
Avg Job Proc Efficiency:         66.654%
Est/Avg Backlog:                19:37:00/1:11:55:15
```

tatarsky commented 8 years ago

Noted. I am preparing a queue modification that would allow the nodes purchased by the Fuchs group to act as batch and GPU nodes when idle. There is a detail involving the unlimited wall time of the batch queue, however, that I want to propose a change for, and that proposal is still being reviewed.

One item I could start now, if you have a moment: if I offline one of those nodes (as it's idle), can you validate that your code works properly on it via manual SSH?

jchodera commented 8 years ago

I could do that now!

tatarsky commented 8 years ago

OK. Please ssh manually to gg06. It has two GTX Titans.
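
A quick manual validation along these lines is presumably what's intended; the exact commands below are only an illustrative sketch, not part of the original exchange:

```
# from a login node: confirm the GPUs are visible and a CUDA toolchain is on PATH
ssh gg06 'hostname; nvidia-smi -L; which nvcc && nvcc --version'
```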

jchodera commented 8 years ago

Looks like you have the wrong CUDA version installed as default:

```
[chodera@mskcc-ln1 ~]$ which nvcc
/usr/local/cuda-7.5//bin/nvcc
[chodera@mskcc-ln1 ~]$ ssh gg06
Last login: Fri Apr 29 11:27:20 2016 from mskcc-ln1.fast
[chodera@gg06 ~]$ which nvcc
nvcc: Command not found.
[chodera@gg06 ~]$ ls -ltr /usr/local/cuda
lrwxrwxrwx 1 root root 19 Mar 10 15:13 /usr/local/cuda -> /usr/local/cuda-7.0
```
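
For reference, repointing the symlink by hand would look roughly like the following. This is only a sketch requiring root on the node; the actual fix was rolled out through the puppet rules discussed below:

```
# on gg06, as root: retarget the default CUDA symlink at the 7.5 toolkit
ln -sfn /usr/local/cuda-7.5 /usr/local/cuda
ls -l /usr/local/cuda   # should now show: /usr/local/cuda -> /usr/local/cuda-7.5
```
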
tatarsky commented 8 years ago

One second, that is correct.

tatarsky commented 8 years ago

OK. Fixed some rules. Try again.

tatarsky commented 8 years ago

Hmm. I'm actually seeing a regression somewhere with the default /usr/local/cuda symlink. I'm checking into it now.

tatarsky commented 8 years ago

OK. I believe that is correct everywhere now. I noted a few nodes were pointing at 7.0 and I'm not sure why. I am investigating.

tatarsky commented 8 years ago

Found the rule issue and I believe it is now fixed correctly. I will double-check after the next puppet run, but please continue to test as desired on gg06. Once the review of the concerns about unlimited batch walltime being used on these nodes by other groups is addressed, I will update everyone via a separate GitHub issue.
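
A post-puppet spot check across the GPU nodes might look like the sketch below (the node names beyond gg06 are an assumption for illustration):

```
# every node should report the link pointing at cuda-7.5 after the puppet run
for h in gg01 gg02 gg03 gg04 gg05 gg06; do
    printf '%s: ' "$h"
    ssh "$h" readlink /usr/local/cuda
done
```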

jchodera commented 8 years ago

Seems to work now. Thanks!

tatarsky commented 8 years ago

OK. I will get an update on the ruling for making these nodes able to handle overflow. Thank you for testing. I will likely announce a general "batch" test on this node as well.
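
A general batch test on gg06 could be as simple as the smoke-test submission sketched below; this assumes a Torque-style qsub and that gg06 is accepting batch jobs by then, and the resource strings are purely illustrative:

```
# submit a five-minute job pinned to gg06 that reports its host and visible GPUs
echo 'hostname; nvidia-smi -L' | qsub -N gg06-batch-test -l nodes=gg06:ppn=1 -l walltime=00:05:00
```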

tatarsky commented 8 years ago

Please note gg06 is back in the queue. I believe you got the data we needed to proceed with the process, but the group that owns it has an important deadline.

tatarsky commented 8 years ago

So I've been watching this, and while we still don't have full agreement on how to share the added nodes, I am keeping an eye on the average QTime.

It's currently down to:

```
Avg/Max QTime (Hours):           6.34/351.68
```

Work continues on the policies/config needed to share the nodes purchased by individual groups. In the meantime I'm making use of standing reservations when a deadline is upon the group in question.
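
For anyone curious, a standing reservation of this kind is normally declared in moab.cfg; the snippet below is only an illustrative sketch, and the reservation name, host list, and user list are made up for the example:

```
# moab.cfg: hold a purchased node for the owning group's users around a deadline
SRCFG[fuchsdeadline] HOSTLIST=gg06
SRCFG[fuchsdeadline] USERLIST=fuchsuser1,fuchsuser2
SRCFG[fuchsdeadline] PERIOD=INFINITY
```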

I am however leaving this open until I get a better final statement on some of those sharing policies.

jchodera commented 8 years ago

I feel sorry for the poor sap who was waiting 351.68 hours (15 days) for their jobs to start...

tatarsky commented 8 years ago

So I wrote a script to try to analyze the job logs for that wait time, and I cannot locate the job shown there as the max. The longest I see is actually a GPU job of yours back on 5/16 that was in the queue for 115 hours. Still not good, but I can't find that 351-hour one.

Which means my script is probably wrong, but I'm trying to quantify the resource shortages for @juanperin as input to the ongoing discussions about sharing the added nodes.
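
Not the script referenced above, but a sketch of the kind of accounting-log scan involved, assuming Torque-style server accounting records; the log path is an assumption:

```
# report the largest (start - qtime) gap seen in the start ("S") records
awk -F';' '$2 == "S" {
    q = s = 0
    n = split($4, kv, " ")
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^qtime=/) { sub(/^qtime=/, "", kv[i]); q = kv[i] }
        if (kv[i] ~ /^start=/) { sub(/^start=/, "", kv[i]); s = kv[i] }
    }
    if (q > 0 && s > 0 && s - q > max) { max = s - q; id = $3 }
}
END { printf "max wait: %.2f hours (job %s)\n", max / 3600, id }
' /var/spool/torque/server_priv/accounting/2016*
```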