jchodera opened this issue 8 years ago
Noted. I am preparing a queue modification that will allow the nodes purchased by the Fuchs group to act as batch and gpu nodes when idle. There is one detail involving the unlimited wall time of the batch queue that I want to propose a change for, but that proposal is still under review.
One thing we could start now, if you have a moment: if I offline one of those nodes (it's idle), can you validate that your code works properly on it via manual SSH?
I could do that now!
OK. Please ssh manually to gg06. It has two GTX Titans.
Looks like the wrong CUDA version is set as the default:
```
[chodera@mskcc-ln1 ~]$ which nvcc
/usr/local/cuda-7.5//bin/nvcc
[chodera@mskcc-ln1 ~]$ ssh gg06
Last login: Fri Apr 29 11:27:20 2016 from mskcc-ln1.fast
[chodera@gg06 ~]$ which nvcc
nvcc: Command not found.
[chodera@gg06 ~]$ ls -ltr /usr/local/cuda
lrwxrwxrwx 1 root root 19 Mar 10 15:13 /usr/local/cuda -> /usr/local/cuda-7.0
```
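Once the symlink points at 7.5 to match the login node, a quick check along these lines should confirm it (just a suggested sanity check, assuming 7.5 is the intended default here):
```
# Suggested sanity check (assumes CUDA 7.5 is the intended default, as on the login node):
ssh gg06 'ls -l /usr/local/cuda; /usr/local/cuda/bin/nvcc --version; nvidia-smi -L'
```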
One second, that is correct.
OK. Fixed some rules. Try again.
Hmm. I'm actually seeing a regression somewhere with the default /usr/local/cuda symlink. I'm checking into it now.
OK. I believe the symlink is correct everywhere now. I noted a few nodes were pointing at 7.0 and I'm not sure why; I am investigating.
Found the rule issue, and I believe it is now fixed correctly. I will double-check after the next puppet run, but please continue to test as desired on gg06. When the review of the concerns about other groups running unlimited-walltime batch jobs on these nodes is resolved, I will update everyone via a separate Git issue.
Seems to work now. Thanks!
OK. I will get an update on the decision about letting these nodes handle overflow. Thank you for testing. I will likely announce a general "batch" test on this node as well.
Please note that gg06 is back in the queue. I believe you got the data we need to proceed, but the owning group has an important deadline.
So I've been watching this, and while we still don't have full agreement on how to share the added nodes, I am keeping an eye on the average QTime.
It's currently down to:
Avg/Max QTime (Hours): 6.34/351.68
Work continues on the policies/config for sharing the additional nodes purchased by individual groups. In the meantime, I'm making use of standing reservations whenever the owning group has a deadline.
I am, however, leaving this open until I get a firmer final statement on those sharing policies.
I feel sorry for the poor sap who was waiting 351.68 hours (15 days) for their jobs to start...
So I wrote a script to analyze the job logs for that wait time, and I cannot locate the job reported there as the max. The longest wait I see is actually a GPU job of yours back on 5/16, which sat in the queue for 115 hours. Still not good, but I can't find the 351-hour one.
Which means my script is probably wrong, but I'm trying to quantify the resource shortage for @juanperin as input to the ongoing discussions about sharing the added nodes.
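For reference, a scan of this sort over Torque-style accounting records might look roughly like the sketch below (the log path and field names are assumptions, and this is not necessarily how the actual script works):
```
# Sketch: avg/max queue wait from Torque "E" (job end) accounting records.
# Assumes epoch-second qtime= (enqueue) and start= (run start) fields;
# the accounting log path is an assumption for this cluster.
grep -h ';E;' /var/spool/torque/server_priv/accounting/2016* | awk -F'[ ;]' '
{
  qtime = 0; start = 0; jobid = $4
  for (i = 5; i <= NF; i++) {
    if ($i ~ /^qtime=/) { sub(/^qtime=/, "", $i); qtime = $i }
    if ($i ~ /^start=/) { sub(/^start=/, "", $i); start = $i }
  }
  if (qtime > 0 && start >= qtime) {
    wait_h = (start - qtime) / 3600
    sum += wait_h; n++
    if (wait_h > max) { max = wait_h; maxjob = jobid }
  }
}
END {
  if (n > 0)
    printf "Avg/Max QTime (Hours): %.2f/%.2f  (max: job %s)\n", sum / n, max, maxjob
}'
```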
This is getting to be pretty long: