Closed lzamparo closed 8 years ago
You know there are more reasons a job might not get scheduled than your GPU request, right?
I recommend you review checkjob -v -v -v 7057672
instead of just using showstart. It tells you why your job is waiting and on which resources.
That said, I just sent you an email about one Titan node I just restored to service following a job that really messed it up. It contains details I don't wish to put in the public Git, for you to consider.
You also exposed a username in the public Git. I am removing it.
Per your email response: note I JUST released gpu-2-14 to run jobs again following the problems I described. More in your email in a second. Please check your jobs!
I see your GPU jobs on that node. Please verify they are running OK.
Apologies, I tried to redact it just now, but maybe you're already editing it?
Already done ;)
The interactive job is running fine, thanks again.
But my gpu queue jobs failed because the nvidia-uvm module is missing on that node (#389).
Hold on. Trying to see why.
Resubmit. I don't understand the chain of events that leads to that. I will have to debug it with the next node that goes down.
Thanks, @tatarsky: I added your suggestions to the FAQ, since I realized we didn't have checkjob mentioned up there anywhere.
It's covered briefly in https://github.com/cBio/cbio-cluster/wiki/Useful-torque-and-moab-commands-for-managing-batch-jobs. I find it the most useful tool for seeing which resources a job cannot get.
I will also mention, @lzamparo, an item that is perhaps not well known: there is "another Titan", but it's in gpu-2-5, and its Torque resource property is subtly different because it's a TEST card. Its property is gtxtitanx (compared to gtxtitan). Might help.
Note gpu-2-5 has four of these gtxtitanx cards, not just one.
Added that link to the FAQ!
Ah, sorry, good point. I meant that ;) Currently that node is booked solid with pure compute, but keep it in mind. If we do not wish to keep the subtlety of the "x" at the end of the resource name, advise.
I seem to recall that we allowed four overcommitted thread-slots per node for the gpu queue. I am not entirely sure this was ever correctly implemented by SDSC, but it was definitely in our spec sheet for the configuration.
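For reference, one way such an overcommit is sometimes expressed in Torque is by advertising more processor slots (np) in the server's nodes file than the node has physical cores. A sketch only; the hostname, core count, and GPU count below are hypothetical, and this may not match whatever mechanism the spec sheet actually intended:

```
# /var/spool/torque/server_priv/nodes (hypothetical example)
# A 32-core node advertised with np=36, leaving 4 overcommitted
# thread-slots beyond the physical cores for gpu-queue jobs:
gpu-2-14 np=36 gpus=4 gtxtitan
```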
Hmm. I'd have to look for that and/or determine how to do it if it's not working.
That won't happen this evening, but I'll add it to the docket for tomorrow. That wouldn't change a job waiting on RAM, however (I don't think, anyway; correct me if I am wrong about the intended spec), which is also something I've seen be quite tight, and some of @lzamparo's recent requests involved a fair chunk of RAM.
About that: after those (high-memory) jobs have finished, is there a way I can determine how much of the memory I requested was actually used? My original estimate for these heavy-RAM jobs was an upper bound; maybe I don't need to request as much.
I will describe a few ways I do this in the morning.
Thanks. In the meantime, I've tried scraping the job number from the .o file produced, but neither tracejob nor checkjob succeeds; it seems they cannot look far enough back in time.
When I manually grep the Torque server logs for the job numbers corresponding to my jobs, it seems they aren't using nearly the amount of memory I expected:

```
Exit_status=0 resources_used.cput=00:00:00 resources_used.energy_used=0 resources_used.mem=14124kb resources_used.vmem=368672kb resources_used.walltime=20:27:35
```

It seems the job only used a tiny amount of RAM. Am I interpreting this correctly? If so, it's probably because the mini-batch data manager only grabs a little data from the HDF5 file at a time, transfers it to the device, trains, updates the parameters, and repeats. It seems the hdf5 module in Lua is smart enough to avoid loading my whole data set into memory. If you can confirm, I'll adjust my Torque scripts to not request such a large amount of memory; apologies for any hassle caused.
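As a sanity check on interpreting that record, here is a small sketch of parsing the resources_used.* fields and converting the kb figures to MB. The record format is what Torque writes; the helper functions are my own illustration, not part of any Torque tooling:

```python
import re

def parse_resources_used(record: str) -> dict:
    """Extract resources_used.* fields from a Torque exit/accounting record."""
    return dict(re.findall(r"resources_used\.(\w+)=(\S+)", record))

def kb_to_mb(value: str) -> float:
    """Convert a Torque size string like '14124kb' to megabytes."""
    return int(value.rstrip("kb")) / 1024

record = ("Exit_status=0 resources_used.cput=00:00:00 "
          "resources_used.energy_used=0 resources_used.mem=14124kb "
          "resources_used.vmem=368672kb resources_used.walltime=20:27:35")

used = parse_resources_used(record)
print(f"resident: {kb_to_mb(used['mem']):.1f} MB")   # ~13.8 MB
print(f"virtual:  {kb_to_mb(used['vmem']):.1f} MB")  # ~360.0 MB
```

So yes, resources_used.mem is the peak resident memory, and here it is on the order of 14 MB rather than gigabytes.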
tracejob needs to be told how many days back to look if it's more than one (e.g. tracejob -n 7 7057672):
-n : number of days in the past to look for job(s) [default 1]
More in the morning.
Per your item above (and if you put the job numbers in requests, I can double-check), I would agree that those memory footprints (roughly 14 MB resident, 360 MB virtual) qualify as small, and you could try reducing that resource request to make at least that requirement lower for scheduling purposes.
I am going to close this, but will open a new issue for the GPU-queue overcommit concept mentioned above. I can't locate any such construct in the config, but am reviewing the steps I think would be required to add it. I will confirm with a ticket to Adaptive before deployment.
Hey,
I've had a job that requires a GTX Titan device (1 node, 1 core, 1 gtxtitan GPU) enqueued all day. Running

```
qstat -f | grep gtxtitan -A10 -B10 | less
```

to manually track down which GPU nodes are in use shows that 7 of 10 gtxtitan nodes are currently exclusively reserved for jobs in the gpu queue. This should still leave 3 nodes and 12 GTX Titan GPUs for the rest of us, yet showstart claims I won't get in until Wednesday at 3am:

```
$ showstart 7057672
job 7057672 requires 1 proc for 3:00:00
Estimated Rsv based start in 1:07:10:11 on Wed Mar 30 03:02:11
Estimated Rsv based completion in 1:10:10:11 on Wed Mar 30 06:02:11
```
Are there GTX Titan nodes that are not in service? Or am I just really unlucky, and all 11 jobs running in the gpu queue are using GTX Titan nodes?