cBio / cbio-cluster

MSKCC cBio cluster documentation

GTX titan nodes out of service? #396

Closed lzamparo closed 8 years ago

lzamparo commented 8 years ago

Hey,

I've had a job that requires a GTX Titan device (1 node, 1 core, 1 gtxtitan GPU) enqueued in the active queue all day. Running qstat -f | grep gtxtitan -A10 -B 10 | less to manually track down which GPU nodes are in use shows that 7 of the 10 gtxtitan nodes are currently exclusively reserved for jobs in the gpu queue.
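(For reference, a hedged alternative that starts from the node list instead of the queue; the grep context widths are guesses and may need adjusting:)

```
# List the node stanzas that advertise the gtxtitan property and skim
# their state/jobs fields; context widths are approximate
pbsnodes -a | grep -B 4 -A 3 gtxtitan | less
```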

This should still leave 3 nodes and 12 GTX Titan GPUs for the rest of us, yet showstart claims I won't get in until Wednesday at 3am:

```
$ showstart 7057672
job 7057672 requires 1 proc for 3:00:00

Estimated Rsv based start in 1:07:10:11 on Wed Mar 30 03:02:11
Estimated Rsv based completion in 1:10:10:11 on Wed Mar 30 06:02:11
```

Are there GTX Titan nodes that are not in service? Or am I just really unlucky, and all 11 jobs currently running in the gpu queue are using GTX Titan nodes?

tatarsky commented 8 years ago

You know there are more reasons not to get scheduled than your GPU request, right?

tatarsky commented 8 years ago

I really recommend you review the output of checkjob -v -v -v 7057672 instead of just using showstart. It tells you why your job is waiting and on what resources.
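(A hedged sketch putting the two commands side by side, using the job id from this thread:)

```
# Verbose checkjob lists the specific resources the idle job is blocked on
checkjob -v -v -v 7057672 | less

# showstart only reports an estimated reservation-based start time
showstart 7057672
```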

But I just sent you an email about one Titan node I just restored to service following a job that really messed it up. It contains an item I don't wish to post in the public Git, for you to consider.

tatarsky commented 8 years ago

You also exposed a username in the public Git. I am removing it.

tatarsky commented 8 years ago

Per your email response, note I JUST released gpu-2-14 for running jobs following the problems I described. More in your email in a second. Please check your jobs!

tatarsky commented 8 years ago

I show your GPU jobs on that node. Please validate you are running OK.

lzamparo commented 8 years ago

Apologies, I tried redacting it just now, but maybe you're already editing it?

tatarsky commented 8 years ago

Already done ;)

lzamparo commented 8 years ago

The interactive job is running fine, thanks again.

lzamparo commented 8 years ago

But my gpu queue jobs failed because nvidia-uvm is missing on that node (#389).

tatarsky commented 8 years ago

Hold on. Trying to see why.

tatarsky commented 8 years ago

Resubmit. I don't understand the chain of events that leads to that. I will have to debug it with the next node that goes down.

jchodera commented 8 years ago

Thanks, @tatarsky: Added your suggestions to the FAQ since I realized we didn't have checkjob up there anywhere.

tatarsky commented 8 years ago

It's briefly in https://github.com/cBio/cbio-cluster/wiki/Useful-torque-and-moab-commands-for-managing-batch-jobs. I find it the most useful command for seeing what resources the job cannot get.

tatarsky commented 8 years ago

I will also mention, @lzamparo, an item that is perhaps not well known. There is "another Titan", but it's in gpu-2-5 and its Torque resource property is subtly different, as it's a TEST card.

Its property is gtxtitanx (compared to gtxtitan).

Might help.
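(A minimal sketch of a submission script requesting that property; appending it to the nodes spec mirrors the gtxtitan convention and is an assumption, not something verified against this cluster's config:)

```
#!/bin/bash
# Hedged sketch: one core and one GPU on a node carrying the gtxtitanx
# property (the TEST Titan X in gpu-2-5); property placement assumed to
# follow the usual Torque nodes=...:ppn=...:gpus=...:<property> form
#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1:gtxtitanx
#PBS -l walltime=03:00:00

# Quick sanity check that the job landed on a GPU node
nvidia-smi
```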

jchodera commented 8 years ago

Note gpu-2-5 has four of these gtxtitanx cards, not just one.

jchodera commented 8 years ago

> It's briefly in https://github.com/cBio/cbio-cluster/wiki/Useful-torque-and-moab-commands-for-managing-batch-jobs. I find it the most useful command for seeing what resources the job cannot get.

Added that link to the FAQ!

tatarsky commented 8 years ago

Ah, sorry, good point. I meant that ;) Currently that node is booked solid with pure compute. But keep it in mind. If we do not wish to keep the subtlety of the "x" at the end of the resource name, advise.

jchodera commented 8 years ago

> Ah, sorry, good point. I meant that ;) Currently that node is booked solid with pure compute.

I seem to recall that we allowed four overcommitted thread-slots per node for the gpu queue. I am not entirely sure if this was ever correctly implemented by SDSC, but it was definitely in our spec sheet for the configuration.

tatarsky commented 8 years ago

Hmm. I'd have to look for that and/or determine how to do it if it's not working.

That won't happen this evening, but I'll add it to the docket for tomorrow. That wouldn't change a job waiting on RAM, however (I don't think, anyway; correct me if I am wrong as to the intended spec), which is also a resource I've seen be quite tight, and some of @lzamparo's recent requests involved a fair chunk of RAM.

lzamparo commented 8 years ago

About that: after those (high memory) jobs have finished, is there a way I can determine how much of the memory I requested was actually used? My original estimate for these heavy-RAM jobs was an upper bound; maybe I don't need to request as much as that.

tatarsky commented 8 years ago

I will describe a few ways I do this in the morning.

lzamparo commented 8 years ago

Thanks. In the meantime, I've tried scraping the job number from the .o file produced, but neither tracejob nor checkjob succeeds; it seems they cannot look far enough back in time.

When I manually grep the Torque server logs for the job numbers corresponding to my jobs, it seems as if they aren't using nearly the amount of memory I expected:

```
Exit_status=0 resources_used.cput=00:00:00 resources_used.energy_used=0 resources_used.mem=14124kb resources_used.vmem=368672kb resources_used.walltime=20:27:35
```

It seems the job only used a tiny amount of RAM. Am I interpreting this correctly? If so, it is probably because the mini-batch data manager only grabs a small chunk of data from the hdf5 file at a time, transfers it to the device, trains, updates the parameters, and repeats. It seems the hdf5 module in Lua is smart enough to avoid loading my whole data set into memory. If you can confirm, I'll adjust my Torque scripts to not request such a large amount of memory; apologies for any hassles caused.
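(For reference, a hedged sketch of that log grep; the server log location is an assumption, typically $PBS_HOME/server_logs, often /var/spool/torque/server_logs:)

```
# Pull the final accounting line (Exit_status / resources_used.*) for a
# given job id out of the Torque server logs; the path is an assumption
grep -h resources_used /var/spool/torque/server_logs/* | grep 7057672
```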

tatarsky commented 8 years ago

tracejob needs to be told how many days back to look if it is more than one:

   -n : number of days in the past to look for job(s) [default 1]

More in the morning.
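(A minimal usage sketch; the look-back window is illustrative:)

```
# Look back 7 days for this job's records instead of the default 1
tracejob -n 7 7057672
```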

tatarsky commented 8 years ago

Per your item above (and if you put the job numbers in requests I can double check), I would agree that the reported memory usage (roughly 14MB resident and 360MB virtual) qualifies as small, and you could try reducing that memory request to get at least that requirement lower for scheduling.
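(A hedged sketch of a trimmed request; the values are illustrative only and should keep headroom above the observed usage:)

```
# mem/vmem are standard Torque resource names; pick values comfortably
# above the ~14MB / ~360MB observed, e.g.:
#PBS -l mem=2gb,vmem=4gb
```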

tatarsky commented 8 years ago

I am going to close this but make a new one for the concept mentioned of overcommit on the gpu queue. I can't locate any such construct in the config, but I am reviewing the steps I think would be required to do so. I will confirm with a ticket to Adaptive before deployment.