cBio / cbio-cluster

MSKCC cBio cluster documentation

Please help validate a set of five newish compute only nodes #397

Closed tatarsky closed 8 years ago

tatarsky commented 8 years ago

Per discussion with @juanperin, a set of five HP DL160 Gen9 units is being prepped for addition to batch to increase the processor count available to jobs that do not require GPUs. The units were brought over from another location last December to assist another researcher, who has recently gained his own nodes.

They are compute only (no GPUs), with 48 thread slots and 256GB of RAM.

I originally added them fairly quickly using a post-ROCKS Puppet method. That worked for the specific researcher in question, but he was the only "client", as it were. The nodes met his needs, and my basic tests show the expected items in place.

But to validate them a bit more for a wider audience, I would ask the following of anyone who finds this interesting, so that I do not introduce nodes with missing items into the batch queue, which is the default for ALL users.
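As a minimal sketch of the kind of check a user could run on one of these nodes (this is not the procedure actually used here), the script below reports executables and directories that a job depends on but that are absent. The function name `check_node` and the example tool list are hypothetical; substitute whatever your own jobs need.

```python
import os
import shutil


def check_node(required_tools, required_dirs):
    """Return a list of human-readable problems found on this node."""
    problems = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing executable: {tool}")
    for path in required_dirs:
        # Catches an absent directory, e.g. a home directory that is not mounted.
        if not os.path.isdir(path):
            problems.append(f"missing directory: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical lists; replace with whatever your own jobs depend on.
    tools = ["python", "R", "gcc"]
    dirs = [os.path.expanduser("~")]
    issues = check_node(tools, dirs)
    if issues:
        print("\n".join(issues))
    else:
        print("no missing items detected")
```

Running this once on a known-good node and once on a new node gives a quick side-by-side view of what differs.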

This test will be for a few days.

It would be nice to relieve some of the pressure on GPU containing nodes from compute only jobs.

Thank you.

tatarsky commented 8 years ago

Note these units will be added tomorrow unless I hear they are missing software.

tatarsky commented 8 years ago

@akahles in particular I want you to be aware of this as I believe many of your waiting jobs would run on these machines if I add them tomorrow morning. (assuming of course they are still waiting)

akahles commented 8 years ago

Currently traveling with irregular access to internet. Had a quick look at the node and seems to be fine. My jobs use mostly python that uses my own anaconda. As long as my home is mounted, it should be fine :)

tatarsky commented 8 years ago

Gotcha. Just didn't want to surprise you. Safe travels.

akahles commented 8 years ago

Thanks for the heads up anyways!

lzamparo commented 8 years ago

I'll try some compute & memory heavy R jobs.

tatarsky commented 8 years ago

Sounds good. I was hoping gpu-2-8 had some Titans in it for you, but it's GTX-680s. We managed to get that unit repaired today and I'm validating it with a new health check.

lzamparo commented 8 years ago

No worries, with gpu-2-14 and gpu-2-5 in service I should be ok.

tatarsky commented 8 years ago

OK.

tatarsky commented 8 years ago

Most likely I will add these nodes at around 2:00PM today as I will have a nice clear section of my day to listen for any issues. If you are still manually performing tests I'll hold off.

lzamparo commented 8 years ago

My tests are still running, but I'll have to abort anyway, as there's an error I've detected. I'll kill my jobs and rewrite them for submission to the batch queue.

tatarsky commented 8 years ago

Performing an initial test of just adding cc01 to batch.

tatarsky commented 8 years ago

Some jobs appear running there. Will wait for a bit to monitor for any issues.

tatarsky commented 8 years ago

cc02 added. cc03/cc04/cc05 in a moment.

tatarsky commented 8 years ago

@akahles just a heads up: some of your jobs are running on cc02. They look OK to me, but that's just from a process table view.

tatarsky commented 8 years ago

cc03/cc04/cc05 are now in batch as well. I'll watch for a bit to make sure they are not eating jobs, then close this.

tatarsky commented 8 years ago

Units appear to be processing jobs. Closing for now.