cBio / cbio-cluster

MSKCC cBio cluster documentation

IMPORTANT: please test batch-style jobs on cc27, if possible, before additional nodes are added #414

Closed tatarsky closed 8 years ago

tatarsky commented 8 years ago

The Fuchs and SBIO groups have authorized adding their purchased nodes to the batch queue. This represents a considerable number of cores, RAM, GPUs, and a Tesla card.

We have taken node cc27 offline to test that the Puppet process has put all needed items on it. We are asking you to validate this there, if possible, ASAP so that it does not eat jobs. The unit also contains a Tesla and is marked as such in its properties (tesla).

While we intend to deploy some variations of the items in #407, we will report on them separately as they are implemented. We will start with the goal of getting the systems into the batch queue.

If no issues are found I will slowly add these nodes over the course of the day.

tatarsky commented 8 years ago

Going to add this node and others in an hour.

tatarsky commented 8 years ago

Adding. Report problems please to:


tatarsky commented 8 years ago

GPU oversubscription is not working correctly for cc27. Checking the rules that should allow it even if all slots are filled by batch jobs.

tatarsky commented 8 years ago

OK. I think I fixed this now.

qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230098.hal-sched1.local to start
qsub: job 7230098.hal-sched1.local ready

[cc27 me ~]$ nvidia-smi
(stuff about the tesla)

Any tests that people could do to confirm this system is actually usable would be appreciated. I believe it is currently usable from the gpu and batch queues.
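For anyone who wants a batch-style check rather than an interactive one, a minimal job script along these lines might do. This is only a sketch: the queue name `batch`, the job name, the walltime, and the `report_node` helper are assumptions for illustration, and `nodes=cc27` relies on Torque's option of naming a host in place of a node count.

```shell
#!/bin/sh
# Hypothetical smoke-test job for cc27; submit with: qsub cc27-smoke-test.sh
# The queue name, job name, and walltime below are assumptions.
#PBS -N cc27-smoke-test
#PBS -q batch
#PBS -l nodes=cc27:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
#PBS -j oe

# Print basic facts about the node the job landed on.
report_node() {
    echo "host: $(uname -n)"
    echo "job id: ${PBS_JOBID:-not set (not running under Torque)}"
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-not set}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L lists the GPUs the driver can see, one per line.
        nvidia-smi -L || echo "nvidia-smi: failed to query GPUs"
    else
        echo "nvidia-smi: not found on PATH"
    fi
}

report_node
```

The script also runs outside the scheduler, in which case it simply reports that no job id is set.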

jchodera commented 8 years ago

I don't get cc27 when I request tesla---I erroneously get a titan

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230614.hal-sched1.local to start
qsub: job 7230614.hal-sched1.local ready

[chodera@gpu-1-16 ~]$ nvidia-smi
Wed May 18 16:51:03 2016       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   34C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   31C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   37C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   37C    P8    14W / 250W |    295MiB /  6143MiB |      0%    E. Thread |

| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    3     22437    C   ...grlab/home/karaletsos/anaconda/bin/python   278MiB |
[chodera@gpu-1-16 ~]$ 

akahles commented 8 years ago

I think you requested telsa instead of tesla so the label might have been ignored.

jchodera commented 8 years ago

Whoops---that's because I cut-and-pasted a typo (telsa).

jchodera commented 8 years ago

[chodera@mskcc-ln1 ~]$ qsub -I -l notes=1:ppn=1:tesla:gpus=1 -l mem=1gb
qsub: submit error (Unknown resource type  Resource_List.notes)

akahles commented 8 years ago

Ok, sorry about that - I had assumed the label was meant to be tesla

jchodera commented 8 years ago

I think it's supposed to be tesla. I'm just not sure how to get it to work.

tatarsky commented 8 years ago

Your second example has "notes" instead of "nodes"
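Both typos in this thread (`telsa` and `notes`) could be caught before submission. The following is a rough sketch of such a check, not anything the cluster provides: `check_nodes_spec` and `KNOWN_PROPERTIES` are hypothetical names, and the property list contains only the two properties that appear in this thread (`tesla`, `shared`).

```shell
#!/bin/sh
# Hypothetical pre-submission check for a Torque "nodes=..." resource spec.
# KNOWN_PROPERTIES is an assumption: extend it with the properties your nodes carry.
KNOWN_PROPERTIES="tesla shared"

check_nodes_spec() {
    cns_spec=$1
    # Catch a misspelled resource name such as "notes=".
    case "$cns_spec" in
        nodes=*) ;;
        *) echo "error: expected 'nodes=...', got '$cns_spec'" >&2; return 1 ;;
    esac
    cns_rest=${cns_spec#nodes=}
    # The first field is a node count or a hostname, so drop it.
    case "$cns_rest" in
        *:*) cns_rest=${cns_rest#*:} ;;
        *) return 0 ;;
    esac
    # Split the remaining fields on ':' into the positional parameters.
    cns_old_ifs=$IFS
    IFS=:
    set -- $cns_rest
    IFS=$cns_old_ifs
    for cns_field in "$@"; do
        # key=value items such as ppn=1 or gpus=1 are not node properties.
        case "$cns_field" in
            *=*) continue ;;
        esac
        cns_found=0
        for cns_known in $KNOWN_PROPERTIES; do
            if [ "$cns_field" = "$cns_known" ]; then
                cns_found=1
            fi
        done
        if [ "$cns_found" -eq 0 ]; then
            echo "error: unknown node property '$cns_field'" >&2
            return 1
        fi
    done
    return 0
}
```

It could be used as `check_nodes_spec "nodes=1:ppn=1:tesla:gpus=1" && qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu`, so that a misspelled property fails loudly instead of landing the job on an unintended node.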

tatarsky commented 8 years ago

My test still seems to take me there:

qsub -l nodes=1:ppn=1:gpus=1:shared:tesla -q gpu -I
qsub: waiting for job 7230761.hal-sched1.local to start
qsub: job 7230761.hal-sched1.local ready

[me@cc27 ~]$ 

jchodera commented 8 years ago

Aha! In the queue now!

[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start

tatarsky commented 8 years ago

Sorry, I'm exiting.

jchodera commented 8 years ago


[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
qsub: job 7230765.hal-sched1.local ready

[chodera@cc27 ~]$ nvidia-smi
Wed May 18 17:11:47 2016       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 23%   37C    P0    67W / 235W |     22MiB / 11519MiB |     99%    E. Thread |

| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|  No running processes found                                                 |

tatarsky commented 8 years ago

I show you on cc27 in "oversubscribe" slot "49" which is one more than the 48 batch jobs on there right now.

Am curious if the GPU infrastructure is right there though.

tatarsky commented 8 years ago

As in Torque GPU variables and such. Been awhile since I looked at all that.

jchodera commented 8 years ago

Testing on the GPU...

jchodera commented 8 years ago


[chodera@cc27 ~]$ setenv | grep CUDA_

It works!

[chodera@cc27 examples]$ python benchmark.py --help
Usage: benchmark.py [options]

  -h, --help            show this help message and exit
  --platform=PLATFORM   name of the platform to benchmark
  --test=TEST           the test to perform: gbsa, rf, pme, amoebagk, or
                        amoebapme [default: all]
  --pme-cutoff=CUTOFF   direct space cutoff for PME in nm [default: 0.9]
  --seconds=SECONDS     target simulation length in seconds [default: 60]
                        the polarization method for AMOEBA: direct,
                        extrapolated, or mutual [default: mutual]
                        mutual induced epsilon for AMOEBA [default: 1e-5]
  --heavy-hydrogens     repartition mass to allow a larger time step
  --device=DEVICE       device index for CUDA or OpenCL
                        precision mode for CUDA or OpenCL: single, mixed, or
                        double [default: single]
[chodera@cc27 examples]$ python benchmark.py --platform=CUDA --test=pme --seconds=60
Platform: CUDA
Precision: single

Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 35138 steps in 60.0692 seconds
101.081 ns/day
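
As a sanity check on the reported throughput, the ns/day figure follows directly from the step count, step size, and wall-clock time printed above. A small sketch of the arithmetic (the `ns_per_day` helper name is made up for illustration):

```shell
#!/bin/sh
# ns_per_day STEPS STEP_SIZE_FS WALL_SECONDS
# Converts simulated time (steps * step size) into nanoseconds per day.
ns_per_day() {
    awk -v steps="$1" -v fs="$2" -v secs="$3" 'BEGIN {
        ns = steps * fs * 1e-6             # 1 fs = 1e-6 ns
        printf "%.3f\n", ns / secs * 86400 # 86400 seconds per day
    }'
}

# Values taken from the benchmark output above.
ns_per_day 35138 2 60.0692   # prints 101.081
```
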
tatarsky commented 8 years ago

OK cool. I need to validate that bash/sh CUDA_VISIBLE_DEVICES works. But I believe the queue infrastructure is now correct for batch and gpu.
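
One way to check this from a bash/sh job might look like the sketch below. It assumes the scheduler exports `CUDA_VISIBLE_DEVICES` as a comma-separated list of device indices; the `count_visible_gpus` helper is a hypothetical name, not part of Torque or CUDA.

```shell
#!/bin/sh
# Count the GPUs exposed to this shell through CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list such as "0" or "0,1"; prints 0 if unset or empty.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return 0
    fi
    cvg_old_ifs=$IFS
    IFS=,
    # Split the list into positional parameters and count them.
    set -- $CUDA_VISIBLE_DEVICES
    IFS=$cvg_old_ifs
    echo $#
}

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "visible GPU count: $(count_visible_gpus)"
```

Inside a job submitted with `gpus=1`, a count of 1 would suggest the variable is being set as intended; a count of 0 would suggest it is not reaching bash/sh sessions.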

tatarsky commented 8 years ago

Looked ok to me. Closing.