Going to add this node and others in an hour.
Adding. Please report problems to:
hpc-request@cbio.mskcc.org
GPU oversubscription is not working right for cc27. Checking the rules to allow it even if all slots are filled by batch jobs.
OK. I think I fixed this now.
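For anyone who wants to double-check the fix, a rough sketch of what I'd look at from the submit host (assuming pbsnodes and qstat are available there):
pbsnodes cc27 | grep -E 'state|np =|jobs|gpus'   # how full the batch slots are and whether the gpu is advertised
qstat -q gpu                                     # confirm the gpu queue is enabled and started
Then submit an interactive gpu-queue job and confirm it still lands on cc27 even when the batch slots are full.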
qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230098.hal-sched1.local to start
qsub: job 7230098.hal-sched1.local ready
[cc27 me ~]$ nvidia-smi
(stuff about the tesla)
Any tests that people could do to confirm this system is actually usable would be appreciated. I believe it is currently usable from both the gpu and batch queues.
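A minimal smoke test once a job lands on cc27 might look like this (just a sketch; nothing assumed beyond nvidia-smi being installed and the GPU prologue being active):
hostname                      # should report cc27
echo $CUDA_VISIBLE_DEVICES    # should be set if the Torque GPU prologue is doing its job
nvidia-smi                    # should list the Tesla card and show it idle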
I don't get cc27 when I request tesla---I erroneously get a titan:
[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1,mem=1gb:telsa:gpus=1 -q gpu
qsub: waiting for job 7230614.hal-sched1.local to start
qsub: job 7230614.hal-sched1.local ready
[chodera@gpu-1-16 ~]$ nvidia-smi
Wed May 18 16:51:03 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TITAN Off | 0000:03:00.0 Off | N/A |
| 30% 34C P8 13W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:04:00.0 Off | N/A |
| 30% 31C P8 12W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TITAN Off | 0000:83:00.0 Off | N/A |
| 30% 37C P8 12W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TITAN Off | 0000:84:00.0 Off | N/A |
| 30% 37C P8 14W / 250W | 295MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 22437 C ...grlab/home/karaletsos/anaconda/bin/python 278MiB |
+-----------------------------------------------------------------------------+
[chodera@gpu-1-16 ~]$
I think you requested telsa instead of tesla, so the label might have been ignored.
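One way to catch this kind of typo is to look at the labels the node actually advertises; a quick sketch, assuming pbsnodes is available on the submit host:
pbsnodes cc27 | grep properties    # the tesla label should appear here
pbsnodes :tesla                    # if this Torque build supports property filters, lists every node carrying the tesla label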
Whoops---that's because I cut-and-pasted a typo (telsa).
[chodera@mskcc-ln1 ~]$ qsub -I -l notes=1:ppn=1:tesla:gpus=1 -l mem=1gb
qsub: submit error (Unknown resource type Resource_List.notes)
Ok, sorry about that - I had assumed the label was meant to be tesla
I think it's supposed to be tesla. I'm just not sure how to get it to work.
Your second example has "notes" instead of "nodes"
My test still seems to take me there:
qsub -l nodes=1:ppn=1:gpus=1:shared:tesla -q gpu -I
qsub: waiting for job 7230761.hal-sched1.local to start
qsub: job 7230761.hal-sched1.local ready
[me@cc27 ~]$
Aha! In the queue now!
[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
Sorry, I'm exiting.
Success!
[chodera@mskcc-ln1 ~]$ qsub -I -l nodes=1:ppn=1:tesla:gpus=1 -l mem=1gb -q gpu
qsub: waiting for job 7230765.hal-sched1.local to start
qsub: job 7230765.hal-sched1.local ready
[chodera@cc27 ~]$ nvidia-smi
Wed May 18 17:11:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c Off | 0000:81:00.0 Off | 0 |
| 23% 37C P0 67W / 235W | 22MiB / 11519MiB | 99% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I see you on cc27 in the "oversubscribe" slot (49), which is one more than the 48 batch jobs on there right now.
I am curious whether the GPU infrastructure is right there, though---as in the Torque GPU variables and such. It's been a while since I looked at all that.
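Here's a sketch of what I'd spot-check from inside a gpu-queue job on cc27 (PBS_GPUFILE is Torque's per-job GPU assignment list, assuming this build sets it):
echo $CUDA_VISIBLE_DEVICES                    # should name only the GPU(s) assigned to this job
[ -n "$PBS_GPUFILE" ] && cat "$PBS_GPUFILE"   # entries should look like cc27-gpu0
nvidia-smi -L                                 # NVML ignores CUDA_VISIBLE_DEVICES, so all physical cards still appear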
Testing on the GPU...
CUDA_VISIBLE_DEVICES is correct:
[chodera@cc27 ~]$ setenv | grep CUDA_
CUDA_VISIBLE_DEVICES=0
It works!
[chodera@cc27 examples]$ python benchmark.py --help
Usage: benchmark.py [options]
Options:
-h, --help show this help message and exit
--platform=PLATFORM name of the platform to benchmark
--test=TEST the test to perform: gbsa, rf, pme, amoebagk, or
amoebapme [default: all]
--pme-cutoff=CUTOFF direct space cutoff for PME in nm [default: 0.9]
--seconds=SECONDS target simulation length in seconds [default: 60]
--polarization=POLARIZATION
the polarization method for AMOEBA: direct,
extrapolated, or mutual [default: mutual]
--mutual-epsilon=EPSILON
mutual induced epsilon for AMOEBA [default: 1e-5]
--heavy-hydrogens repartition mass to allow a larger time step
--device=DEVICE device index for CUDA or OpenCL
--precision=PRECISION
precision mode for CUDA or OpenCL: single, mixed, or
double [default: single]
[chodera@cc27 examples]$ python benchmark.py --platform=CUDA --test=pme --seconds=60
Platform: CUDA
Precision: single
Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 35138 steps in 60.0692 seconds
101.081 ns/day
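Note that inside the job only the Torque-assigned card is visible (CUDA_VISIBLE_DEVICES=0 above), so the benchmark's --device flag can simply stay at 0, e.g.:
python benchmark.py --platform=CUDA --test=pme --device=0 --precision=mixed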
OK, cool. I still need to validate that CUDA_VISIBLE_DEVICES works under bash/sh. But I believe the queue infrastructure is now correct for batch and gpu.
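A minimal sketch of how I plan to check that, using a throwaway batch script (the file name cuda_env_check.sh is just for illustration):
#!/bin/bash
#PBS -l nodes=1:ppn=1:tesla:gpus=1
#PBS -l mem=1gb
#PBS -q gpu
# print what a bash batch job actually sees; output lands in the usual .o<jobid> file
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
nvidia-smi -L
Submit with qsub cuda_env_check.sh and check the output file for a single GPU index.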
Looked ok to me. Closing.
The Fuchs and SBIO groups have authorized adding their purchased nodes to the batch queue. This represents a considerable number of cores, RAM, and GPUs, plus a Tesla card.
We have node cc27 offlined to test that the puppet process has all needed items on it. We are asking you to validate it as soon as possible so that it does not eat jobs. The unit also contains a Tesla and is marked as such in its properties (tesla). While we intend to deploy some variations of the items in #407, we will repeat them separately as implemented. We will start with the goal of getting systems into the batch queue.
If no issues are found I will slowly add these nodes over the course of the day.