cBio / cbio-cluster

MSKCC cBio cluster documentation

[NOT URGENT] Test upgraded NVIDIA driver in gpu-2-6 #331

Open jchodera opened 9 years ago

jchodera commented 9 years ago

We are having some trouble using the FAH client application on gpu-2-8 (where the new GTX 980 cards were installed), and the advice we have received is to upgrade the 352.39 driver to 355.11 or later. Would it be possible to drain this node of GPU jobs and test the upgrade when feasible?

I believe the 355.11 driver is available here: http://www.nvidia.com/download/driverResults.aspx/90393/en-us
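
For reference, a quick way to confirm which driver a node is currently running (a sketch, not commands from the thread) is:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
cat /proc/driver/nvidia/version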

tatarsky commented 9 years ago

Are you sure you mean gpu-2-8? nvidia-smi on that node shows four GTX 680 cards.

GeForce GTX 680

The GTX 980 cards are in gpu-2-6.

Please confirm with nvidia-smi as well just to make sure we offline the correct node.
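
For example, either of these (a sketch; run on the node itself) will print the installed card models:

nvidia-smi -L
nvidia-smi --query-gpu=index,name --format=csv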

tatarsky commented 9 years ago

Snippet from nodes file as well to show the card types in that group of nodes:

gpu-2-4 np=32 gpus=4 batch gtx780ti nv352
gpu-2-5 np=32 gpus=4 batch gtxtitanx nv352
gpu-2-6 np=32 gpus=4 batch gtx980 nv352    <-------- I believe you want this node but double confirm with nvidia-smi
gpu-2-7 np=32 gpus=4 batch gtx680 nv352
gpu-2-8 np=32 gpus=4 batch gtx680 nv352
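
As an aside (a sketch, assuming Torque-style submission on this cluster), those properties can be used to target a node class rather than a hostname; exact resource syntax depends on the local configuration:

qsub -I -l nodes=1:ppn=1:gpus=1:gtx980
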
jchodera commented 9 years ago

Yep, gpu-2-8 was a typo. I meant gpu-2-6.

tatarsky commented 9 years ago

I've placed a reservation on the GPU resources on gpu-2-6. When I see them come free I will update the driver.
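
A quick way to confirm the GPUs have actually drained before touching the driver (a sketch, not a command recorded here):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv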

tatarsky commented 9 years ago

No GPU activity was seen. Updated driver.

+------------------------------------------------------+                       
| NVIDIA-SMI 355.11     Driver Version: 355.11         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 26%   36C    P0    47W / 180W |     14MiB /  4095MiB |      0%      Default |
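
The exact update steps aren't recorded here; a typical .run-installer update on a drained node (assuming no X server is running, using the package linked in the first comment) looks roughly like:

sudo rmmod nvidia_uvm nvidia                      # unload the old 352.39 modules
sudo sh NVIDIA-Linux-x86_64-355.11.run --silent   # non-interactive install
nvidia-smi                                        # confirm 355.11 is now loaded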

I still have the reservation in place on the GPUs, however. Do you wish to test manually first in case a rollback is desired?

jchodera commented 9 years ago

Yes, will test in the morning (Frankfurt time). Thanks!

tatarsky commented 9 years ago

No prob. Reservation left in place for the GPUs. Batch jobs are unaffected.

tatarsky commented 9 years ago

gpu-2-6 is drained from discussions elsewhere. Did that driver update work out? I can re-add it to the batch queue and re-issue the GPU only reservation if desired.

jchodera commented 9 years ago

My apologies for not having had much time to debug further. There still appears to be something weird going on with the GPU configuration. Will provide more info in my next email.

tatarsky commented 9 years ago

OK. I put the node back in the pool for batch work but stuck a 10-day reservation on the GPUs. Hope that is reasonable.
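
For reference, returning a drained node to the batch pool under Torque is typically done with something like the following (a sketch; the GPU-only reservation itself lives in the scheduler, so its syntax isn't shown here):

pbsnodes -c gpu-2-6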

tatarsky commented 9 years ago

I believe I need to renew the GPU reservation on this node. Done for another 10 days.

jchodera commented 9 years ago

Thanks. We're still chasing this down, and have replicated the issue on a local dev box. It seems to be 980-specific and related to driver versions.

@pgrinaway and @steven-albanese have been investigating on the local dev box.

tatarsky commented 9 years ago

Fun! Noted.