Open jchodera opened 9 years ago
We are having some trouble using the FAH client application on gpu-2-8 (where the new GTX 980 cards were installed), and the advice we have received is to upgrade the 352.39 driver to 355.11 or later. Would it be possible to drain this node of GPU jobs and test the upgrade when feasible? I believe the 355.11 driver is available here: http://www.nvidia.com/download/driverResults.aspx/90393/en-us
Are you sure you mean gpu-2-8? nvidia-smi on that node shows four of the GTX 680 ("GeForce GTX 680").
The GTX 980 cards are in gpu-2-6.
Please confirm with nvidia-smi as well, just to make sure we offline the correct node.
Here is a snippet from the nodes file, showing the card types in that group of nodes:
gpu-2-4 np=32 gpus=4 batch gtx780ti nv352
gpu-2-5 np=32 gpus=4 batch gtxtitanx nv352
gpu-2-6 np=32 gpus=4 batch gtx980 nv352 <-------- I believe you want this node, but double-check with nvidia-smi
gpu-2-7 np=32 gpus=4 batch gtx680 nv352
gpu-2-8 np=32 gpus=4 batch gtx680 nv352
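For anyone checking later, a quick way to confirm which cards a node actually has (a sketch; run it on the node in question, hostname as used in this thread):

# List the installed GPUs and the running driver version as plain CSV.
nvidia-smi --query-gpu=index,name,driver_version --format=csv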
Yep, gpu-2-8 was a typo. I meant gpu-2-6.
I've placed a reservation on the GPU resources on gpu-2-6. When I see them come free, I will update the driver.
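For future reference, a GPU-only reservation of this sort can be set up with Moab roughly as follows (a sketch; the ten-day duration matches the thread, but the exact command, flags, and GRES handling on this cluster are assumptions, not a record of what was actually run):

# Reserve gpu-2-6 for ten days so no new GPU jobs land there while the
# driver is updated; how GPUs (rather than whole cores) get reserved
# depends on the site's GRES configuration.
mrsvctl -c -h gpu-2-6 -d 10:00:00:00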
No GPU activity was seen. Updated driver.
+------------------------------------------------------+
| NVIDIA-SMI 355.11     Driver Version: 355.11         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     Off |                  N/A |
| 26%   36C    P0    47W / 180W |     14MiB /  4095MiB |      0%      Default |
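For the record, a runfile-style update on a node like this looks roughly like the following (a sketch; the installer filename matches the 355.11 download linked in the original post, and the surrounding steps are assumptions about this node's setup):

# With no GPU activity, unload the old kernel modules (nvidia_uvm first if
# loaded), run the installer non-interactively, then verify the version.
rmmod nvidia
sh NVIDIA-Linux-x86_64-355.11.run --silent
nvidia-smi --query-gpu=driver_version --format=csv,noheader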
I still have the reservation in place on the GPUs, however. Do you wish to test manually first in case a rollback is desired?
Yes, will test in the morning (Frankfurt time). Thanks!
No prob. Reservation left in place for the GPUs; batch jobs are unaffected.
gpu-2-6 is drained per discussions elsewhere. Did that driver update work out? I can re-add it to the batch queue and re-issue the GPU-only reservation if desired.
My apologies for not having much time to debug this further. There still appears to be something weird going on with the GPU configuration. Will provide more info in my next email.
OK. I put the node back in the pool for batch work but stuck a 10-day reservation on the GPUs. Hope that is reasonable.
I believe I need to renew the GPU reservation on this node. Done for another 10 days.
Thanks. We're still chasing this down, and have replicated the issue on a local dev box. It seems to be 980-specific and related to driver versions.
@pgrinaway and @steven-albanese have been investigating on the local dev box.
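If it helps the comparison, the relevant details can be captured identically on the dev box and on gpu-2-6 with a single query (a sketch, using standard nvidia-smi query fields):

# Record card model, driver version, and VBIOS on each machine so the two
# setups can be compared side by side.
nvidia-smi --query-gpu=name,driver_version,vbios_version --format=csv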
Fun! Noted.