Noted it as well. I will get to it when I am back from D.C.
Will attempt in the morning. Does notice need to be given, in case it somehow interferes with 6.0 or something? (Seems unlikely, but I'm asking just in case very important jobs are running.)
If this is installed as a module under /usr/local/cuda-6.5/, there should be no problem.
If there is a driver update required, that may cause running code to die, but that should be OK---I believe all our jobs can be recovered and resumed easily.
I'll try it in a VM to answer the above.
It does look like drivers are included: https://developer.nvidia.com/cuda-downloads
Q: Are the latest NVIDIA drivers included in the CUDA Toolkit installers?
A: For convenience, the installer packages on this page include NVIDIA drivers which support application development for all CUDA-capable GPUs supported by this release of the CUDA Toolkit. If you are deploying applications on NVIDIA Tesla products in a server or cluster environment, please use the latest recommended Tesla driver that has been qualified for use with this version of the CUDA Toolkit. If a recommended Tesla driver is not yet available, please check back in a few weeks.
Release notes are here: http://docs.nvidia.com/cuda/pdf/CUDA_Toolkit_Release_Notes.pdf
Seems complex enough that a VM dry run is indeed advisable.
So the installer offers to install the driver and, if you decline, advises at the end that you need at least driver version 340.00 to actually use the toolkit (the toolkit itself can be installed regardless, and I am working on the module config).
The current NVIDIA driver on the nodes is 331.62, which makes altering that across the nodes a bit more of a project. Not too difficult, but we should clearly decide how to proceed.
The module config is in test on mskcc-ln1, but given the driver issue noted above it is probably not overly interesting yet. I added the default .version for 6.0 and a cuda/6.5 module file. Head node only for now while the above is discussed/planned.
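For reference, a minimal sketch of what that layout can look like, assuming a standard Environment Modules tree (the modulefiles path here is a guess; adjust to wherever the cluster actually keeps them):

cat > /usr/share/Modules/modulefiles/cuda/6.5 <<'EOF'
#%Module1.0
set root /usr/local/cuda-6.5
prepend-path PATH            $root/bin
prepend-path LD_LIBRARY_PATH $root/lib64
EOF

# .version in the same directory keeps 6.0 as the default:
cat > /usr/share/Modules/modulefiles/cuda/.version <<'EOF'
#%Module1.0
set ModulesVersion "6.0"
EOF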
How about we schedule the driver update by making a queue reservation to drain all jobs from the gpu queue beforehand?
I don't think there's an easy way to do driver testing without upgrading a subset of the nodes and creating new driver-specific node properties to allow testing, but that sounds like a lot of unwarranted effort at this stage. I'd suggest we pick a time in the next week (up to you) and just install the latest driver across all nodes, reverting in the unlikely event that this turns out to be a train wreck.
If node reboot (rather than modprobe) is required, this is more serious, though.
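For what it's worth, the reboot question can be answered on a single drained node with something like this (a sketch; module names and installer flags should be double-checked, and this assumes the 340.29 runfile installer):

nvidia-smi                              # confirm nothing is still running on the GPUs
rmmod nvidia_uvm 2>/dev/null            # unload UVM first if it is loaded
rmmod nvidia                            # fails if any process still holds the driver
sh NVIDIA-Linux-x86_64-340.29.run -s    # silent driver install; builds and installs the new kernel module
modprobe nvidia && nvidia-smi           # if this reports 340.29, no reboot was needed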
I'm looking at what was done for the last nvidia upgrade. It appears to have been pushed as a source build to all nodes, so the nvidia-331.62 source tree remains available for a revert. Let me mull a good schedule target.
I'd like to try to drain one node in the gpu queue just to manually test the upgrade steps in a ROCKS setting. I think I can do that.
OK! Keep us advised.
While sitting on hold with Dell about a failed drive, I noticed gpu-3-9 go idle, so I offlined it in Torque, updated the NVIDIA driver, and pushed over the 6.5 module and CUDA 6.5 libraries.
If possible, please manually ssh to gpu-3-9 and test your GPU code. It will remain offline in Torque during this process (so no jobs get scheduled there in case something is different or problematic).
I am still working out the fastest way to do all of this on every node.
Back to hold music.
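For reference, the per-node sequence is roughly something like this (a sketch only, not the exact commands; the staging path and modulefiles location are placeholders):

pbsnodes -o gpu-3-9                                              # mark the node offline in Torque
ssh gpu-3-9 'sh /share/apps/NVIDIA-Linux-x86_64-340.29.run -s'   # driver update; staging path is hypothetical
rsync -a /usr/local/cuda-6.5/ gpu-3-9:/usr/local/cuda-6.5/       # push the 6.5 toolkit and samples
rsync -a /usr/share/Modules/modulefiles/cuda/ gpu-3-9:/usr/share/Modules/modulefiles/cuda/   # push the module files; path assumed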
I had some issues:
-bash-4.1$ nvidia-smi
Failed to initialize NVML: Unknown Error
Same here:
[chodera@gpu-3-9 ~]$ nvidia-smi
Failed to initialize NVML: Unknown Error
It looks like nvidia-smi may not have been updated?
[chodera@gpu-3-9 ~]$ which nvidia-smi
/usr/bin/nvidia-smi
[chodera@gpu-3-9 ~]$ ls -ltr /usr/bin/nvidia-smi
-rwxr-xr-x 1 root root 224904 May 6 16:03 /usr/bin/nvidia-smi
Mucking with modules doesn't fix this:
[chodera@gpu-3-9 ~]$ module list
Currently Loaded Modulefiles:
1) gcc/4.8.1(default) 2) cmake/2.8.10.2(default) 3) cuda/6.0(default) 4) mpich2_eth/1.5 5) cuda/5.5
[chodera@gpu-3-9 ~]$ module unload cuda/5.5
[chodera@gpu-3-9 ~]$ module load cuda/6.5
[chodera@gpu-3-9 ~]$ nvidia-smi
Failed to initialize NVML: Unknown Error
I am showing that nvidia-smi comes from another package, which would also need to be updated.
Dell is now talking to me, so I will look at this more later.
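A quick way to confirm that on the node (hedged sketch; assumes an RPM-based node image):

rpm -qf /usr/bin/nvidia-smi         # which package owns the stale nvidia-smi binary
cat /proc/driver/nvidia/version     # the driver version of the kernel module actually loaded
# a mismatch between the two would explain the "Failed to initialize NVML" error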
Please try again while the music plays some more.
Looking good to me:
-bash-4.1$ nvidia-smi
Sat Aug 30 12:21:24 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 680 Off | 0000:03:00.0 N/A | N/A |
| 30% 39C P0 N/A / N/A | 10MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 680 Off | 0000:04:00.0 N/A | N/A |
| 30% 39C P0 N/A / N/A | 10MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 680 Off | 0000:83:00.0 N/A | N/A |
| 32% 45C P0 N/A / N/A | 10MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 680 Off | 0000:84:00.0 N/A | N/A |
| 31% 42C P0 N/A / N/A | 10MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
| 2 Not Supported |
| 3 Not Supported |
-bash-4.1$ ${HOME}/ocore/ocore_601_CUDA_v18 -target 91f51e8a-7b40-4adc-8ca9-38786a9fe654
O O
P R O T E N E E R C--N \ \ N
| C C=O / \-C
C / | N-C \
.C-C C/ C C | C
/ \ O | | / N |
C C | | O C C /-C
\_N_/ \ N _C_ C | / O / C
C-/ \_C/ \N-/ \ N /-C-\ C | | O /
| | C-/ \C/ N-/ \_ N\ /C\ -C N | |
O | | | \C/ C/ N/ \_C__/ \ C-\ C
C O | | | | C-/ N/ \-C
\_C C O | O | |
\ \-O C C O
| \ \
C N Folding@Home C--N C
\ | OCore | |
N--C O |
\ Yutong Zhao C=O
N proteneer@gmail.com /
O
version 18
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc cc.proteneer.com...
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu...
preparing to start stream...
receiving response...
verifying hash...
assigned to stream fccb0bc9 from target db01ac5c
finished decodiing...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core...
setting initial states...
checking states for discrepancies...
entering main md loop...
resuming from step 9050
date time tpf ns/day frames steps
Aug/30 12:22:28PM 1:28:49 4.05 0 10340
OK, I will look at the process for doing this more rapidly in series and schedule it for next week.
Works for me too!
[chodera@gpu-3-9 ~]$ nvidia-smi
Sat Aug 30 12:23:17 2014
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 680 Off | 0000:03:00.0 N/A | N/A |
| 47% 63C P0 N/A / N/A | 92MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 680 Off | 0000:04:00.0 N/A | N/A |
| 30% 34C P8 N/A / N/A | 11MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 680 Off | 0000:83:00.0 N/A | N/A |
| 30% 40C P8 N/A / N/A | 11MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 680 Off | 0000:84:00.0 N/A | N/A |
| 30% 37C P8 N/A / N/A | 11MiB / 4095MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
| 2 Not Supported |
| 3 Not Supported |
+-----------------------------------------------------------------------------+
CUDA:
[chodera@gpu-3-9 ~/ocores]$ ./ocore_601_CUDA_v20
O O
P R O T E N E E R C--N \ \ N
| C C=O / \-C
C / | N-C \
.C-C C/ C C | C
/ \ O | | / N |
C C | | O C C /-C
\_N_/ \ N _C_ C | / O / C
C-/ \_C/ \N-/ \ N /-C-\ C | | O /
| | C-/ \C/ N-/ \_ N\ /C\ -C N | |
O | | | \C/ C/ N/ \_C__/ \ C-\ C
C O | | | | C-/ N/ \-C
\_C C O | O | |
\ \-O C C O
| \ \
C N Folding@Home C--N C
\ | OCore | |
N--C O |
\ Yutong Zhao C=O
N proteneer@gmail.com /
O
version 20
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc...
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu...
preparing to start stream...
receiving response...
verifying hash...
assigned to stream f6f59c56 from target db01ac5c
finished decoding...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core...
setting initial states...
checking states for discrepancies in initial state... reference...
entering main md loop...
resuming from step 13691
date time tpf ns/day frames steps
Aug/30 12:24:35PM 54:05 6.66 0.12 15039
Aug/30 12:25:36PM 21:31 16.73 0.18 22983
OpenCL:
[chodera@gpu-3-9 ~/ocores]$ ./ocore_601_OpenCL_v20
O O
P R O T E N E E R C--N \ \ N
| C C=O / \-C
C / | N-C \
.C-C C/ C C | C
/ \ O | | / N |
C C | | O C C /-C
\_N_/ \ N _C_ C | / O / C
C-/ \_C/ \N-/ \ N /-C-\ C | | O /
| | C-/ \C/ N-/ \_ N\ /C\ -C N | |
O | | | \C/ C/ N/ \_C__/ \ C-\ C
C O | | | | C-/ N/ \-C
\_C C O | O | |
\ \-O C C O
| \ \
C N Folding@Home C--N C
\ | OCore | |
N--C O |
\ Yutong Zhao C=O
N proteneer@gmail.com /
O
version 20
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc...
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu...
preparing to start stream...
receiving response...
verifying hash...
assigned to stream f6f59c56 from target db01ac5c
finished decoding...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core...
setting initial states...
checking states for discrepancies in initial state... reference...
entering main md loop...
resuming from step 80491
date time tpf ns/day frames steps
Aug/30 12:34:10PM 1:29:44 4.01 0.65 81466
@kyleabeauchamp : We're on ocores v20 now!
Yes I saw
OK. I'm working out the most rapid method for doing this, based on gpu-3-9, and then I'll schedule the reservation, assuming I can determine the correct way to do that for just one queue.
I believe I have a fairly rapid deployment method. Worst case, let's say a morning of GPU downtime (and I'm padding that for unexpected items).
I'm asking Adaptive for the precise Moab "gpu resource only" reservation syntax. Alternatively, I believe I can disable the gpu Torque queue.
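If the Moab reservation syntax doesn't come through, the plain Torque route would look roughly like this (a sketch):

qmgr -c "set queue gpu started = False"   # queued jobs stay queued; running jobs are left to finish
qstat -Q gpu                              # watch the running-job count drain to zero
# ...perform the driver update...
qmgr -c "set queue gpu started = True"    # reopen the queue afterwards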
Great! Let us know what morning you'd like to reserve for this.
I will wait to see if Adaptive answers my question first, but tentatively let's assume Thursday morning as a target. Worst case, as you note, the nvidia upgrade will kill the jobs using the driver, but I'd like to learn the more graceful method.
Rescheduling. No answer to my question yet and too many other items interfering at the moment.
(My question to Adaptive about just reserving the gpu queue without disabling it)
Understood. No problem at all!
This is proceeding now, taking advantage of the nodes idled for #100.
I believe this is ready for a manual test if you can pick a few nodes to ssh to while the scheduler remains offlined. I believe I've:
Updated the module and API tools, pushed around the modules data, and left 6.0 as the default.
Added the 6.5 CUDA toolkit and samples to all nodes in /usr/local/cuda-6.5.
Can you advise if you see any errors?
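For a quick sanity check across all nodes at once, something like this should work (a sketch; the ROCKS appliance name and module path are assumptions):

rocks run host compute "cat /proc/driver/nvidia/version | head -1"                          # loaded driver on every node
rocks run host compute "ls -d /usr/local/cuda-6.5 /usr/share/Modules/modulefiles/cuda/6.5"  # toolkit and module file present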
nvidia-smi is showing all nodes updated to driver 340.29, but I'm having some issues with the ocore execution that I believe are unrelated. Let me do further testing.
OK, just make sure I've pushed the cuda modules stuff right.
Remember, the cuda module still defaults to 6.0.
Shouldn't matter for the new drivers---code compiled with old CUDA should work with new drivers.
Having trouble with a CUDA 6.0-compiled version of OpenMM finding the CUDA platform as well.
@kyleabeauchamp and @pgrinaway : Maybe you could lightly test versions of OpenMM you have compiled to see which Platforms are found and can run? You will have to ssh directly to a node to test.
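A minimal check of which platforms an install can see, assuming the simtk.openmm-era Python API (sketch):

python -c "import simtk.openmm as mm; print [mm.Platform.getPlatform(i).getName() for i in range(mm.Platform.getNumPlatforms())]"
# a healthy node should list something like Reference, CPU, CUDA, and OpenCL, depending on what was compiled in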
Did it work on gpu-3-9 when I upgraded there first? The method was the same...
OK, just realized I had compiled against 5.5. I think Kyle is giving it a try, but I can as well after a recompile.
Don't recompile! Everything should be forward compatible. We're explicitly wanting to test that.
Actually, I wonder if that is true. Could there be an issue if the rest of OpenMM was compiled with one version of nvcc but a different version of nvcc is used during execution to produce the dynamically compiled code?
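The runtime side can at least be pinned down: the CUDA platform shells out to nvcc for its kernels, and OpenMM honors an environment variable for choosing which one (a sketch; check the variable name against the installed OpenMM's docs):

which nvcc && nvcc --version                               # whichever nvcc the loaded module puts on PATH is what gets used
export OPENMM_CUDA_COMPILER=/usr/local/cuda-6.0/bin/nvcc   # pin a specific nvcc explicitly if needed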
Ok cool.
running make test right now
Sadly, I had already deleted my OpenMM build, so I rebuilt the latest git OpenMM on that node using CUDA 6.5 and things seem OK. I have all 4 platforms in Python and the tests look fine so far (still running the rest). I can't comment on backwards compatibility, but @pgrinaway seems to have no issues yet.
-bash-4.1$ ./TestOpenCLFFT
Done
-bash-4.1$ ./TestCudaNonbondedForce
Done
-bash-4.1$ ./TestOpenCLNonbondedForce
Done
I only have 3 platforms (not OpenCL) and everything is looking good so far
Can you give me an offline node id I can use to continue to test the CUDA driver upgrade? I don't want to cause any additional issues by accidentally working with a node that is online outside of the batch queue.
@pgrinaway : Did you compile the OpenCL platform? Maybe compare notes with @kyleabeauchamp?
@jchodera here is the list of nodes still offline:
$ pbsnodes -nl
gpu-1-14   offline   reducing IO
gpu-1-15   offline   reducing IO
gpu-1-16   offline   reducing IO
gpu-1-17   offline   reducing IO
gpu-2-4    offline   reducing IO
gpu-2-5    offline   reducing IO
gpu-2-6    offline   reducing IO
gpu-2-7    offline   reducing IO
gpu-2-8    offline   reducing IO
gpu-2-9    offline   reducing IO
gpu-2-10   offline   reducing IO
gpu-2-11   offline   reducing IO
gpu-2-12   offline   reducing IO
gpu-2-13   offline   reducing IO
gpu-2-14   offline   reducing IO
gpu-2-15   offline   reducing IO
gpu-2-16   offline   reducing IO
gpu-2-17   offline   reducing IO
gpu-3-8    offline   reducing IO
gpu-3-9    offline
I'll test on gpu-3-9 right now. Thanks!
@jchodera No, I didn't compile OpenCL. The final results of the test are:
105 - TestCudaNonbondedForceMixed (Failed)
106 - TestCudaNonbondedForceDouble (Failed)
160 - TestCudaAmoebaGeneralizedKirkwoodForceSingle (Failed)
161 - TestCudaAmoebaGeneralizedKirkwoodForceMixed (Failed)
162 - TestCudaAmoebaGeneralizedKirkwoodForceDouble (Failed)
Not sure what to make of this (this was on gpu-1-14, btw). Could be just stochastic test failures?
I don't believe these tests are stochastic.
@kyleabeauchamp : can you also test your version with make test?
You can use gpu-3-9 for these tests.
I have zero CUDA failures, CL still running
Not urgent: install the production release to replace the CUDA 6.5 beta module when there is time to do so.
https://developer.nvidia.com/cuda-toolkit
CUDA 6.0 should remain the default CUDA until we have tested with 6.5 production release.
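When there is time, the production toolkit can be dropped in without touching the driver or the default module (a sketch; the runfile name and flag spellings should be verified against the download page and the installer's -help output):

sh cuda_6.5.14_linux_64.run -silent -toolkit -toolkitpath=/usr/local/cuda-6.5
# the driver and the modules .version default (6.0) are left untouched until 6.5 has been tested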