cBio / cbio-cluster

MSKCC cBio cluster documentation

CUDA 6.5 production release is out and should be installed #91

Closed jchodera closed 10 years ago

jchodera commented 10 years ago

Not urgent; install to replace CUDA 6.5 beta module when there is time to do so.

https://developer.nvidia.com/cuda-toolkit

CUDA 6.0 should remain the default CUDA until we have tested with 6.5 production release.

tatarsky commented 10 years ago

Noted it as well. I will get it when I am back from D.C.

tatarsky commented 10 years ago

Will attempt in the morning. Does notice need to be given in case it somehow interferes with 6.0? (Seems unlikely, but I'm asking in case very important jobs are running.)

jchodera commented 10 years ago

If this is installed as a module under /usr/local/cuda-6.5/, there should be no problem.

If there is a driver update required, that may cause running code to die, but that should be OK---I believe all our jobs can be recovered and resumed easily.

tatarsky commented 10 years ago

I'll try it in a VM to answer the above.

jchodera commented 10 years ago

It does look like drivers are included: https://developer.nvidia.com/cuda-downloads

Q: Are the latest NVIDIA drivers included in the CUDA Toolkit installers?
A: For convenience, the installer packages on this page include NVIDIA drivers which support application development for all CUDA-capable GPUs supported by this release of the CUDA Toolkit. If you are deploying applications on NVIDIA Tesla products in a server or cluster environment, please use the latest recommended Tesla driver that has been qualified for use with this version of the CUDA Toolkit. If a recommended Tesla driver is not yet available, please check back in a few weeks.

Release notes are here: http://docs.nvidia.com/cuda/pdf/CUDA_Toolkit_Release_Notes.pdf

Seems complex enough that a VM dry run is indeed advisable.

tatarsky commented 10 years ago

So the installer asks whether to install the driver, and if you decline it advises at the end that you need at least driver version 340.00 to actually use the toolkit (the toolkit itself can be installed regardless, and I am working on the module config).

The current node nvidia driver is 331.62, so altering that on the nodes makes this a bit more of a project. Not too difficult, but we should clearly decide:

  1. Do you want the driver upgraded? (I know there have been stability issues in the past, and I believe this version has been operating well for people.)
  2. We really should first discuss how we intend to move forward on upgrades generically, and how to test said upgrades to prevent cluster instability. That's probably a bigger conversation than this git issue.
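
For reference, the unattended, toolkit-only install I have in mind would look roughly like the sketch below; the runfile name and flag spellings are from memory, so confirm them against the installer's own -help output before running anything:

# run as root on a node; installs the toolkit and samples only, skipping the bundled driver
sh cuda_6.5.14_linux_64.run -silent -toolkit -toolkitpath=/usr/local/cuda-6.5 \
    -samples -samplespath=/usr/local/cuda-6.5/samples
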
tatarsky commented 10 years ago

The module config is in test on mskcc-ln1, but due to the driver issue noted above it is probably not overly interesting yet. I added the default .version for 6.0 and a 6.5 module file. Head node only for now while the above is discussed/planned.
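
For clarity, loading the new cuda/6.5 module is only meant to prepend the 6.5 paths; it is roughly equivalent to the following shell (the exact variables the modulefile sets may differ slightly):

export CUDA_HOME=/usr/local/cuda-6.5
export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH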

jchodera commented 10 years ago

How about we schedule a driver update time by making a queue reservation to drain all jobs from the gpu queue before the driver update?

I don't think there's an easy way to do driver testing without upgrading a subset of the nodes and creating new driver-specific node properties to allow testing, but this sounds like a lot of unwarranted effort at this stage. I'd suggest we pick a time in the next week (up to you) and just install the latest driver across all nodes, reverting in the small chance that this ends up being a train wreck.

If node reboot (rather than modprobe) is required, this is more serious, though.

tatarsky commented 10 years ago

I'm looking at what was done for the last nvidia upgrade. Appears to have been pushed as a src build to all nodes. So the nvidia-331.62 source tree does remain available for revert. Let me mull a good schedule target.

tatarsky commented 10 years ago

I'd like to try to drain one node in the gpu queue just to manually test the upgrade steps in a ROCKS setting. I think I can do that.

jchodera commented 10 years ago

OK! Keep us advised.

tatarsky commented 10 years ago

As I sit on hold with Dell for a failed drive, I noticed gpu-3-9 go idle, so I Torque offlined it, updated the nvidia driver, and pushed over the 6.5 module and cuda 6.5 libraries.

If possible please manually ssh to gpu-3-9 and test your GPU code. It will remain Torque offline during this process (so we don't have jobs scheduled there in case it's different or problematic).

I continue to determine the fastest way to do what I describe here on all nodes.
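
For the record, the per-node dance is roughly the following (the note text is arbitrary; gpu-3-9 is just the node I drained):

pbsnodes -o -N "nvidia 340.29 / cuda 6.5 upgrade" gpu-3-9   # mark offline so no new jobs land
# ...install the driver, push /usr/local/cuda-6.5 and the module file, test...
pbsnodes -c gpu-3-9                                         # clear the offline flag when happy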

Back to hold music.

kyleabeauchamp commented 10 years ago

I had some issues:

-bash-4.1$ nvidia-smi 
Failed to initialize NVML: Unknown Error
jchodera commented 10 years ago

Same here:

[chodera@gpu-3-9 ~]$ nvidia-smi
Failed to initialize NVML: Unknown Error

It looks like nvidia-smi may not have been updated?

[chodera@gpu-3-9 ~]$ which nvidia-smi
/usr/bin/nvidia-smi
[chodera@gpu-3-9 ~]$ ls -ltr /usr/bin/nvidia-smi
-rwxr-xr-x 1 root root 224904 May  6 16:03 /usr/bin/nvidia-smi
jchodera commented 10 years ago

Mucking with modules doesn't fix this:

[chodera@gpu-3-9 ~]$ module list
Currently Loaded Modulefiles:
  1) gcc/4.8.1(default)        2) cmake/2.8.10.2(default)   3) cuda/6.0(default)         4) mpich2_eth/1.5            5) cuda/5.5
[chodera@gpu-3-9 ~]$ module unload cuda/5.5
[chodera@gpu-3-9 ~]$ module load cuda/6.5
[chodera@gpu-3-9 ~]$ nvidia-smi
Failed to initialize NVML: Unknown Error
tatarsky commented 10 years ago

I am showing nvidia-smi is coming from another package, which would also need to be updated.
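
A quick way to confirm which package owns it on these CentOS-based ROCKS nodes (if it came from the earlier source-tree push rather than an rpm, this will report the file as not owned by any package):

rpm -qf /usr/bin/nvidia-smi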

Dell is now talking to me and so I will look at this more later.

tatarsky commented 10 years ago

Please try again while the music plays some more.

kyleabeauchamp commented 10 years ago

Looking good to me:

-bash-4.1$ nvidia-smi
Sat Aug 30 12:21:24 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.29     Driver Version: 340.29         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   39C    P0    N/A /  N/A |     10MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   39C    P0    N/A /  N/A |     10MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 32%   45C    P0    N/A /  N/A |     10MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 31%   42C    P0    N/A /  N/A |     10MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
-bash-4.1$ ${HOME}/ocore/ocore_601_CUDA_v18  -target  91f51e8a-7b40-4adc-8ca9-38786a9fe654 
                                          O              O                     
   P R O T E N E E R     C--N              \              \               N    
                         |                  C              C=O           / \-C 
                         C                 /               |          N-C     \
  .C-C                 C/                  C               C           |      C
 /    \          O     |                   |               /           N      |
C     C          |     |           O       C              C                 /-C
 \_N_/ \   N    _C_    C           |      /         O    /                 C   
        C-/ \_C/   \N-/ \    N   /-C-\   C          |    |           O    /    
        |     |           C-/ \C/     N-/ \_   N\  /C\  -C      N    |    |    
        O     |           |    |            \C/  C/   N/  \_C__/ \   C-\  C    
              C           O    |             |   |          |     C-/   N/ \-C
               \_C             C             O   |          O     |          | 
                  \             \-O              C                C          O 
                  |                               \                \           
                  C    N         Folding@Home      C--N             C          
                   \   |            OCore          |                |          
                    N--C                           O                |          
                        \        Yutong Zhao                       C=O        
                         N    proteneer@gmail.com                 /           
                                                                 O            
                                  version 18                   
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc cc.proteneer.com... 
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu... 
preparing to start stream...
receiving response...
verifying hash...
assigned to stream fccb0bc9 from target db01ac5c
finished decodiing...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core...                      
setting initial states...
checking states for discrepancies...
entering main md loop...
resuming from step 9050
  date       time       tpf   ns/day  frames      steps
Aug/30 12:22:28PM   1:28:49     4.05       0      10340 
tatarsky commented 10 years ago

OK, I will look at the process for doing this more rapidly in series and schedule it for next week.

jchodera commented 10 years ago

Works for me too!

[chodera@gpu-3-9 ~]$ nvidia-smi
Sat Aug 30 12:23:17 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.29     Driver Version: 340.29         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 47%   63C    P0    N/A /  N/A |     92MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   34C    P8    N/A /  N/A |     11MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 30%   40C    P8    N/A /  N/A |     11MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 30%   37C    P8    N/A /  N/A |     11MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
+-----------------------------------------------------------------------------+

CUDA:

[chodera@gpu-3-9 ~/ocores]$ ./ocore_601_CUDA_v20
                                          O              O                     
   P R O T E N E E R     C--N              \              \               N    
                         |                  C              C=O           / \-C 
                         C                 /               |          N-C     \
  .C-C                 C/                  C               C           |      C
 /    \          O     |                   |               /           N      |
C     C          |     |           O       C              C                 /-C
 \_N_/ \   N    _C_    C           |      /         O    /                 C   
        C-/ \_C/   \N-/ \    N   /-C-\   C          |    |           O    /    
        |     |           C-/ \C/     N-/ \_   N\  /C\  -C      N    |    |    
        O     |           |    |            \C/  C/   N/  \_C__/ \   C-\  C    
              C           O    |             |   |          |     C-/   N/ \-C
               \_C             C             O   |          O     |          | 
                  \             \-O              C                C          O 
                  |                               \                \           
                  C    N         Folding@Home      C--N             C          
                   \   |            OCore          |                |          
                    N--C                           O                |          
                        \        Yutong Zhao                       C=O        
                         N    proteneer@gmail.com                 /           
                                                                 O            
                                  version 20                   
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc... 
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu... 
preparing to start stream...
receiving response...
verifying hash...
assigned to stream f6f59c56 from target db01ac5c
finished decoding...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core... 
setting initial states...
checking states for discrepancies in initial state... reference... 
entering main md loop...
resuming from step 13691
  date       time       tpf   ns/day  frames      steps
Aug/30 12:24:35PM     54:05     6.66    0.12      15039
Aug/30 12:25:36PM     21:31    16.73    0.18      22983

OpenCL:

[chodera@gpu-3-9 ~/ocores]$ ./ocore_601_OpenCL_v20 
                                          O              O                     
   P R O T E N E E R     C--N              \              \               N    
                         |                  C              C=O           / \-C 
                         C                 /               |          N-C     \
  .C-C                 C/                  C               C           |      C
 /    \          O     |                   |               /           N      |
C     C          |     |           O       C              C                 /-C
 \_N_/ \   N    _C_    C           |      /         O    /                 C   
        C-/ \_C/   \N-/ \    N   /-C-\   C          |    |           O    /    
        |     |           C-/ \C/     N-/ \_   N\  /C\  -C      N    |    |    
        O     |           |    |            \C/  C/   N/  \_C__/ \   C-\  C    
              C           O    |             |   |          |     C-/   N/ \-C
               \_C             C             O   |          O     |          | 
                  \             \-O              C                C          O 
                  |                               \                \           
                  C    N         Folding@Home      C--N             C          
                   \   |            OCore          |                |          
                    N--C                           O                |          
                        \        Yutong Zhao                       C=O        
                         N    proteneer@gmail.com                 /           
                                                                 O            
                                  version 20                   
===============================================================================
setting checkpoint interval to 7200 seconds
sleeping for 1 seconds
preparing for assignment...
connecting to cc... 
assigning core to a stream...ok
connecting to scv vspg11.stanford.edu... 
preparing to start stream...
receiving response...
verifying hash...
assigned to stream f6f59c56 from target db01ac5c
finished decoding...
deserializing system... state... integrator...
preparing the system for simulation...
system has 55678 atoms, 6 types of forces.
creating contexts: reference... core... 
setting initial states...
checking states for discrepancies in initial state... reference... 
entering main md loop...
resuming from step 80491
  date       time       tpf   ns/day  frames      steps
Aug/30 12:34:10PM   1:29:44     4.01    0.65      81466
jchodera commented 10 years ago

@kyleabeauchamp : We're on ocores v20 now!

kyleabeauchamp commented 10 years ago

Yes I saw

tatarsky commented 10 years ago

OK. Working out the most rapid way of doing this, using gpu-3-9. Then I'll schedule the reservation, assuming I can determine the correct way to do that for just one queue.

tatarsky commented 10 years ago

Believe I have a fairly rapid deployment method. Worst case, let's say a morning of GPU downtime (and I'm padding that for unexpected items).

Asking Adaptive for the precise Moab "gpu resource only" reservation syntax. Alternately, I believe I can disable the gpu Torque queue.
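
If Adaptive doesn't come back with reservation syntax, the fallback of closing the gpu queue to new job starts while running jobs drain would be something like this, using standard Torque qmgr attributes (run on the server host as root):

qmgr -c "set queue gpu started = False"   # no new jobs start; running jobs drain off
# ...driver upgrade window...
qmgr -c "set queue gpu started = True"    # reopen the queue afterwards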

jchodera commented 10 years ago

Great! Let us know what morning you'd like to reserve for this.

tatarsky commented 10 years ago

I will wait to see if Adaptive answers my question first, but tentatively let's assume Thursday morning as a target. Worst case, as you note, the nvidia upgrade will kill the jobs using the driver, but I'd like to learn the more graceful method.

tatarsky commented 10 years ago

Rescheduling. No answer to my question yet and too many other items interfering at the moment.

tatarsky commented 10 years ago

(My question to Adaptive about just reserving the gpu queue without disabling it)

jchodera commented 10 years ago

> Rescheduling. No answer to my question yet and too many other items interfering at the moment.

Understood. No problem at all!

tatarsky commented 10 years ago

This is proceeding now, taking advantage of the nodes idled by #100.

tatarsky commented 10 years ago

I believe this is ready for a manual test if you can pick a few nodes to ssh to as the scheduler remains offlined. I believe I've:

  1. Updated the module and API tools.
  2. Pushed around the modules data and left 6.0 as the default.
  3. Added the 6.5 cuda toolkit and samples to all nodes in /usr/local/cuda-6.5.

Can you advise if you see any errors?
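
A minimal per-node check would be something like the following (the expected driver version is the 340.29 already seen on gpu-3-9):

module load cuda/6.5
which nvcc        # expect /usr/local/cuda-6.5/bin/nvcc
nvcc --version    # expect release 6.5
nvidia-smi        # expect Driver Version: 340.29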

jchodera commented 10 years ago

nvidia-smi is showing all nodes updated to driver 340.29, but I'm having some issues with the ocore execution that I believe are unrelated. Let me do further testing.

tatarsky commented 10 years ago

OK, just make sure I've pushed the cuda modules stuff right.

Remember the cuda module still defaults to 6.0.

jchodera commented 10 years ago

Shouldn't matter for the new drivers---code compiled with old CUDA should work with new drivers.

Having trouble with a CUDA 6.0-compiled version of OpenMM finding the CUDA platform as well.

@kyleabeauchamp and @pgrinaway : Maybe you could lightly test versions of OpenMM you have compiled to see which Platforms are found and can run? You will have to ssh directly to a node to test.
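
A quick one-liner for that check, assuming the simtk.openmm bindings from your build are on PYTHONPATH on the node:

python -c "import simtk.openmm as mm; print([mm.Platform.getPlatform(i).getName() for i in range(mm.Platform.getNumPlatforms())])"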

tatarsky commented 10 years ago

Did it work on gpu-3-9 when I upgraded there first? The method was the same...

pgrinaway commented 10 years ago

Ok, just realized I had compiled against 5.5. I think Kyle is giving it a try, but I can as well after a recompile.

jchodera commented 10 years ago

Don't recompile! Everything should be forward compatible. We're explicitly wanting to test that.

jchodera commented 10 years ago

Actually, I wonder if that is true. Could there be an issue if the rest of OpenMM was compiled with one version of nvcc, but a different version of nvcc is now used during execution to produce dynamically-compiled code?
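
If that turns out to be the problem, I believe the runtime compiler can be pinned explicitly; if memory serves, OpenMM honors an OPENMM_CUDA_COMPILER environment variable (worth double-checking against the docs for the version you built):

# point the CUDA platform's runtime compilation at the 6.0 nvcc the library was built against
export OPENMM_CUDA_COMPILER=/usr/local/cuda-6.0/bin/nvcc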

pgrinaway commented 10 years ago

Ok cool.

pgrinaway commented 10 years ago

running make test right now

kyleabeauchamp commented 10 years ago

Sadly I had already deleted my openmm build, so I rebuilt the latest Git of OpenMM on that node using CUDA 6.5, and things seem OK. I have all 4 platforms in python and tests look fine so far (still running the rest). I can't comment on backwards compatibility, but @pgrinaway seems to have no issues yet.

-bash-4.1$ ./TestOpenCLFFT 
Done
-bash-4.1$ ./TestCudaNonbondedForce 
Done
-bash-4.1$ ./TestOpenCLNonbondedForce 
Done
pgrinaway commented 10 years ago

I only have 3 platforms (not OpenCL) and everything is looking good so far

jchodera commented 10 years ago

Can you give me an offline node id I can use to continue to test the CUDA driver upgrade? I don't want to cause any additional issues by accidentally working with a node that is online outside of the batch queue.

jchodera commented 10 years ago

@pgrinaway : Did you compile the OpenCL platform? Maybe compare notes with @kyleabeauchamp?

tatarsky commented 10 years ago

@jchodera here is the list still offline

$ pbsnodes -nl
gpu-1-14             offline         reducing IO
gpu-1-15             offline         reducing IO
gpu-1-16             offline         reducing IO
gpu-1-17             offline         reducing IO
gpu-2-4              offline         reducing IO
gpu-2-5              offline         reducing IO
gpu-2-6              offline         reducing IO
gpu-2-7              offline         reducing IO
gpu-2-8              offline         reducing IO
gpu-2-9              offline         reducing IO
gpu-2-10             offline         reducing IO
gpu-2-11             offline         reducing IO
gpu-2-12             offline         reducing IO
gpu-2-13             offline         reducing IO
gpu-2-14             offline         reducing IO
gpu-2-15             offline         reducing IO
gpu-2-16             offline         reducing IO
gpu-2-17             offline         reducing IO
gpu-3-8              offline         reducing IO
gpu-3-9              offline

jchodera commented 10 years ago

I'll test on gpu-3-9 right now. Thanks!

pgrinaway commented 10 years ago

@jchodera No, I didn't compile OpenCL. The final results of the tests are:

    105 - TestCudaNonbondedForceMixed (Failed)
    106 - TestCudaNonbondedForceDouble (Failed)
    160 - TestCudaAmoebaGeneralizedKirkwoodForceSingle (Failed)
    161 - TestCudaAmoebaGeneralizedKirkwoodForceMixed (Failed)
    162 - TestCudaAmoebaGeneralizedKirkwoodForceDouble (Failed)

Not sure what to make of this (this was on gpu-1-14, btw). Could be just stochastic test failures?

jchodera commented 10 years ago

I don't believe these tests are stochastic.

@kyleabeauchamp : can you also test your version with make test?
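
To see whether they are reproducible, re-running only the failing tests from the build directory with standard ctest options is cheap (repeat a few times; deterministic failures will show up every run):

ctest -R "CudaNonbondedForce|CudaAmoebaGeneralizedKirkwoodForce" --output-on-failure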

jchodera commented 10 years ago

You can use gpu-3-9 for these tests.

kyleabeauchamp commented 10 years ago

I have zero CUDA failures; OpenCL is still running.