cBio / cbio-cluster

MSKCC cBio cluster documentation

CUDA+OpenMM problems #258

Closed danielparton closed 9 years ago

danielparton commented 9 years ago

I'm having trouble getting some OpenMM simulations to run. My guess is that it is something to do with the cluster environment or the way I am submitting the jobs (rather than a software problem), but I'm really not sure at this point.

I've been testing with single-GPU jobs, using the qsub options -q gpu -l procs=1,gpus=1:shared. Each job performs a series of consecutive OpenMM simulations, each of which normally lasts around 5 minutes. The typical behavior is for ~6-10 simulations to complete successfully, after which one of two things happens:

  1. the next simulation completes a few iterations, then simply hangs
  2. the next simulation completes a few iterations, then returns an exception as follows:
File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 11499, in step
    return _openmm.LangevinIntegrator_step(self, *args)
Exception: Error downloading array posq: Invalid error code (700)

In the second case, my Python wrapper program continues to execute, since the OpenMM call is wrapped with an exception-handler. All subsequent simulations then fail upon initialization with the following exception message:

File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 4879, in __init__
    this = _openmm.new_Context(*args)
Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE (101) at /cbio/jclab/home/parton/opt/openmm/openmm/platforms/cuda/src/CudaContext.cpp:149

In both cases I have been able to complete the same simulation on a second attempt, using the same GPU. If the program is left to run, then the same problem occurs a few simulations later.

Also, these errors only seem to occur with explicit-solvent MD simulations. My implicit-solvent MD simulations (which also use the OpenMM CUDA implementation) have been running without any problems.

Any suggestions on what might be the issue, or how I could debug this further?
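
For reference, the wrapper's simulation loop is structured roughly like this (a minimal sketch with illustrative names; not the actual Ensembler code):

from simtk.openmm import app
import simtk.openmm as mm
from simtk import unit as u

def run_one_simulation(system, topology, positions):
    # One short explicit-solvent simulation on the CUDA platform (~5 min each).
    integrator = mm.LangevinIntegrator(300*u.kelvin, 1.0/u.picoseconds, 2.0*u.femtoseconds)
    platform = mm.Platform.getPlatformByName('CUDA')
    simulation = app.Simulation(topology, system, integrator, platform)
    simulation.context.setPositions(positions)
    simulation.step(250000)

models = []  # placeholder: in practice, the list of models to refine
for model in models:
    try:
        run_one_simulation(model.system, model.topology, model.positions)
    except Exception as e:
        # The OpenMM call is wrapped in an exception handler, so the remaining
        # models are still attempted; this is where the later
        # CUDA_ERROR_INVALID_DEVICE failures show up.
        print('Simulation failed: %s' % e)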

Environment details

kyleabeauchamp commented 9 years ago

Here is my interpretation:

The "Error downloading array" exception seems to me like a potential simulation instability, which then somehow keeps that GPU from running additional jobs. The stochastic nature of things is presumably due to different GPUs, random seeds, or initial velocities.

jchodera commented 9 years ago

I'm curious why we're using OpenMM 6.1 and not 6.2.

jchodera commented 9 years ago

I believe the error handling was improved in 6.2, so that the error messages may be more informative (e.g. if @kyleabeauchamp's hypothesis is correct, you'll see info about positions becoming NaN).
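
If it would help to test that hypothesis directly (even on 6.1), one option is to pull positions back to the host between chunks of dynamics and check for non-finite values; a hypothetical helper along these lines:

import numpy as np
from simtk import unit

def check_for_blowup(simulation):
    # Hypothetical helper: raise if any particle position has gone non-finite,
    # which is the usual signature of an unstable (blown-up) simulation.
    state = simulation.context.getState(getPositions=True, getEnergy=True)
    positions = state.getPositions(asNumpy=True).value_in_unit(unit.nanometers)
    if not np.all(np.isfinite(positions)):
        raise RuntimeError('Non-finite positions detected; potential energy = %s'
                           % state.getPotentialEnergy())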

jchodera commented 9 years ago

Once you get to the Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE stage, the driver (or node) probably needs to be restarted. GPUs locked in this state may eat subsequent jobs.

Do you have info on which specific nodes/gpus have been giving you that particular error?
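
For tracking that down, something like the following (hypothetical) snippet at the top of each job could log the host and the assigned GPU, assuming Torque exports CUDA_VISIBLE_DEVICES for gpus=1 jobs:

import os
import socket
import subprocess

# Log which node and GPU this job landed on, so failures can be correlated
# with specific hardware. (CUDA_VISIBLE_DEVICES being set is an assumption
# about the scheduler configuration.)
print('host=%s CUDA_VISIBLE_DEVICES=%s'
      % (socket.gethostname(), os.environ.get('CUDA_VISIBLE_DEVICES', 'unset')))
print(subprocess.check_output(['nvidia-smi', '--query-gpu=index,name,driver_version',
                               '--format=csv,noheader']))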

jchodera commented 9 years ago

It's possible some of the problematic nodes may just need a driver reload or reboot, too. Our uptimes are coming close to ~1 year:

[chodera@gpu-1-10 ~]$ uptime
 16:55:10 up 297 days,  2:16,  1 user,  load average: 32.12, 32.11, 31.99
jchodera commented 9 years ago

Can you also test on gpu-3-9? This has been taken out of the gpu queue so we can experiment with driver stability. The new driver may have stability improvements over the old one even with no further changes (#235).

You can just ssh to it and try this (being mindful to keep the number of simultaneous threads <= 4).

tatarsky commented 9 years ago

Correct (uptimes nearing a year). Which of the following paths would you like to walk down?

  1. I can offline the node in Torque and when all GPU and regular jobs drain I can reboot it.
  2. I can reserve the GPUs and when the GPU jobs drain I can attempt to reload the nvidia driver.

This is a Sunday response: I may not actually get to the requested item today, but it's usually easier to drain on a weekend.

jchodera commented 9 years ago

Uptimes, for reference:

[chodera@mskcc-ln1 ~/scripts]$ ./check-node-uptimes.tcsh
gpu-1-4 297 days
gpu-1-5 297 days
gpu-1-6 297 days
gpu-1-7 297 days
gpu-1-8 297 days
gpu-1-9 297 days
gpu-1-10 297 days
gpu-1-11 297 days
gpu-1-12 297 days
gpu-1-13 297 days
gpu-1-14 297 days
gpu-1-15 297 days
gpu-1-16 297 days
gpu-1-17 297 days
gpu-2-4 297 days
gpu-2-5 297 days
gpu-2-6 297 days
gpu-2-7 297 days
gpu-2-8 297 days
gpu-2-9 297 days
gpu-2-10 297 days
gpu-2-11 297 days
gpu-2-12 297 days
gpu-2-13 297 days
gpu-2-14 297 days
gpu-2-15 297 days
gpu-2-16 297 days
gpu-2-17 297 days
gpu-3-8 297 days
gpu-3-9 297 days
jchodera commented 9 years ago

(With many kudos to @tatarsky for bringing stability into our lives!)

danielparton commented 9 years ago

I do have info on which specific nodes/GPUs were in use.

However, after each job finished, the CUDA driver was apparently usable again. In both cases, I was able to run the same Ensembler job on the same GPU, and the same behavior occurred. A few simulations completed, then a similar error occurred.

tatarsky commented 9 years ago

Stability thanks -> tips hat and re-tips to NJ datacenter folks ;)

jchodera commented 9 years ago

I think Torque has a script in place that resets the nvidia driver (or resets the GPU through the nvidia driver) after GPU-allocated jobs terminate.

tatarsky commented 9 years ago

That is correct.

danielparton commented 9 years ago

Could be worth rebooting one of those nodes, or restarting the CUDA driver? Then I can test my simulations and see if the problem is resolved.

I can also try again with OpenMM 6.2 in case the error messages provide more info.

jchodera commented 9 years ago

I'd say our plan of action should be:

This testing would need to be done during a workday.

In the meantime, @danielparton could try OpenMM 6.2 under existing conditions to see if the improved error trapping helps, and gpu-3-9 under existing OpenMM 6.1 to see if the improved driver helps.
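
To be sure each test picks up the intended build, a quick sanity check (sketch) can print the OpenMM version and the platforms that load on the node:

from simtk import openmm

# Confirm which OpenMM build is in use and that the CUDA platform loads.
print('OpenMM version: %s' % openmm.Platform.getOpenMMVersion())
for i in range(openmm.Platform.getNumPlatforms()):
    print('Available platform: %s' % openmm.Platform.getPlatform(i).getName())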

tatarsky commented 9 years ago

Well, unlike the last several weekends the cluster at the moment is 100% slot filled. I can offline a node but let me look at the walltimes.

Noting John's comment, I have done the requested offlining on gpu-1-12 and gpu-1-9.

tatarsky commented 9 years ago

I will update this issue when those nodes drain for the next steps.

jchodera commented 9 years ago

> Well, unlike the last several weekends the cluster at the moment is 100% slot filled. I can offline a node but let me look at the walltimes.

I was thinking we would have to make a reservation for a few days in advance so that the node would drain on its own.

jchodera commented 9 years ago

> Noting John's comment, I have done the requested offlining on gpu-1-12 and gpu-1-9.

Ah, I see---thanks!

tatarsky commented 9 years ago

Offline is somewhat more useful for this class of item, as once the jobs drain, the node will be ready for you. Offlining a node allows running jobs to complete and does not schedule new ones.

jchodera commented 9 years ago

> Offline is somewhat more useful for this class of item, as once the jobs drain, the node will be ready for you. Offlining a node allows running jobs to complete and does not schedule new ones.

Sounds great, thanks!

tatarsky commented 9 years ago

Looks like 48-hour walltimes on the jobs on both nodes. Off to attempt to unclog a different sort of queue (my gutters, after a large rain storm).

jchodera commented 9 years ago

You need one of these!

danielparton commented 9 years ago

Sounds good, I'll try this out with OpenMM 6.2 now. Thanks for all the input!

danielparton commented 9 years ago

OK, I've tried this out again with OpenMM 6.2. So far the behavior has been the same (this time with an error message, rather than the hang-with-no-output behavior).

5th simulation:

  File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 12102, in step
    return _openmm.LangevinIntegrator_step(self, *args)
Exception: Error downloading array energyBuffer: Invalid error code (700)

6th and subsequent simulations:

  File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15302, in __init__
    this = _openmm.new_Context(*args)
Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE (101) at /cbio/jclab/home/parton/opt/openmm/openmm/platforms/cuda/src/CudaContext.cpp:149

Host details:

jchodera commented 9 years ago

Probably worth posting this error in the OpenMM issue tracker too.

tatarsky commented 9 years ago

gpu-1-12 and gpu-1-9 are offline and drained.

rmmod nvidia;modprobe nvidia was my next step.

Confirm that's the plan and I'll do that.

danielparton commented 9 years ago

Yep, that's the plan - confirmed in person with @jchodera Thanks!

tatarsky commented 9 years ago

Module rmmod'd and re-probed.

See if that changes anything, but I'm happy to reboot as well.

Units are NOT in Torque, so ssh in manually.

danielparton commented 9 years ago

Thanks, testing now

danielparton commented 9 years ago

Ok, still getting errors. Can we try rebooting them both?

For reference - results of testing after CUDA driver reload:

tatarsky commented 9 years ago

Rebooting gpu-1-9 first (in case the problem goes away, so we can perhaps debug gpu-1-12 a bit more).

tatarsky commented 9 years ago

gpu-1-9 is back. Try, try again. I can reboot gpu-1-12 if there's no change on a second test. That may be tomorrow, as I am out for a bit.

danielparton commented 9 years ago

Thanks! Trying out gpu-1-9 now. (No rush with gpu-1-12)

tatarsky commented 9 years ago

gpu-1-9 is acting a bit odd....

tatarsky commented 9 years ago

I've had to do a cold reset. Something wedged...

tatarsky commented 9 years ago

That's looking a bit better. I did a cold reset this time instead of a reboot. Can you retry the tests?

danielparton commented 9 years ago

No problem, trying again

tatarsky commented 9 years ago

Does the fact that it's still running indicate a good outcome?

danielparton commented 9 years ago

Unfortunately not. This time about 56 simulations ran, two of which failed with the "Error downloading array posq: Invalid error code (700)" exception. The 57th simulation then just hung with no output.

danielparton commented 9 years ago

So right now I'm still not sure whether this is an issue with the cluster environment, the OpenMM simulation software, or my Ensembler code. We had equivalent simulations running OK previously, using earlier versions of all three. Most confusing...

Anyway, I'll keep looking into this, but I'm guessing it might take a while. I would suggest that one of the two offlined nodes can be returned to the batch queue immediately. From my own point of view, it would be useful to keep one of the nodes offlined so I can continue testing. However, if anyone feels that we should bring them both back into the batch queue, that's fine - we could always drain and offline a node again at a later point if necessary.

Thanks again for the help on this! It was useful to test out the node reboot, if only to eliminate that as a possible solution.

tatarsky commented 9 years ago

I can't think of any recent changes, and we've removed Torque/Moab from the equation by running natively. I'll put gpu-1-12 back in and leave gpu-1-9 (the one we rebooted) out for now.

jchodera commented 9 years ago

This is the OpenMM 6.2 conda package from Binstar, right?

danielparton commented 9 years ago

Yes

danielparton commented 9 years ago

I'm thinking of trying out the following:

kyleabeauchamp commented 9 years ago

Here is a driver script for the openmmtools test. Some modifications may be required.

from simtk.openmm import app
import simtk.openmm as mm
from simtk import unit as u
from sys import stdout
import openmmtools

# Build the Src kinase explicit-solvent test system from openmmtools.
testsystem = openmmtools.testsystems.SrcExplicit()

integrator = mm.LangevinIntegrator(300*u.kelvin, 1.0/u.picoseconds, 1.0*u.femtoseconds)

simulation = app.Simulation(testsystem.topology, testsystem.system, integrator)
simulation.context.setPositions(testsystem.positions)

print('Minimizing...')
simulation.minimizeEnergy()

print('Equilibrating...')
simulation.context.setVelocitiesToTemperature(300*u.kelvin)
simulation.step(100)

simulation.reporters.append(app.DCDReporter('trajectory.dcd', 1000))
simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True,
    potentialEnergy=True, temperature=True, progress=True, remainingTime=True,
    speed=True, totalSteps=1000, separator='\t'))

print('Running production...')
simulation.step(1000)
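
When running this by hand on a drained node (outside Torque), it may also be worth pinning the test to one GPU at a time so failures can be attributed to a specific device; for example, the Simulation above could instead be constructed with explicit CUDA platform properties (sketch; device index 0 is just an example):

# Optional: pin the test to a specific GPU when running outside Torque.
platform = mm.Platform.getPlatformByName('CUDA')
properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}
simulation = app.Simulation(testsystem.topology, testsystem.system, integrator,
                            platform, properties)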
jchodera commented 9 years ago

Do we think only these nodes are problematic? If so, maybe you can restrict to gtxtitan nodes for finishing the project and testing whether there are failures on those nodes too?

danielparton commented 9 years ago

Why restrict to the gtxtitan nodes (aside from testing whether the errors occur on those nodes too)? Just because the GPUs are faster? Just trying to clarify.

The errors have occurred on all nodes tested, which represents maybe 10 different nodes. But I don't know if that included the gtxtitan nodes. I can try to work that out.

tatarsky commented 9 years ago

I show the Titan nodes also have more GPU memory. I don't know, however, if that's related. Just noting.

GeForce GTX 680    4095MiB
GeForce GTX TITAN  6143MiB
tatarsky commented 9 years ago

If you want a Titan offlined BTW, just shout.