Here is my interpretation:
The "Error downloading array" error looks to me like a potential simulation instability, which then somehow keeps that GPU from running additional jobs. The stochastic nature of things is presumably due to different GPUs, random seeds, or initial velocities.
I'm curious why we're using OpenMM 6.1 and not 6.2.
I believe the error handling was improved in 6.2, so that the error messages may be more informative (e.g. if @kyleabeauchamp's hypothesis is correct, you'll see info about positions becoming NaN).
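In the meantime, a check along these lines would surface a blow-up even under 6.1 (a minimal sketch of my own, assuming the runs go through an app.Simulation object; the helper name and chunk size are made up):

import numpy as np
from simtk import unit as u

def step_with_nan_check(simulation, nsteps, chunk=250):
    """Advance the simulation, checking for non-finite energy/positions every `chunk` steps."""
    for start in range(0, nsteps, chunk):
        simulation.step(min(chunk, nsteps - start))
        state = simulation.context.getState(getEnergy=True, getPositions=True)
        energy = state.getPotentialEnergy().value_in_unit(u.kilojoule_per_mole)
        positions = state.getPositions(asNumpy=True).value_in_unit(u.nanometer)
        if not np.isfinite(energy) or not np.all(np.isfinite(positions)):
            raise RuntimeError('Non-finite energy or positions detected around step %d' % (start + chunk))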
Once you get to the Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE stage, the driver (or node) probably needs to be restarted. GPUs locked in this state may eat subsequent jobs.
Do you have info on which specific nodes/GPUs have been giving you that particular error?
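If it would help to narrow things down to a particular device, something like this could be run on a suspect node to see which CUDA device indices refuse to create a Context (a rough sketch of my own; the four-GPUs-per-node count is an assumption):

import simtk.openmm as mm
from simtk import unit as u

def probe_cuda_device(device_index):
    """Try to create a tiny CUDA Context on the given device; return 'ok' or the error."""
    system = mm.System()
    system.addParticle(1.0 * u.amu)  # one dummy particle is enough to force context creation
    integrator = mm.VerletIntegrator(1.0 * u.femtoseconds)
    platform = mm.Platform.getPlatformByName('CUDA')
    try:
        context = mm.Context(system, integrator, platform,
                             {'CudaDeviceIndex': str(device_index)})
        del context
        return 'ok'
    except Exception as e:
        return 'FAILED: %s' % e

for i in range(4):  # assuming four GPUs per node
    print('device %d: %s' % (i, probe_cuda_device(i)))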
It's possible some of the problematic nodes may just need a driver reload or reboot, too. Our uptimes are coming close to ~1 year:
[chodera@gpu-1-10 ~]$ uptime
16:55:10 up 297 days, 2:16, 1 user, load average: 32.12, 32.11, 31.99
Can you also test on gpu-3-9? It has been taken out of the gpu queue so we can experiment with driver stability. The new driver may have stability improvements over the old one even with no further changes (#235).
You can just ssh to it and try this (mindful to keep the number of simultaneous threads <= 4).
Correct (uptimes nearing a year). Which of the following paths would you like to walk down?
Sunday response: I may not actually get to the requested item today, but it's usually easier to drain nodes on a weekend.
Uptimes, for reference:
[chodera@mskcc-ln1 ~/scripts]$ ./check-node-uptimes.tcsh
gpu-1-4 297 days
gpu-1-5 297 days
gpu-1-6 297 days
gpu-1-7 297 days
gpu-1-8 297 days
gpu-1-9 297 days
gpu-1-10 297 days
gpu-1-11 297 days
gpu-1-12 297 days
gpu-1-13 297 days
gpu-1-14 297 days
gpu-1-15 297 days
gpu-1-16 297 days
gpu-1-17 297 days
gpu-2-4 297 days
gpu-2-5 297 days
gpu-2-6 297 days
gpu-2-7 297 days
gpu-2-8 297 days
gpu-2-9 297 days
gpu-2-10 297 days
gpu-2-11 297 days
gpu-2-12 297 days
gpu-2-13 297 days
gpu-2-14 297 days
gpu-2-15 297 days
gpu-2-16 297 days
gpu-2-17 297 days
gpu-3-8 297 days
gpu-3-9 297 days
(With many kudos to @tatarsky for bringing stability into our lives!)
I do have info on which specific nodes/GPUs were in use.
However, after each job finished, the CUDA driver was apparently usable again. In both cases, I was able to run the same Ensembler job on the same GPU, and the same behavior occurred. A few simulations completed, then a similar error occurred.
Stability thanks -> tips hat and re-tips to NJ datacenter folks ;)
I think torque has a script in place that resets the nvidia driver (or resets the GPU through the nvidia driver) after GPU-allocated jobs terminate.
That is correct.
Could be worth rebooting one of those nodes, or restarting the CUDA driver? Then I can test my simulations and see if the problem is resolved.
I can also try again with OpenMM 6.2 in case the error messages provide more info.
I'd say our plan of action should be: offline gpu-1-12 and gpu-1-9 in torque to drain the nodes. This testing would need to be done during a workday. In the meantime, @danielparton could try OpenMM 6.2 under existing conditions to see if the improved error trapping helps, and gpu-3-9 under existing OpenMM 6.1 to see if the improved driver helps.
Well, unlike the last several weekends the cluster at the moment is 100% slot filled. I can offline a node but let me look at the walltimes.
Noting John's comment I have done the requested on gpu-1-12 and gpu-1-9.
I will update this issue when those nodes drain for the next steps.
I was thinking we would have to make a reservation for a few days in advance so that the node would drain on its own.
Ah, I see---thanks!
Offline is somewhat more useful for this class of item as if the jobs drain, the node will be ready for you. Offline allows running jobs to complete and does not schedule new ones.
Sounds great, thanks!
Looks like 48-hour walltimes on the jobs on both nodes. Off to attempt to unclog a different sort of queue (my gutters, after a large rain storm).
You need one of these!
Sounds good, I'll try this out with OpenMM 6.2 now. Thanks for all the input!
OK, I've tried this out again with OpenMM 6.2. So far the behavior has been the same (with an error message, rather than the "hang-with-no-output" behavior).
5th simulation:
File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 12102, in step
return _openmm.LangevinIntegrator_step(self, *args)
Exception: Error downloading array energyBuffer: Invalid error code (700)
6th and subsequent simulations:
File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15302, in __init__
this = _openmm.new_Context(*args)
Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE (101) at /cbio/jclab/home/parton/opt/openmm/openmm/platforms/cuda/src/CudaContext.cpp:149
Host details:
Probably worth posting this error in the OpenMM issue tracker too.
gpu-1-12 and gpu-1-9 are offline and drained.
rmmod nvidia;modprobe nvidia was my next step.
Confirm thats the plan and I'll do that.
Yep, that's the plan - confirmed in person with @jchodera Thanks!
Module rmmod'd and re-probed.
See if that changes anything but happy to reboot as well.
The units are NOT in torque, so ssh in manually.
Thanks, testing now
Ok, still getting errors. Can we try rebooting them both?
For reference - results of testing after CUDA driver reload:
gpu-1-9 - sims 1-6 completed successfully; sims 7-8 both gave the error message below; sims 9-11 completed successfully; sim 12 hangs with no further output.
File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 12102, in step
return _openmm.LangevinIntegrator_step(self, *args)
Exception: Error downloading array energyBuffer: Invalid error code (700)
Rebooting gpu-1-9 first (in case the problem goes away so we can perhaps debug gpu-1-12 a bit more)
gpu-1-9 is back. Try try again. I can reboot gpu-1-12 if no change for a second test. That may be tomorrow as I am out for a bit.
Thanks! Trying out gpu-1-9 now. (No rush with gpu-1-12)
gpu-1-9 is acting a bit odd....
I've had to do a cold reset. Something wedged...
That's looking a bit better. I did a cold reset this time instead of a reboot. Can you retry tests?
No problem, trying again
Does the fact it's still running indicate a good outcome?
Unfortunately not. This time about 56 simulations were run, two of which failed with the Exception: Error downloading array posq: Invalid error code (700) message. Then the 57th simulation was just hanging with no output.
So right now I'm still not sure whether this is an issue with the cluster environment, the OpenMM simulation software, or my Ensembler code. We had equivalent simulations running ok previously, using earlier versions of the above three things. Most confusing...
Anyway, I'll keep looking into this, but I'm guessing it might take a while. I would suggest that one of the two offlined nodes can be returned to the batch queue immediately. From my own point of view, it would be useful to keep one of the nodes offlined so I can continue testing. However, if anyone feels that we should bring them both back into the batch queue, that's fine - we could always drain and offline a node again at a later point if necessary.
Thanks again for the help on this! Was useful to test out the node reboot, if only to eliminate that as a solution.
Can't think of any recent changes, and we've removed Torque/Moab from the equation by running natively. I'll put gpu-1-12 back in and leave gpu-1-9 out for now (the one we rebooted).
This is the OpenMM 6.2 conda package from Binstar, right?
Yes
I'm thinking of trying out the following: the openmmtools Src test system, to see if that also fails. Here is a driver script for the openmmtools test. Some modifications may be required.
from simtk.openmm import app
import simtk.openmm as mm
from simtk import unit as u
from sys import stdout
import openmmtools

# Build the Src explicit-solvent test system from openmmtools
testsystem = openmmtools.testsystems.SrcExplicit()
integrator = mm.LangevinIntegrator(300*u.kelvin, 1.0/u.picoseconds, 1.0*u.femtoseconds)
simulation = app.Simulation(testsystem.topology, testsystem.system, integrator)
simulation.context.setPositions(testsystem.positions)

print('Minimizing...')
simulation.minimizeEnergy()

print('Equilibrating...')
simulation.context.setVelocitiesToTemperature(300*u.kelvin)
simulation.step(100)

simulation.reporters.append(app.DCDReporter('trajectory.dcd', 1000))
simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True,
    potentialEnergy=True, temperature=True, progress=True, remainingTime=True,
    speed=True, totalSteps=1000, separator='\t'))

print('Running production...')
simulation.step(1000)  # matches totalSteps in the StateDataReporter above
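Since the offlined nodes are being tested by hand over ssh rather than through torque, the script above can also be pinned to a particular GPU by swapping out the Simulation construction line for something like this (a sketch; the device index of 0 and mixed precision are assumptions):

# Replaces the app.Simulation(...) line above for manual runs outside torque:
# pin the test to one GPU so concurrent tests can target different devices.
platform = mm.Platform.getPlatformByName('CUDA')
properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}
simulation = app.Simulation(testsystem.topology, testsystem.system, integrator,
                            platform, properties)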
Do we think only these nodes are problematic? If so, maybe you can restrict to gtxtitan nodes for finishing the project and testing whether there are failures on those nodes too?
Why restrict to the gtxtitan nodes (aside from testing whether the errors occur on those nodes too)? Just because the GPUs are faster? Just trying to clarify.
The errors have occurred on all nodes tested, which represents maybe 10 different nodes. But I don't know if that included the gtxtitan nodes. I can try to work that out.
I show the titan nodes also have more GPU memory. I don't know, however, if that's related. Just noting:
GeForce GTX 680 4095MiB
GeForce GTX TITAN 6143MiB
If you want a Titan offlined BTW, just shout.
I'm having trouble getting some OpenMM simulations to run. My guess is that it is something to do with the cluster environment or the way I am submitting the jobs (rather than a software problem), but I'm really not sure at this point.
I've been testing with single-GPU jobs, using the following qsub options:
-q gpu -l procs=1,gpus=1:shared
Each job performs a series of consecutive OpenMM simulations, each of which normally lasts around 5 minutes. The typical behavior is for ~6-10 simulations to complete successfully, then one of two things happens: either the simulation hangs with no further output, or it fails with an "Error downloading array ...: Invalid error code (700)" exception. In the second case, my Python wrapper program continues to execute, since the OpenMM call is wrapped with an exception handler. All subsequent simulations then fail upon initialization with an "Error initializing Context: CUDA_ERROR_INVALID_DEVICE" exception.
In both cases I have been able to complete the same simulation on a second attempt, using the same GPU. If the program is left to run, then the same problem occurs a few simulations later.
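For context, the wrapper logic is roughly this pattern (a simplified sketch, not the actual Ensembler code; the names are hypothetical):

failed = []
for target in targets:  # hypothetical list of simulations to run
    try:
        run_explicit_md(target)  # hypothetical helper wrapping the OpenMM setup and integration
    except Exception as e:
        print('Simulation for %s failed: %s' % (target, e))
        failed.append(target)

# Re-running the failures on the same GPU has succeeded on the second attempt.
for target in failed:
    run_explicit_md(target)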
Also, these errors only seem to occur with explicit-solvent MD simulations. My implicit-solvent MD simulations (which also use the OpenMM CUDA implementation) have been running without any problems.
Any suggestions on what might be the issue, or how I could debug this further?
Environment details
build_mpirun_configfile