cBio / cbio-cluster

MSKCC cBio cluster documentation

CUDA+OpenMM problems #258

Closed. danielparton closed this issue 9 years ago.

danielparton commented 9 years ago

I'm having trouble getting some OpenMM simulations to run. My guess is that it is something to do with the cluster environment or the way I am submitting the jobs (rather than a software problem), but I'm really not sure at this point.

I've been testing with single-GPU jobs, using the following qsub options: -q gpu -l procs=1,gpus=1:shared. Each job performs a series of consecutive OpenMM simulations, each of which normally lasts around 5 minutes. The typical behavior is for ~6-10 simulations to complete successfully, after which one of two things happens:

  1. the next simulation completes a few iterations, then simply hangs
  2. the next simulation completes a few iterations, then returns an exception as follows:
File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 11499, in step
    return _openmm.LangevinIntegrator_step(self, *args)
Exception: Error downloading array posq: Invalid error code (700)

In the second case, my Python wrapper program continues to execute, since the OpenMM call is wrapped in an exception handler. All subsequent simulations then fail upon initialization with the following exception message:

File "/cbio/jclab/home/parton/opt/anaconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 4879, in __init__
    this = _openmm.new_Context(*args)
Exception: Error initializing Context: CUDA_ERROR_INVALID_DEVICE (101) at /cbio/jclab/home/parton/opt/openmm/openmm/platforms/cuda/src/CudaContext.cpp:149
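
For reference, a minimal sketch of the kind of exception-handling wrapper described above (the function and argument names here are hypothetical, not the actual wrapper code):

```python
def run_simulation_blocks(integrator, n_blocks=60, steps_per_block=500):
    """Advance one simulation in blocks of MD steps, trapping OpenMM
    exceptions (such as the error-700 failure above) so the surrounding
    pipeline can log the failure and continue with the next simulation."""
    try:
        for _ in range(n_blocks):
            integrator.step(steps_per_block)
        return True
    except Exception as exc:
        print('Simulation failed: %s' % exc)
        return False
```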

In both cases I have been able to complete the same simulation on a second attempt, using the same GPU. If the program is left to run, then the same problem occurs a few simulations later.

Also, these errors only seem to occur with explicit-solvent MD simulations. My implicit-solvent MD simulations (which also use the OpenMM CUDA implementation) have been running without any problems.

Any suggestions on what might be the issue, or how I could debug this further?

Environment details

jchodera commented 9 years ago

> The errors have occurred on all nodes tested, which represents maybe 10 different nodes. But I don't know if that included the gtxtitan nodes. I can try to work that out.

Ah, OK. Never mind then! I thought it might have been just gpu-1-9 and gpu-1-12 (which are both GTX-680 nodes).

jchodera commented 9 years ago

Idea: What about forcing the OpenCL platform for now?
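
For reference, a minimal sketch of how the platform can be forced when building the simulation; the input file and force field choices below are placeholders, not the actual setup:

```python
from simtk.openmm import app
import simtk.openmm as mm
from simtk import unit

pdb = app.PDBFile('input.pdb')  # placeholder structure file
forcefield = app.ForceField('amber99sbildn.xml', 'tip3p.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=app.PME)
integrator = mm.LangevinIntegrator(300*unit.kelvin, 1.0/unit.picoseconds,
                                   2.0*unit.femtoseconds)

# Request the OpenCL platform explicitly instead of letting OpenMM pick CUDA.
platform = mm.Platform.getPlatformByName('OpenCL')
simulation = app.Simulation(pdb.topology, system, integrator, platform)
simulation.context.setPositions(pdb.positions)
simulation.step(1000)
```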

danielparton commented 9 years ago

Will give that a try.

On the CUDA platform, errors still seem to occur with a 1 fs timestep (rather than the default of 2 fs). Next I'll try the second and third tests I mentioned above.
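
For context, the timestep being varied here is the third argument to the Langevin integrator; a sketch of the 2 fs default versus the 1 fs test case (temperature and friction values are just illustrative):

```python
import simtk.openmm as mm
from simtk import unit

# Usual 2 fs timestep...
integrator_2fs = mm.LangevinIntegrator(300*unit.kelvin, 1.0/unit.picoseconds,
                                        2.0*unit.femtoseconds)
# ...versus the more conservative 1 fs timestep used for this test.
integrator_1fs = mm.LangevinIntegrator(300*unit.kelvin, 1.0/unit.picoseconds,
                                        1.0*unit.femtoseconds)
```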

jchodera commented 9 years ago

For reference, that exception is thrown here: https://github.com/pandegroup/openmm/blob/8b69b9c029f59ced3c9f02b3bd723702ec39d9c2/platforms/cuda/src/CudaArray.cpp#L77

jchodera commented 9 years ago

We are still debugging this. At least one test run has been unable to reproduce the issue on other hardware.

tatarsky commented 9 years ago

The question was actually about whether you wanted the steps done to make CUDA 6.5 the default. Maybe I should separate that out.

tatarsky commented 9 years ago

Oh wait, now I've confused myself. Ignore the above.

jchodera commented 9 years ago

We are still debugging this. It is likely unrelated to the CUDA version issues, but we're not fully sure.

danielparton commented 9 years ago

My test script runs without errors on the cluster using the CPU platform, and on two different Linux boxes using the CUDA platform.

So it seems the problem occurs only on the cluster, and only with the OpenMM CUDA platform.

Two things I should try initially:


jchodera commented 9 years ago

It sounds from @danielparton's report that we need to reload the CUDA drivers on all nodes to clear up some reliability issues.

I'm not sure of the easiest way to roll this out without us having to wait many days for usable machines to run on.

Perhaps there is a way to stage this?

Apologies if this is a huge hassle. We can figure out whether there might be some way to do this automatically if it is clear that this is a persistent problem.

tatarsky commented 9 years ago

Can you please clarify what you mean by "reload the drivers"?

rmmod/modprobe them?

jchodera commented 9 years ago

> rmmod/modprobe them?

I believe that is correct. @danielparton will confirm.

See his comment in https://github.com/pandegroup/openmm/issues/926

Essentially, we need to do whatever your day of experimentation with @danielparton determined was able to eliminate CUDA_ERROR_INVALID_DEVICE errors.

tatarsky commented 9 years ago

OK. Looks like reload or reboot. We can prep for either method. Reboot obviously requires batch to be suspended as well, so I'll prepare for "reload nvidia driver" and wait for details from @danielparton.

danielparton commented 9 years ago

I'm pretty certain that rmmod/modprobe should be sufficient. When testing on gpu-1-9, that action seemed to resolve the CUDA_ERROR_INVALID_DEVICE errors. The other errors I was seeing - unresolved on gpu-1-9 by either reload or reboot - have since been resolved by updating and reinstalling my OpenMM installation.

tatarsky commented 9 years ago

OK. Well, I could also just disable the gpu queue globally, wait for the GPU jobs to finish, and do the above in bulk on all nodes... I show a few active users in the gpu queue at the moment.

jchodera commented 9 years ago

I think it's just Theo and @danielparton.

(username content removed but the above sentence is true)

@danielparton: Can you coordinate with Theo so @tatarsky can just do a bulk driver reload?

jchodera commented 9 years ago

(Note that he has a NIPS paper deadline, so we will want to find a time that works well for him.)

tatarsky commented 9 years ago

I'm happy to do the phased approach, but if you're in a hurry, clearing all GPU jobs and doing it in one shot is probably quicker.

tatarsky commented 9 years ago

Also, if you want to CONFIRM that rmmod/modprobe is reasonable, we can do a few nodes now... just advise which direction you want to take.

jchodera commented 9 years ago

I'll turn this over to @danielparton to find the most expedient way forward.

danielparton commented 9 years ago

OK, I'll check with Theo. The rmmod/modprobe action did seem to resolve the problem previously, so I think at this point it makes sense to rmmod/modprobe all nodes - no need for further confirmation.

tatarsky commented 9 years ago

OK. Just for some confirmation, I've slapped a reservation on gpu-1-4 and gpu-1-5 after confirming no Theo jobs are on them... I'm reloading the nvidia driver and will slap a property on them to test, if you want.

jchodera commented 9 years ago

If Theo's jobs will be done shortly, you can just rmmod/modprobe everything. @danielparton will talk with him.

For now, it's best to remove all nodes from the gpu queue just so nothing further gets scheduled.

tatarsky commented 9 years ago

I don't believe there is a concept of "remove all nodes from the gpu queue". It's a queue based on the batch queue as far as I can see in the Torque config (neednodes=batch).

I'm adding reservations for the gpu resources not in use by Theo and confirming that, if I disable the gpu queue, it leaves his jobs alone.

As I'm not 100% sure, I'm erring on the side of extreme caution.

danielparton commented 9 years ago

Theo's not in - I'll email him. I believe the NIPS deadline is Fri Jun 5. If it's important for Theo to keep his jobs running until then, I suggest we wait until after that to complete the driver reloads.

The errors I am experiencing with my simulations are sufficiently rare that I should be able to proceed for now without too much trouble.

tatarsky commented 9 years ago

Well, I'm basically halfway to the "staged" method already and willing to do the little extra work to make things move along.

Do me a favor and see if you can request the property "gpureload" and have success with those reloaded nodes. If I could add gpu-1-9 back into the pool, that would be three of them (gpu-1-4 and gpu-1-5 are reloaded and online).

tatarsky commented 9 years ago

You might mention that I noted two of his jobs exited after hitting the walltime limit. Another is two hours away from the 72-hour limit.

danielparton commented 9 years ago

Sounds good - how do I request that property? -l procs=4,gpus=1:shared:gpureload?

You can add gpu-1-9 back into the queue, but maybe don't give it the gpureload property, since it has also been rebooted. Just in case.

tatarsky commented 9 years ago

It should be usable the same way as the gtxtitan/gtx680 properties, but the above didn't seem to work. So I'm checking.

jchodera commented 9 years ago

Try using multiple -l arguments:

-l procs=4,gpus=1:shared -l gpureload

I believe that's actually equivalent to

-l procs=4 -l gpus=1:shared -l gpureload

tatarsky commented 9 years ago

This worked for me, but I can't seem to get a "procs" variant to work. The attribute is set the same way we set the GPU variant attribute.

qsub -I -l nodes=1:gpus=1:shared:gpureload

danielparton commented 9 years ago

Yep, no problem - I switched to using the "nodes" argument. Seems to be working now. I submitted two 4-GPU jobs. One is running on gpu-1-4, one on gpu-1-5.

This is the argument I used:

-l nodes=1:ppn=4:gpus=4:gpureload

tatarsky commented 9 years ago

OK. Cool. Let's see how those do and then I can just roll down the line... how long will those take to confirm?

danielparton commented 9 years ago

Something like 10 hours. These particular errors are rarer now that the other errors are resolved.

tatarsky commented 9 years ago

OK. I can just roll down the line if you believe this will work and get you back in business. Leaving Theo's nodes alone until the end.

tatarsky commented 9 years ago

gpu-1-9 has been brought back online without the gpureload property, if you wish to test it. I've not reserved it.

Adaptive confirms that disabling the queue will kill Theo's jobs, so at this time the rolling reservation is the only method I have to accommodate this, due to the node "use batch" config (I cannot remove nodes from the gpu queue, just reserve their GPUs).

tatarsky commented 9 years ago

I estimate about 30 minutes to reload all currently reserved nodes with no GPU jobs... advise if you would like me to proceed.

tatarsky commented 9 years ago

Oh, and I guess Adaptive didn't understand my question earlier: it turns out I can disable the queue after all. But now that I've scripted the rolling method, I can still do it quicker ;)

danielparton commented 9 years ago

Sure, please go ahead and reload the nodes which are free of GPU jobs. Thanks!

tatarsky commented 9 years ago

Fortunately, I am doing a set of additional pre-checks, because there is a user on gpu-1-16 running GPU code via the batch queue. I'm skipping gpu-1-16, but is that normal?

tatarsky commented 9 years ago

Same issue/user on gpu-2-10... just FYI. The gpureload property is your friend going forward, and I'll have to watch these nodes for completion.

tatarsky commented 9 years ago

And lots more nodes in rack 2 had the same situation, so I've done all I could safely do. I assume 14 is better than zero. I can do a reload on gpu-1-9 if you'd like that one confirmed reloaded and thus usable via the property.

jchodera commented 9 years ago

It is totally allowable to run on GPUs through the batch queue. This is necessary to use more resources than are available through the gpu queue.

tatarsky commented 9 years ago

Okey doke! I'd never really noticed it before, but then again I've not really looked. I will take care of the non-reloaded nodes as I can, but there are now 14 with the "gpureload" property.

tatarsky commented 9 years ago

Got it up to 17 now, but the remaining ones won't be free for a while, judging by reservations and current wallclock. If you use the gpureload property, though, you at least have roughly a rack of GPUs to process your items. If you confirm you want gpu-1-9 reloaded, you'll have 18.

Is this something you expect to be a regular need? (If so, I'll work on scripting it more completely and may be able to automate the reload based on the idle condition.)

I will leave this open regardless until the rest are done.

jchodera commented 9 years ago

> Is this something you expect to be a regular need? (If so, I'll work on scripting it more completely and may be able to automate the reload based on the idle condition.)

Unfortunately, we don't know, but I bet that it couldn't hurt to write this script.

Thanks so much for your help here! This has been a tough issue to figure out!

tatarsky commented 9 years ago

No prob. We'll just cross that bridge if we come to it and use the somewhat jury-rigged method from this first attempt.

danielparton commented 9 years ago

Driver reloads seem to have been a complete success! Zero CUDA_ERROR_INVALID_DEVICE messages since the driver reloads on gpu-1-4 and gpu-1-5. Thanks for all the help with this!

tatarsky commented 9 years ago

Excellent! So we don't have to stop the car to fix the problem as it were ;)

I took care of the reloads on another batch this morning (you now have 23 nodes with the gpureload property) and am just watching the remaining ones, which continue to run GPU jobs.

I'll keep this open just to advise when it's done, and then I'll probably drop the property after a few days of confirmation that you are good.

tatarsky commented 9 years ago

Note to myself: a possible item for the node health script is a flag to detect a reservation of a certain name and to reload nvidia after checking it's not in use. Verify that it's not even possible to rmmod nvidia when the GPU is in use (appears to be the case, but confirm).
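
A rough sketch of what that health-script step could look like; the nvidia-smi query, module names, and overall flow here are assumptions for illustration, not the script that would actually be deployed:

```python
#!/usr/bin/env python
# Sketch: reload the nvidia driver only if no compute process holds a GPU.
# Module names (nvidia, nvidia_uvm) and the idle check are assumptions; the
# reservation-name detection mentioned above is left as a TODO.
import subprocess

def gpu_in_use():
    """Return True if nvidia-smi reports any compute process on this node."""
    out = subprocess.check_output(
        ['nvidia-smi', '--query-compute-apps=pid', '--format=csv,noheader'])
    return bool(out.strip())

def reload_nvidia_driver():
    """Unload and reload the nvidia kernel modules (requires root).
    rmmod should refuse to unload a module that is still in use, which
    acts as a second safety check behind the nvidia-smi query."""
    subprocess.check_call(['rmmod', 'nvidia_uvm'])
    subprocess.check_call(['rmmod', 'nvidia'])
    subprocess.check_call(['modprobe', 'nvidia'])
    subprocess.check_call(['modprobe', 'nvidia_uvm'])

if __name__ == '__main__':
    # TODO: also check for the agreed-upon reservation name before acting.
    if gpu_in_use():
        print('GPU in use; skipping driver reload')
    else:
        reload_nvidia_driver()
        print('nvidia driver reloaded')
```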