Hi, here is an image (https://hub.docker.com/r/zhiwang/tinker9/tags) I built in early March when I was working on the Dockerfiles. I don't know if anyone has tried it. We are not experienced Docker users. Any suggestions and bug reports are greatly appreciated.
If you want to build your latest and greatest Tinker9, it will take a while. Most of the time will be spent downloading the NVHPC compilers; generating the runtime image, compiling, and installing take much less time. For details, see the Python script tinker9.docker in the tinker9/docker directory and run tinker9.docker -h.
Currently, two CUDA versions (10.2 and 11.2.2) are supported by this script.
As you noted, tinkertools has not been set up on Docker Hub yet. I need to ping Dr. Ponder @jayponder on this issue again. So before that repo is set up and the devel image is pushed to it, you'd have to build the image yourself with this Python script. I think everything is automated now. BTW, the devel image we created for CUDA 11.2.2 is 20+ GB. We are not even sure whether we will push the devel image to the repo in the future. Any suggestions on this?
Thanks for the response. RIS at WashU is still using CUDA 10.1 drivers. When I run tinker9.docker for runtime using ./tinker9.docker.py 10.1 runtime (FYI, I had to edit the tinker9.docker script because c += "string" doesn't work if c is not initialized in my Python 3 version; initializing it first, e.g. c = "", fixes it), I still get a Dockerfile with the tinker devel image at the top:
FROM tinkertools/tinker9-devel:cuda10.1-nvhpc20.9 AS tool_image
FROM nvidia/cuda:10.1-runtime-ubuntu18.04
I will check out the docker image you posted. FYI, I thought this was nice: this person's Docker Hub repo thanked you!! :) https://hub.docker.com/r/xiangzezeng/tinker9. Do you know them?
Yes, Xiangze and I had a lot of discussions. And thanks for reporting the bug in the script.
As for the script you generated: before the tinkertools repo is set up on Docker Hub, you'd (unfortunately) have to build the devel image locally (with ./tinker9.docker 10.1 devel). This is going to take some time because downloading the compilers takes a while. Let me compile a more recent Tinker9 image and push it to https://hub.docker.com/r/zhiwang/tinker9/tags for now.
Thanks :)
Hi @zhi-wang, I tested running dynamic9 on the docker image zhiwang/tinker9:cuda10.1-20220407 and it works either when I have 1 GPU, or when I have more than one and set CUDA_VISIBLE_DEVICES to just 1 device. I am assuming that tinker9 might not be able to run on multiple GPUs. I am running on V100s from RIS at WashU.
Quick note, as you probably also have people starting to use the Tinker9 User Guide at https://tinker9-manual.readthedocs.io/_/downloads/en/latest/pdf/. On page 60, under "6.6 Parallelization", there is a misspelling: CUDA-DEIVCE [integer] instead of CUDA-DEVICE [integer]. Just switch the "I" and "V" :)
It should be able to run on a multi-GPU machine. Each calculation can only utilize one GPU though.
Thanks,
So if CUDA_VISIBLE_DEVICES = 1,2,3,0 for my 4 V100 GPUs, what should I place for the CUDA-DEVICE keyword in my key file? If I use any of the following, or don't include the CUDA-DEVICE keyword in the key file at all:
CUDA-DEVICE 0
# or
CUDA-DEVICE 1
# or
CUDA-DEVICE 2
I get this error:
GPU Device : Setting Device ID to 0 from CUDA-DEVICE keyword
Backtrace
1 /home/tinker9/bin/gpu-m/dynamic9 0x67c0cf
2 /home/tinker9/bin/gpu-m/dynamic9 0x65ff40
3 /home/tinker9/bin/gpu-m/dynamic9 0x671c78
4 /home/tinker9/bin/gpu-m/dynamic9 0x410367
5 /lib/x86_64-linux-gnu/libc.so.6 __libc_start_main
6 /home/tinker9/bin/gpu-m/dynamic9 0x40f0ca
Terminating with uncaught exception : Errno 708 (cannot set while device is active in this process) at /home/tinker9/src/cudart/gpucard.cpp:282
But if I instead add CUDA-DEVICE 3 to the key file, it works. Sorry if this is an "amateur CUDA" question, as I do not know much about what the int parameter "flags" is doing w.r.t. the call to cudaSetDeviceFlags. The docs use vague language, like the plural "flags" and "parameters", so a single integer sounds confusing. Why can't we set all devices?
Do you think it would be a good idea to add a print statement before the call to always_check_rt(cudaSetDeviceFlags(cuda_device_flags)); at line 282, like the one in the function before it for "Setting Device ID"? Something like this between lines 281 and 282:
print(stdout,
"\n"
" Setting CUDA Device Flags : %d\n",
cuda_device_flags);
Or is this redundant?
Let me test it on one of our nodes to see if I can reproduce it. This error message is infamous (to me) -- I could never recreate it. Your report describes a new scenario in which Tinker9 is used, and I am interested in what I can find.
This is interesting. Different versions of the CUDA runtime have different behaviors -- the crash doesn't appear in the other docker image with CUDA 11.2. This is probably why I was never able to recreate this issue on our nodes.
I changed the code and pushed the new docker images to https://hub.docker.com/r/tinkertools/tinker9/tags. Please give it a try and let me know if you find any new problems. Thank you.
P.S. Images on my personal Docker Hub repo will be removed.
Ah yes, that will make it much easier for researchers to find :) I will test and let you know. Thanks!
Eventually RIS is going to update to CUDA 11.2, hopefully soon.
So I am wondering if this new issue I get is 100% our LSF device setup. I get it to work on 2 GPUs and 4 GPUs half the time :). With 2 GPUs if CUDA_VISIBLE_DEVICES = 1,2, and with 4 GPUs if CUDA_VISIBLE_DEVICES = 1,2,3,0, the new image now works fine without having to do anything special with CUDA-DEVICE.
However, every other time, with 2 GPUs if CUDA_VISIBLE_DEVICES = 2,0, and with 4 GPUs if CUDA_VISIBLE_DEVICES = 2,0,3,1, I'll get a funky merge_sort error from CUDA:
Terminating with uncaught exception : merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
I think this is actually related to my other issue where half the time I can get non-bonded interactions (-nb gpu) for GROMACS, and the other half of the time it thinks no GPU can be found despite devices being listed.
Many things can be wrong for you to see this error. The error message was thrown by a function of the thrust library, but I don't think that is where the crash happened. Actually, you saw this message only because there is an explicit check of the error number inside a thrust function call. If anything bad happened in other GPU calculations before this checkpoint, you wouldn't see it until this point, and only the last of the accumulated errors would be reported. It's not impossible to check the error code for every CUDA function call or kernel launch, but it isn't practical if we want the best performance.
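To illustrate the trade-off, a rough sketch of what per-call checking looks like (hypothetical; Tinker9's actual always_check_rt macro may differ). The expensive part is that a kernel launch reports errors asynchronously, so surfacing them at the launch site requires a device synchronization after every kernel:

// checked.cu -- hypothetical per-call error checking
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_RT(call)                                             \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",  \
                         (int)err_, cudaGetErrorString(err_),      \
                         __FILE__, __LINE__);                      \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Usage:
//   CHECK_RT(cudaMalloc(&ptr, nbytes));
//   kernel<<<grid, block>>>(args);
//   CHECK_RT(cudaGetLastError());       // launch-time errors
//   CHECK_RT(cudaDeviceSynchronize());  // execution-time errors -- this
//                                       // sync after every launch is what
//                                       // costs the performance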
Not long ago I saw this exact error message. That was because I had failed to allocate a block of GPU memory before it was used. It's hard for me to believe that you have a similar situation here, because anything wrong in the code or in the parameters would show similar crashing behavior on the same machine every time. The fact that only certain GPU cards consistently fail with the same error message makes me speculate that the issue is likely related to the machine setup.
I can't comment on what may go wrong with the machine setup, and I'm a little confused by your commands: CUDA_VISIBLE_DEVICES=1,2,3,0 and CUDA_VISIBLE_DEVICES=2,0,3,1? Why not 0,1,2,3? If you expose only one card via CUDA_VISIBLE_DEVICES for the calculation, I believe you don't need to worry about the CUDA-DEVICE [INTEGER] keyword. Are you able to run 4 different jobs on 4 different cards independently on the same node?
Ah ok, so I thought Tinker might be able to split up the calculations, but that probably wouldn't make much sense unless you could allocate certain parts of the simulation, such as PME and/or non-bonded calculations, to separate GPUs. Those orders for CUDA_VISIBLE_DEVICES were the defaults when I bsub into the GPU blade, so I didn't realize I could change them/test changing them. Regardless, all the orders work fine now, so running them roughly 30 minutes after LSF does their updates at 2:00 AM could have been the issue. Thanks for the Docker image!
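For anyone else confused by the orderings: a minimal sketch using only the standard CUDA runtime API (not Tinker9 code) showing that CUDA_VISIBLE_DEVICES remaps device ordinals, so with CUDA_VISIBLE_DEVICES=2,0 the process sees two devices and its device 0 is physical GPU 2:

// visible.cu -- list the devices this process can see; the ordinals are
// remapped by CUDA_VISIBLE_DEVICES, so "device 0" is whichever physical
// GPU is listed first in that variable.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::fprintf(stderr, "no usable CUDA devices\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("visible device %d: %s (PCI bus %d)\n",
                    i, prop.name, prop.pciBusID);
    }
    return 0;
}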
Hey @zhi-wang, thanks for making the Dockerfile script. The Dockerfiles that I get from runtime or compile both have
FROM tinkertools/tinker9-devel:cuda10.1-nvhpc20.9
but I am not seeing your tinkertools repo on Docker Hub, so docker cannot find this image. For instance: