TinkerTools / tinker9

Tinker9: Next Generation of Tinker with GPU Support

Dockerfile script #190

Closed BJWiley233 closed 2 years ago

BJWiley233 commented 2 years ago

Hey @zhi-wang, thanks for making the Dockerfile script. The Dockerfiles I generate for either the runtime or the compile target both have FROM tinkertools/tinker9-devel:cuda10.1-nvhpc20.9, but I don't see a tinkertools repo on Docker Hub, so Docker cannot find this image.

For instance:

% docker pull tinkertools/tinker9-devel:cuda10.1-nvhpc20.9
Error response from daemon: pull access denied for tinkertools/tinker9-devel, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
zhi-wang commented 2 years ago

Hi, here is an image (https://hub.docker.com/r/zhiwang/tinker9/tags) I built in early March when I worked on the Dockerfiles. I don't know if anyone has tried it. We are not experienced Docker users, so any suggestions and bug reports are greatly appreciated.

If you want to build your latest and greatest Tinker9, it will take a while; most of the time is spent downloading the NVHPC compilers. Generating the runtime image, compiling, and installing take much less time. For details, please find the python script tinker9.docker in the tinker9/docker directory and run tinker9.docker -h. Currently, two CUDA versions (10.2 and 11.2.2) are supported by this script.

As you noted, tinkertools has not been set up on Docker Hub yet; I need to ping Dr. Ponder @jayponder about this again. Until that repo is set up and the devel image is pushed to it, you'd have to build the image yourself with this python script. I think everything is automated now. BTW, the devel image we created for CUDA 11.2.2 is 20+ GB, so we are not even sure we will push the devel image in the future. Any suggestions on this?

BJWiley233 commented 2 years ago

Thanks for the response. RIS at WashU is still using the CUDA 10.1 drivers. When I run tinker9.docker for the runtime target with ./tinker9.docker.py 10.1 runtime (FYI, I had to edit the script because c += "string" fails if c is not initialized in my Python 3 version), I still get a Dockerfile with the Tinker devel image at the top:

FROM tinkertools/tinker9-devel:cuda10.1-nvhpc20.9 AS tool_image
FROM nvidia/cuda:10.1-runtime-ubuntu18.04

I will check out the Docker image you posted. FYI, I thought this was nice: this person's Docker Hub repo thanks you!! :) https://hub.docker.com/r/xiangzezeng/tinker9. Do you know them?

zhi-wang commented 2 years ago

Yes, Xiangze and I had a lot of discussions. And thanks for reporting the bug in the script.

As for the Dockerfile you generated: until the tinkertools repo is set up on Docker Hub, you'd (unfortunately) have to build the devel image locally (via ./tinker9.docker 10.1 devel). This is going to take some time because downloading the compilers takes a while. Let me compile a more recent Tinker9 image and push it to https://hub.docker.com/r/zhiwang/tinker9/tags for now.

BJWiley233 commented 2 years ago

Thanks :)

BJWiley233 commented 2 years ago

Hi @zhi-wang, I tested running dynamic9 on the Docker image zhiwang/tinker9:cuda10.1-20220407, and it works either when I have one GPU or when I have several and set CUDA_VISIBLE_DEVICES to just one device. I am assuming Tinker9 might not be able to run on multiple GPUs. I am running on V100s from RIS at WashU.

Quick note, since you probably also have people starting to use the Tinker9 User Guide at https://tinker9-manual.readthedocs.io/_/downloads/en/latest/pdf/: on page 60, under "6.6 Parallelization", there is a misspelling, CUDA-DEIVCE [integer] instead of CUDA-DEVICE [integer]. Just switch the "I" and the "V" :)

zhi-wang commented 2 years ago

It should be able to run on a multi-GPU machine. Each calculation can only utilize one GPU though.
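
In other words, the intended pattern is one process per GPU. A minimal sketch for illustration (not Tinker9's actual code):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    // Counts only the devices exposed through CUDA_VISIBLE_DEVICES.
    cudaGetDeviceCount(&n);
    printf("%d visible device(s); this process uses exactly one\n", n);
    // Bind this process (one calculation) to a single device.
    cudaSetDevice(0);
    return 0;
}

To use all four V100s, you would launch four independent calculations, each restricted to its own device via CUDA_VISIBLE_DEVICES or the CUDA-DEVICE keyword.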

BJWiley233 commented 2 years ago

Thanks,

So if CUDA_VISIBLE_DEVICES = 1,2,3,0 for my 4 V100 GPUs, what should I set the CUDA-DEVICE keyword to in my key file? If I use any of the following, or don't include the CUDA-DEVICE keyword in the key file at all:

CUDA-DEVICE 0
# or
CUDA-DEVICE 1
# or
CUDA-DEVICE 2

I get this error:

GPU Device :  Setting Device ID to 0 from CUDA-DEVICE keyword
 Backtrace
    1  /home/tinker9/bin/gpu-m/dynamic9                              0x67c0cf
    2  /home/tinker9/bin/gpu-m/dynamic9                              0x65ff40
    3  /home/tinker9/bin/gpu-m/dynamic9                              0x671c78
    4  /home/tinker9/bin/gpu-m/dynamic9                              0x410367
    5  /lib/x86_64-linux-gnu/libc.so.6                               __libc_start_main
    6  /home/tinker9/bin/gpu-m/dynamic9                              0x40f0ca
 Terminating with uncaught exception :  Errno 708 (cannot set while device is active in this process) at /home/tinker9/src/cudart/gpucard.cpp:282

But if I instead add CUDA-DEVICE 3 to the key file, it works. Sorry if this is an "amateur CUDA" question, as I do not know much about what the int parameter "flags" does in the call to cudaSetDeviceFlags. The docs use vague language, with plural "flags" and plural "parameters", so a single integer makes it sound confusing. Why can't we set all devices?
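
From reading the CUDA runtime docs, I think the plural "flags" just means the single integer is a bitmask of cudaDevice* constants OR'd together. A minimal sketch of my reading, not Tinker9's code:

#include <cuda_runtime.h>

int main() {
    // One integer carries several "flags": a scheduling policy
    // OR'd with optional features such as mapped host memory.
    unsigned int flags = cudaDeviceScheduleBlockingSync | cudaDeviceMapHost;
    cudaSetDeviceFlags(flags); // must run before the device becomes active
    return 0;
}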

BJWiley233 commented 2 years ago

Do you think it would be a good idea to add a print statement before the call to always_check_rt(cudaSetDeviceFlags(cuda_device_flags)); at line 282, like the one in the preceding function for "Setting Device ID"? Something like the following, between lines 281 and 282:

 print(stdout,
      "\n"
      " Setting CUDA Device Flags :  %d\n",
      cuda_device_flags);

Or is this redundant?

zhi-wang commented 2 years ago

Let me test it on one of our nodes to see if I can make it reappear. This error message is infamous (to me) -- I could never recreate it. Your report describes a new scenario in which Tinker9 is used, and I am interested in what I can find.

zhi-wang commented 2 years ago

This is interesting. Different versions of the Cuda runtime behave differently -- the crash doesn't appear in the other docker image with Cuda 11.2. This is probably why I was never able to recreate this issue on our nodes.
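
For the record, here is a minimal standalone repro of what I believe happens (a sketch, not Tinker9's actual code): once any runtime call has made the device active in the process, the Cuda 10.x runtime rejects a later cudaSetDeviceFlags with error 708 (cudaErrorSetOnActiveProcess), while the Cuda 11.2 image apparently tolerates it.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Any runtime call like this makes device 0 active in this process.
    cudaFree(0);

    // On a Cuda 10.x runtime this returns 708, cudaErrorSetOnActiveProcess
    // ("cannot set while device is active in this process").
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    printf("cudaSetDeviceFlags -> %d (%s)\n", (int)err, cudaGetErrorString(err));
    return 0;
}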

I changed the code and pushed the new docker images to https://hub.docker.com/r/tinkertools/tinker9/tags. Please give it a try and let me know if you find any new problems. Thank you.

P.S. Images on my personal Docker Hub repo will be removed.

BJWiley233 commented 2 years ago

Ah yes, that will make it much easier for researchers to find :) I will test and let you know. Thanks!

Eventually RIS is going to update to 11.2, hopefully soon.

BJWiley233 commented 2 years ago

So I am wondering if this new issue I'm seeing is entirely due to our LSF device setup; I get it to work on 2 GPUs and on 4 GPUs half the time :). With 2 GPUs and CUDA_VISIBLE_DEVICES = 1,2, or with 4 GPUs and CUDA_VISIBLE_DEVICES = 1,2,3,0, the new image now works fine without my having to do anything special with the CUDA-DEVICE keyword.

However, the other half of the time, with 2 GPUs and CUDA_VISIBLE_DEVICES = 2,0, or with 4 GPUs and CUDA_VISIBLE_DEVICES = 2,0,3,1, I get a funky merge_sort error from CUDA:

 Terminating with uncaught exception :  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

I think this is actually related to another issue I have, where half the time I can run non-bonded interactions with -nb gpu in GROMACS, and the other half of the time it thinks no GPU can be found despite the devices being listed.

zhi-wang commented 2 years ago

Many things could go wrong for you to see this error. The error message was thrown by a function of the thrust library, but I don't think that is where the crash happened. You saw this message only because there is an explicit check of the error number inside a thrust function call. If anything bad happened in other GPU calculations before this checkpoint, you wouldn't see it until this point, and only the last of the accumulated errors would be reported. It's not impossible to check the error code after every CUDA function call or kernel launch, but it isn't practical if we want the best performance.
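
To illustrate why (a sketch of that kind of exhaustive per-launch check, my example rather than Tinker9's actual code): the synchronization needed to surface a kernel error immediately stalls the GPU pipeline after every launch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check after every kernel launch: synchronize, then inspect the
// error state. Correct, but the sync serializes the pipeline, so a
// production code only checks at a few checkpoints instead.
#define CHECK_LAUNCH()                                          \
    do {                                                        \
        cudaError_t e_ = cudaDeviceSynchronize();               \
        if (e_ == cudaSuccess) e_ = cudaGetLastError();         \
        if (e_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",    \
                    (int)e_, cudaGetErrorString(e_),            \
                    __FILE__, __LINE__);                        \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)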

Not long ago I saw the exact same error message; that was because I had failed to allocate a block of GPU memory before it was used. It's hard for me to believe you have a similar situation here, because anything wrong in the code or in the parameters, etc., would crash the same way on the same machine. The fact that only certain GPU orderings consistently fail with the same error message makes me speculate that the issue is related to the machine setup.

I can't comment on what may go wrong with the machine setup, and I'm a little confused by your commands.

BJWiley233 commented 2 years ago

Ah OK. I thought Tinker might be able to split up the calculations, but that probably wouldn't make much sense unless you could assign certain parts of the simulation, such as the PME and/or non-bonded calculations, to different devices. Those CUDA_VISIBLE_DEVICES orderings were the defaults when I bsub into the GPU blade, so I didn't realize I could change or test them. Regardless, all the orderings work fine now, so running the jobs about 30 minutes after LSF does their updates at 2:00 AM could have been the issue. Thanks for the Docker image!