Longman-Stan closed this issue 2 years ago
Hi,
I’ve occasionally seen this error when the actual problem was the code running out of GPU memory. It is a bit odd to run out of memory with the smallest model, however. One thing to check is that you substituted the correct model size config when running the code, using small.yml instead of the 2-7B.yml used in the default command.
If that does not work, perhaps share the log before the prompting stage as well, including the container startup command.
-Vincent
I'm pretty sure I'm using the proper config:
sudo python deepy.py generate.py configs/text_generation.yml checkpoints/checkpoints-160M/configs/local_setup_orig.yml checkpoints/checkpoints-160M/configs/small.yml
It's really weird, because it works if you give it the context "for", but it fails if you add another character (fora).
For starting the container I use:
sudo nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PROJ_PATH/models/,dst=/gpt-neox/checkpoints vhellendoorn/code-lms-neox:base
The rest of the log is here:
https://pastebin.com/WaWdVT4f
Thanks for the additional details. That is quite surprising. Could you profile the GPU memory usage when prompting with just "for"? A command like watch -n1 nvidia-smi will refresh the memory readout every second, which should allow capturing the peak memory usage. A quick profiling of the same checkpoint on my end shows that memory usage never exceeds 1.7GB. If the same is true for you, and the GPU indeed has 6GB freely available (sometimes competing processes, including the OS itself, may be claiming a part of the memory), the problem may not be the memory usage.
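If watch is not available inside the container, nvidia-smi's own query mode can log memory usage once per second instead (these are standard nvidia-smi flags; the exact columns may vary slightly between driver versions):

nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1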
One potential lead that something else may be wrong is that the second log you shared ends with:
[2022-05-11 13:16:40,475] [WARNING] [engine.py:1519:load_checkpoint] Unable to find latest file at checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Unable to load checkpoint.
I downloaded the checkpoint file from Zenodo to double-check, and the checkpoints/latest file is included. It is probably worth double-checking whether that is true on your end as well; perhaps something went wrong with extracting the model archive.
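A quick way to check from the host, assuming the 160M archive was extracted to $PROJ_PATH/models/checkpoints-160M as your run command suggests (adjust the path if your layout differs):

ls -l $PROJ_PATH/models/checkpoints-160M
cat $PROJ_PATH/models/checkpoints-160M/latest   # should print a checkpoint tag, e.g. something like global_step150000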
While waiting for the prompt, nvidia-smi shows that 1.6GB is used, so it shouldn't be an OOM problem. With the "fora" prompt, it goes up to 1.7GB, which still shouldn't be a problem.
I've also tried the 0.4B checkpoint. It exhibits the same behavior (but occupies up to 2.4GB of VRAM).
The checkpoint structure looks like this. There is a latest file, which contains the name (global_step150000).
Thanks for adding further details; it sounds like it's not the memory usage. My main suspicion right now is that the video card is incompatible with the CUDA version used in the Docker image. A similar issue showed up here, though in the other direction (needing a higher version for an RTX 3090). That seems most likely because it is the only reason the image would fail to work on a different system. However, I could not find evidence that your GPU is no longer supported. Unfortunately, I do not have any older GPUs available to test this out. This might explain the issue you saw where one prompt worked and the other didn't: per the vocabulary, "for" is a single token whereas "fora" gets split up into two or more parts. The latter may cause a shape mismatch with an older version of the low-level kernel, for instance if the batch and sequence dimensions were swapped at some point. That is quite speculative, but seeing how the memory usage isn't the problem, a shape error seems most likely.
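As a rough illustration of that split (this uses the generic GPT-2 tokenizer from the Hugging Face transformers package as a stand-in; the model ships its own code vocabulary, so the exact pieces may differ):

python - <<'EOF'
# stand-in illustration with the GPT-2 BPE vocabulary (requires: pip install transformers);
# the code-specific vocabulary used by the model may split differently
from transformers import GPT2TokenizerFast
tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.tokenize("for"))   # a single token
print(tok.tokenize("fora"))  # typically two or more pieces
EOF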
Other related issues I found reference the need to upgrade the PyTorch version, but that seems less plausible since the use of a Docker image should rule out version issues. It may still be worth trying from inside the container, to be sure.
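Before upgrading anything, it may also be worth checking which CUDA architectures the container's PyTorch build was compiled for versus what the GPU reports (standard torch introspection calls; get_arch_list needs a reasonably recent PyTorch). A GTX 1060 should report compute capability (6, 1), i.e. sm_61:

python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())"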
I'm still a bit concerned about the training log claiming not to have found the latest checkpoint, given that it is present in your file system. It may be worth checking that the file system is as expected from inside the container.
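For example, from a shell inside the running container (the destination path comes from the --mount flag in your startup command):

ls -la /gpt-neox/checkpoints
cat /gpt-neox/checkpoints/latest   # should name the checkpoint folder; if this fails, the mount layout is not what the code expects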
And can I do anything about that? Because it doesn't seem like it. Just to add: before trying the dockerized version, I tried installing everything natively. I installed CUDA 11.3, torch built for CUDA 11.3, and all the dependencies with no problem. However, I was getting the same cuBLAS error. After seeing that it did not work, I tried the conda version of torch. I got to the point where it told me "no kernel image is available for execution on the device", so I hoped the Docker image would work.
That means a "python -m pip install torch -U " from the container, right? (PS: I've done this, it installed torch with cuda 10.2, but the problem persists)
Here is the structure from the container. Everything seems in place to me.
The closest alternative would be to build from source. It sounds like you may have tried that already? If you were getting kernel issues outside the image, that might be the reason it fails within it as well. While Docker offers complete OS isolation, the NVIDIA runtime specifically does interact with the host environment in ways that are beyond my expertise.
Upgrading PyTorch would take such a command, but please run it with sudo inside the container. All the packages are installed as root, so the upgrade will only take effect if run the same way.
The checkpoint structure shown inside the image is incorrect. The files in the checkpoints-160M folder should be mounted directly to /gpt-neox/checkpoints, without the 160M subfolder. So I suppose, concretely, in your run command, just append /checkpoints-160M to the src part of the --mount flag.
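For example, the startup command would then look something like this (assuming the 160M archive was extracted to $PROJ_PATH/models/checkpoints-160M):

sudo nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PROJ_PATH/models/checkpoints-160M,dst=/gpt-neox/checkpoints vhellendoorn/code-lms-neox:base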
Yes, building from source was the first thing I tried. Interestingly enough, a colleague of mine said he managed to run the model on a GTX 1060 by building from source, and in his case it really did work. I have built it from source too, though without setting the interactive option, so the prompt was empty.
What I think would help is if you could share the environment you're using; maybe we can try to match it exactly, CUDA versions and all.
And yes, the checkpoint structure was indeed incorrect. Now I don't get that error anymore: https://pastebin.com/A1ehH8ff.
Sadly, the error persists even with this. But reinstalling torch with CUDA 11.1 (the CUDA version of the Docker image):
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
seems to have solved the issue.
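In case it helps anyone else, a quick sanity check that the installed wheel matches the container's CUDA (standard PyTorch introspection, nothing repo-specific):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"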
I really hope we're able to fine-tune the model now, but I guess we can close the issue.
Thank you very much for all the help!!!
Very glad to hear the issue was resolved! It is quite interesting that the Docker-internal CUDA version did not match the one used for the PyTorch package, especially since this was not an issue on other machines. I suppose it could be due to the CUDA version running outside the container, but I won't speculate as to why.
-Vincent
Hello!
While trying to run PolyCoder in a dockerized setup, we bumped into the error: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
Could you help us to get over this problem?
This is odd, because I would have guessed the Docker setup would be error-free.
We're trying to run the 160M version on a Ubuntu 22.04 machine with a GTX 1060 6GB by using the provided command to start the container.
Here is the full log: https://pastebin.com/JzayrXUr
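To help narrow things down, a minimal check that exercises the same half-precision cuBLAS GEMM path outside the model might be useful (this is a generic PyTorch snippet, not something from the repo; if the kernel/driver combination is the problem, it may fail with the same error):

python - <<'EOF'
# minimal half-precision matrix multiply on the GPU, hitting the same GEMM path
import torch
a = torch.randn(64, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 64, device="cuda", dtype=torch.float16)
print((a @ b).float().sum())
EOF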