Closed Astralex98 closed 11 months ago
Sure. Taking N-body MNIST as an example, you can simply disable the usage of Apex by deleting the corresponding lines, i.e., line 26 and line 430 in train_cuboid_nbody.py. Then run the training command without the settings related to multi-GPU communication:
python train_cuboid_nbody.py --cfg cfg.yaml --ckpt_name last.ckpt --save tmp_nbody
We haven't tested the compatibility with lower versions of torch, pytorch_lightning, and cuda. My suggestion is to install cuda-11.6 or cuda-11.7, ensuring compatibility with the packages specified in this repository.
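As an alternative to deleting the Apex lines by hand, the import can be made optional. The sketch below is only an illustration of that pattern: `HAS_APEX`, `module_available`, `choose_strategy`, and the returned labels are hypothetical names, not identifiers from train_cuboid_nbody.py, and the actual contents of lines 26 and 430 are not reproduced here.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag; the training script itself does not define it.
HAS_APEX = module_available("apex")


def choose_strategy(num_gpus: int, has_apex: bool = HAS_APEX) -> str:
    """Return an illustrative label for how training would be launched.

    The labels are placeholders for this sketch, not values read from
    the repository.
    """
    if num_gpus <= 1:
        # Single-GPU or CPU runs need no multi-GPU communication at all.
        return "single_device"
    # Only fall back on Apex-based communication when Apex is installed.
    return "apex_ddp" if has_apex else "ddp"


if __name__ == "__main__":
    print("apex installed:", HAS_APEX)
    print("strategy for 1 GPU:", choose_strategy(1))
```

With a guard like this, a single-GPU run never touches Apex, so the package does not need to be installed.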
Thanks for the quick response!
I deleted the corresponding lines and the Apex errors are gone. But when I run python train_cuboid_nbody.py --cfg cfg.yaml --ckpt_name last.ckpt --save tmp_nbody
I got the following error:
But when I run the same command again, I get a different error (on the same lines of code):
Here (https://discuss.pytorch.org/t/runtimeerror-cuda-error-cublas-status-alloc-failed-when-calling-cublascreate-handle/78545/28) I found that reducing batch sizes can help. I set total_batch_size = 1 and micro_batch_size = 1 but it didn't solve my problem.
One possible problem is a mismatch between the versions of torch, torchvision and cuda. As you suggested, I installed torch==1.12.1+cu116 torchvision==0.13.1+cu116 and pytorch_lightning==1.6.4. The nvidia-smi command shows CUDA Version: 11.0, but in the /usr/local/cuda folder I have only cuda-8.0 and cuda-9.0. Does this mean that the code uses cuda-9.0 while the libraries are built for cuda-11.6?
It can be problematic if your Nvidia driver version and CUDA version are mismatched. A workaround to try Earthformer without setting up CUDA is to use our minimal test case.
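To make the version question concrete: the "CUDA Version" shown by nvidia-smi is the highest CUDA version the installed driver supports, while pip wheels tagged +cu116 ship their own CUDA runtime libraries, so the toolkits under /usr/local/cuda mainly matter when compiling extensions. The sketch below only extracts and compares the two numbers from text you paste in. The function names are hypothetical, and a "driver older" result is a hint to look closer, not a definitive verdict on compatibility.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def driver_cuda_version(nvidia_smi_text: str) -> Optional[Version]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def wheel_cuda_version(torch_version: str) -> Optional[Version]:
    """Extract (major, minor) from a wheel tag such as '1.12.1+cu116'."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None  # e.g. a CPU-only build or an untagged version string
    digits = match.group(1)
    # The last digit is the minor version: 'cu116' -> (11, 6), 'cu102' -> (10, 2).
    return int(digits[:-1]), int(digits[-1])


def driver_older_than_wheel(nvidia_smi_text: str, torch_version: str) -> Optional[bool]:
    """Return True if the driver's supported CUDA version is below the wheel's.

    Returns None when either version cannot be parsed.
    """
    driver = driver_cuda_version(nvidia_smi_text)
    wheel = wheel_cuda_version(torch_version)
    if driver is None or wheel is None:
        return None
    return driver < wheel


if __name__ == "__main__":
    # Values taken from this thread.
    print(driver_older_than_wheel("CUDA Version: 11.0", "1.12.1+cu116"))
```

The version string to pass in is what `torch.__version__` prints in your environment.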
It seems my problem is solved now. I think the error was caused by an overloaded server.
Thanks again for the quick support!
Hello!
Is it possible to run your code without installing apex? I won't use any distributed training anyway, so it seems that I don't need apex.
I have only cuda-9.0 version on my machine and your code needs lightning >= 1.6.4 which is, as far as I understand, incompatible with pytorch==1.1.0 and torchvision==0.3.0 which are stable versions for cuda-9.0. What should I do?