amazon-science / earth-forecasting-transformer

Official implementation of Earthformer

Installation without Apex #55

Closed Astralex98 closed 11 months ago

Astralex98 commented 12 months ago

Hello!

  1. Is it possible to run your code without installing Apex? I won't be using any distributed training anyway, so it seems I don't need Apex.

  2. I only have cuda-9.0 on my machine, and your code needs lightning >= 1.6.4, which, as far as I understand, is incompatible with pytorch==1.1.0 and torchvision==0.3.0, the stable versions for cuda-9.0. What should I do?

gaozhihan commented 12 months ago

Question 1

Sure. Taking N-body MNIST as an example, you can simply disable the usage of Apex by deleting the corresponding lines, i.e., line 26 and line 430 in train_cuboid_nbody.py. Then run the training command without the settings related to multi-GPU communication:

python train_cuboid_nbody.py --cfg cfg.yaml --ckpt_name last.ckpt --save tmp_nbody
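
For context, a common pattern for keeping such a script runnable without Apex is a guarded import with a plain PyTorch fallback. The snippet below is only a minimal sketch of that idea, not the actual contents of line 26 or line 430 in train_cuboid_nbody.py:

```python
# Minimal sketch (assumed pattern, not the repository's actual code):
# fall back to a standard PyTorch optimizer when Apex is not installed,
# which is sufficient for single-GPU, non-distributed training.
import torch
import torch.nn as nn

try:
    from apex.optimizers import FusedAdam as AdamImpl  # Apex fused optimizer, if available
except ImportError:
    AdamImpl = torch.optim.AdamW  # plain PyTorch fallback

model = nn.Linear(16, 4)  # stand-in for the actual Earthformer model
optimizer = AdamImpl(model.parameters(), lr=1e-3)
```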

Question 2

We haven't tested compatibility with lower versions of torch, pytorch_lightning, and cuda. My suggestion is to install cuda-11.6 or cuda-11.7 to ensure compatibility with the packages specified in this repository.
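
For example, with cuda-11.6 the matching wheels can typically be installed with (the versions here are the ones discussed in this thread, not an official requirement list):

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

pip install pytorch_lightning==1.6.4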

Astralex98 commented 12 months ago

Thanks for the quick response!

I deleted the corresponding lines and the Apex errors are gone. But when I run python train_cuboid_nbody.py --cfg cfg.yaml --ckpt_name last.ckpt --save tmp_nbody I get the following error: cudnn_status

But when I run the same command another time, I get a different error (at the same lines of code): cublas_status

Here (https://discuss.pytorch.org/t/runtimeerror-cuda-error-cublas-status-alloc-failed-when-calling-cublascreate-handle/78545/28) I found that reducing the batch size can help. I set total_batch_size = 1 and micro_batch_size = 1, but it didn't solve my problem.

One possible cause could be a mismatch between the versions of torch, torchvision, and cuda. As you suggested, I installed torch==1.12.1+cu116, torchvision==0.13.1+cu116, and pytorch_lightning==1.6.4. The nvidia-smi command shows CUDA Version: 11.0, but in the /usr/local/cuda folder I only have cuda-8.0 and cuda-9.0. Does that mean the code uses cuda-9.0 while the libraries are built for cuda-11.6?
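
For reference, a quick way to check which CUDA toolkit the installed PyTorch wheel was built against, independently of what sits under /usr/local (a small diagnostic sketch):

```python
# Small diagnostic sketch: report the CUDA/cuDNN versions bundled with the
# installed PyTorch wheel and whether a GPU is actually usable.
import torch

print("torch:", torch.__version__)             # e.g. 1.12.1+cu116
print("built with CUDA:", torch.version.cuda)  # toolkit the wheel was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```

Note that the pip wheels bundle their own CUDA runtime, so the toolkits under /usr/local are not what torch uses at run time; what has to be compatible with the cu116 build is the NVIDIA driver, i.e., the CUDA version reported by nvidia-smi.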

gaozhihan commented 11 months ago

It can be problematic if your Nvidia driver version and CUDA version are mismatched. A workaround to try Earthformer without setting up CUDA is to use our minimal test case.

Astralex98 commented 11 months ago

It seems my problem is solved now. I think the issue was an overloaded server, which caused the error.

Thanks again for the quick support!