I cannot tell what the problem is from this error. Did you strictly follow all the steps in README.md, or did you make any modifications? Maybe try `CUDA_VISIBLE_DEVICES=0,1 python ...` to use just two GPUs.
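For reference, a quick way to confirm which GPUs the process actually sees after setting `CUDA_VISIBLE_DEVICES` (a minimal sketch; it assumes a TensorFlow 2.x environment, which may differ from the version this repo pins):

```python
# check_gpus.py -- hypothetical helper script, not part of this repo
import os
import tensorflow as tf

# CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the GPUs,
# e.g. CUDA_VISIBLE_DEVICES=0,1 python check_gpus.py
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow sees {len(gpus)} GPU(s):")
for gpu in gpus:
    print(" ", gpu.name)
```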
My submission command was: `CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img_dir /sc/arion/projects/tauomics/NFT_GAN/tiles --experiment_name NFT2`
Are the other steps the same as in README.md, including the environment?
Yes, I set up the environment according to README.md, but I am using my own dataset. `--img_dir` points to a directory of more than 500,000 256x256 .png files.
OK, I will check the code. In the meantime, I suggest running with the dataset used by this repository to check whether the error only happens with your own data. You could also try `CUDA_VISIBLE_DEVICES=0,1` to use fewer GPUs; using all GPUs on a machine can sometimes cause an error (I don't know why).
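If it is easier to change the script than the launch command, an alternative to the environment variable is to restrict the visible devices inside the code (again a sketch assuming TensorFlow 2.x; `tf.config.set_visible_devices` must run before any GPU has been initialized):

```python
import tensorflow as tf

# Restrict TensorFlow to the first two GPUs; this must run before any op
# touches the GPUs, e.g. at the very top of train.py.
gpus = tf.config.list_physical_devices("GPU")
if len(gpus) > 2:
    tf.config.set_visible_devices(gpus[:2], "GPU")
    print("Using GPUs:", [g.name for g in gpus[:2]])
```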
The problem was fixed by downgrading my CUDA version from 11.1 to 7.0.28.
Appreciate the help!
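For anyone hitting the same error, one way to see which CUDA/cuDNN versions the installed TensorFlow wheel was built against before matching the system toolkit to it (a minimal sketch assuming TensorFlow 2.3+; the available keys vary by version):

```python
import tensorflow as tf

build = tf.sysconfig.get_build_info()  # dict of build-time configuration
print("TensorFlow:", tf.__version__)
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))
```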
Hello, exciting project. Could you please share the minimum resource requirements to train this model? I am getting memory errors training on 500,000 256x256 images with four 40 GB A100 GPUs. My CPU memory is 128 GB.
```
I tensorflow/stream_executor/stream.cc:1990] [stream=0x5570d9fd4600,impl=0x5570d9fd2050] did not wait for [stream=0x5570d9fd40a0,impl=0x5570d9fd2940]
2021-05-26 01:09:18.033272: I tensorflow/stream_executor/stream.cc:4925] [stream=0x5570d9fd4600,impl=0x5570d9fd2050] did not memcpy device-to-host; source: 0x2ac3e5d41a00
2021-05-26 01:09:18.033313: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
/hpc/users/marxg01/.lsbatch/1622001386.34151325.shell: line 23: 389547 Aborted (core dumped)
```
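Not a fix for the Memcpy abort itself, but a common first step when GPU memory errors appear is to enable memory growth so TensorFlow allocates GPU memory on demand instead of reserving it all at startup (a sketch assuming TensorFlow 2.x; whether it helps here depends on how this repo manages memory):

```python
import tensorflow as tf

# Enable on-demand GPU memory allocation; must run before any GPU is initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```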