I cannot tell what the problem is from this error. Did you strictly follow all the steps in README.md, or did you make any modifications? Maybe try `CUDA_VISIBLE_DEVICES=0,1 python ...` to use just two GPUs.
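For reference, a quick way to confirm which GPUs the process actually sees after setting `CUDA_VISIBLE_DEVICES` (a minimal sketch; it assumes a TensorFlow 2.x environment, which may differ from the version this repo pins):

```python
# check_gpus.py -- hypothetical helper script, not part of this repo
import os
import tensorflow as tf

# CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the GPUs,
# e.g. CUDA_VISIBLE_DEVICES=0,1 python check_gpus.py
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow sees {len(gpus)} GPU(s):")
for gpu in gpus:
    print(" ", gpu.name)
```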
My submission command was: `CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --img_dir /sc/arion/projects/tauomics/NFT_GAN/tiles --experiment_name NFT2`
Are the other steps the same as in README.md, including the environment?
Yes, I set up the environment according to README.md, but I am using my own dataset. `--img_dir` points to a directory of more than 500,000 256x256 .png files.
OK, I will check the code. In the meantime, I suggest running with the dataset used by this repository to check whether the error only happens with your own data. You could also try `CUDA_VISIBLE_DEVICES=0,1` to use fewer GPUs; using all GPUs on a machine can sometimes cause an error (I don't know why).
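If it is easier to change the script than the launch command, an alternative to the environment variable is to restrict the visible devices inside the code (again a sketch assuming TensorFlow 2.x; `tf.config.set_visible_devices` must run before any GPU has been initialized):

```python
import tensorflow as tf

# Restrict TensorFlow to the first two GPUs; this must run before any op
# touches the GPUs, e.g. at the very top of train.py.
gpus = tf.config.list_physical_devices("GPU")
if len(gpus) > 2:
    tf.config.set_visible_devices(gpus[:2], "GPU")
    print("Using GPUs:", [g.name for g in gpus[:2]])
```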
The problem was fixed by downgrading my CUDA version from 11.1 to 7.0.28.
Appreciate the help!
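For anyone hitting the same error, one way to see which CUDA/cuDNN versions the installed TensorFlow wheel was built against before matching the system toolkit to it (a minimal sketch assuming TensorFlow 2.3+; the available keys vary by version):

```python
import tensorflow as tf

build = tf.sysconfig.get_build_info()  # dict of build-time configuration
print("TensorFlow:", tf.__version__)
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))
```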
Hello, exciting project. Could you please share the minimum resource requirements to train this model? I am getting memory errors training on 500,000 256x256 images with four 40 GB A100 GPUs. My CPU memory is 128 GB.
```
I tensorflow/stream_executor/stream.cc:1990] [stream=0x5570d9fd4600,impl=0x5570d9fd2050] did not wait for [stream=0x5570d9fd40a0,impl=0x5570d9fd2940]
2021-05-26 01:09:18.033272: I tensorflow/stream_executor/stream.cc:4925] [stream=0x5570d9fd4600,impl=0x5570d9fd2050] did not memcpy device-to-host; source: 0x2ac3e5d41a00
2021-05-26 01:09:18.033313: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
/hpc/users/marxg01/.lsbatch/1622001386.34151325.shell: line 23: 389547 Aborted (core dumped)
```
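Not a fix for the Memcpy abort itself, but a common first step when GPU memory errors appear is to enable memory growth so TensorFlow allocates GPU memory on demand instead of reserving it all at startup (a sketch assuming TensorFlow 2.x; whether it helps here depends on how this repo manages memory):

```python
import tensorflow as tf

# Enable on-demand GPU memory allocation; must run before any GPU is initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```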