Open yotaroshimose opened 2 years ago
@yotaroshimose
Bro, my English is not very good, so I may have misunderstood your question. Here is my suggestion:
you can add CUDA_VISIBLE_DEVICES=0,1
(I have 4 GPUs but only use the first and the second)
in front of the line -> python ./scripts/validate.py \
in the file named validate.sh or train.sh.
Hope this helps.
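To illustrate the suggestion above: the variable only has to be set in the environment of the process that runs the script. Here is a minimal Python sketch of the same idea. `run_with_gpus` is a hypothetical helper name (not part of the repo), and the child command just prints what it sees instead of running the real validate.py.

```python
import os
import subprocess
import sys


def run_with_gpus(cmd, gpu_ids):
    """Run cmd with CUDA_VISIBLE_DEVICES restricted to gpu_ids.

    Hypothetical helper for illustration; it does the same thing as
    writing CUDA_VISIBLE_DEVICES=0,1 in front of a command in a shell script.
    """
    env = os.environ.copy()  # keep the rest of the environment unchanged
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process: it only reports the value it received.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_gpus(child, [0, 1])
seen_by_child = result.stdout.strip()
print(seen_by_child)
```

Inside the child process, CUDA renumbers the listed devices starting from 0, so a script that asks for "GPU 0" gets the first entry of the list.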
Thank you for your kind reply. I will try your way of controlling the devices. Thank you.
Hi, thank you for sharing your great work!
I tried to run your training scripts, but my machine only has 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original yaml file.
Then it fails with "Timed out initializing process group in store based barrier on rank: 2", or sometimes it suddenly crashes in the middle of an epoch and my Docker container shuts down (possibly indicating a memory leak?).
Any advice on running your training code successfully?
Thank you for your cooperation.
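
One thing that may be worth checking for the barrier timeout: that kind of error can show up when fewer processes join the group than the configured world size. Below is a rough sanity check, assuming a single machine with one process per GPU. `visible_gpu_count` and `check_world_size` are hypothetical names, not from this repo, and the check only reads CUDA_VISIBLE_DEVICES rather than querying the driver.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, because then the count
    cannot be determined from the environment alone.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([part for part in raw.split(",") if part.strip()])


def check_world_size(world_size, env=None):
    """Raise ValueError if world_size exceeds the visible GPUs.

    Assumes one process per GPU on a single machine (an assumption,
    not something confirmed by the training scripts).
    """
    count = visible_gpu_count(env)
    if count is not None and world_size > count:
        raise ValueError(
            f"world size {world_size} is larger than the {count} visible GPU(s)"
        )


# Example: a 4-GPU machine with the world size set to 4 passes the check.
check_world_size(4, {"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
```

If the check passes and the timeout still appears, the cause is probably somewhere else, for example the number of processes the launcher actually starts.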