TengdaHan / CoCLR

[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.
Apache License 2.0

Is it possible to train main_coclr.py using a single GPU? #16

Closed. junmin98 closed this issue 3 years ago

junmin98 commented 3 years ago

I have only one GPU.

I wanted to train the model, so I ran the following command in the terminal: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 main_coclr.py

But I got an error: subprocess.CalledProcessError: Command '['/home/junmin/anaconda3/envs/python36/bin/python', '-u', 'main_coclr.py', '--local_rank=0']' returned non-zero exit status 1.

Is there any way to train with a single GPU?

TengdaHan commented 3 years ago

Shuffle BatchNorm (https://github.com/TengdaHan/CoCLR/blob/main/model/pretrain.py#L98) requires more than one GPU. We use the code from MoCo (https://github.com/facebookresearch/moco) for this part. You can also check the MoCo paper to see how Shuffle BatchNorm is done across multiple GPUs (https://arxiv.org/pdf/1911.05722.pdf, Section 3.3 "Shuffling BN").
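
For readers unfamiliar with the idea, here is a minimal single-process sketch of the batch shuffle described in that section. It is not the code from this repo or from MoCo: the function names `shuffle_across_gpus` and `unshuffle` are hypothetical, the "GPUs" are just list chunks, and the batch size is assumed to be divisible by the number of GPUs. It only illustrates that samples are permuted across devices before the key encoder and restored afterwards, and that with a single device the shuffle cannot change which samples share BatchNorm statistics.

```python
import random


def shuffle_across_gpus(batch, num_gpus, seed=0):
    """Permute the global batch, then split it into per-GPU chunks.

    Assumes len(batch) is divisible by num_gpus (an assumption of this sketch).
    """
    rng = random.Random(seed)
    perm = list(range(len(batch)))
    rng.shuffle(perm)
    shuffled = [batch[i] for i in perm]  # shuffled[pos] == batch[perm[pos]]
    chunk = len(batch) // num_gpus
    per_gpu = [shuffled[g * chunk:(g + 1) * chunk] for g in range(num_gpus)]
    return per_gpu, perm


def unshuffle(per_gpu, perm):
    """Gather the per-GPU chunks and put every sample back in its original slot."""
    flat = [x for chunk in per_gpu for x in chunk]
    restored = [None] * len(flat)
    for pos, src in enumerate(perm):
        restored[src] = flat[pos]
    return restored


batch = list(range(8))  # toy sample ids standing in for video clips

# Two simulated GPUs: each chunk mixes samples from anywhere in the batch.
per_gpu_2, perm_2 = shuffle_across_gpus(batch, num_gpus=2)
restored_2 = unshuffle(per_gpu_2, perm_2)

# One simulated GPU: the single chunk still holds the whole batch, so the
# set of samples sharing BatchNorm statistics is unchanged by the shuffle.
per_gpu_1, perm_1 = shuffle_across_gpus(batch, num_gpus=1)
```

With `num_gpus=1` the only chunk contains exactly the same samples as the unshuffled batch, which is why the trick needs at least two devices to have any effect.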

junmin98 commented 3 years ago

Thank you. I then used 2 GPUs, and the following error appeared. (My desktop has only one GPU, so I'm running this on the school server.)

subprocess.CalledProcessError: Command '['/home/name/anaconda3/envs/pytorch/bin/python', '-u', 'main_coclr.py', '--local_rank=1']' died with <Signals.SIGBUS: 7>.

I searched for this error, and the advice I found was to reduce the number of workers, so I ran:

CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 main_coclr.py --workers 2

I got a similar error: subprocess.CalledProcessError: Command '['/home/name/anaconda3/envs/pytorch/bin/python', '-u', 'main_coclr.py', '--local_rank=1', '--workers', '2']' died with <Signals.SIGBUS: 7>.

Do you happen to know what's wrong?

junmin98 commented 3 years ago

I solved this problem!!

TengdaHan commented 3 years ago

Glad to hear it! Can you share what the cause was, in case other people run into the same issue?

junmin98 commented 3 years ago

Sure! But what I did may not be exactly the right solution.

An explanation of what causes SIGBUS can be found here: https://www.geeksforgeeks.org/segmentation-fault-sigsegv-vs-bus-error-sigbus/ (As I understand it, this signal is raised when a process makes an invalid memory access, in this case while the code was running on the server.)

Going through the code line by line, I found that the error occurred while loading the dataset. The flow dataset you shared loads fine, but loading my frame dataset triggers the error.

So I rebuilt the frame dataset: I extracted the frames again and converted them to LMDB format. That solved the problem.
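
For anyone debugging a similar crash, one possible way to locate the failing read without stepping through the code by hand is sketched below. It is only an illustration under some assumptions: `scan_dataset` is a hypothetical helper, not part of this repo, and `dataset` is assumed to be any object supporting `len()` and integer indexing (as the dataset classes used with a PyTorch DataLoader typically do). A fatal signal such as SIGBUS cannot be caught with `try`/`except`, but Python's standard `faulthandler` module can print the Python traceback when it arrives, and reading samples in the main process (no worker processes) keeps that traceback readable.

```python
import faulthandler
import sys


def scan_dataset(dataset, log_every=1000):
    """Read every sample in the main process and return how many were read.

    `dataset` is assumed to support len() and integer indexing.
    """
    # On a fatal signal (SIGSEGV, SIGBUS, ...) faulthandler dumps the Python
    # traceback to the original stderr before the process dies.
    faulthandler.enable(file=sys.__stderr__)
    count = 0
    for index in range(len(dataset)):
        _ = dataset[index]  # a corrupted sample would crash here
        count += 1
        if count % log_every == 0:
            # The last count printed narrows down where the bad sample is.
            print(f"read {count} samples", file=sys.__stderr__, flush=True)
    return count


# Toy stand-in for a frame dataset, only to show the call.
toy_dataset = [("frame_%05d" % i, i) for i in range(10)]
num_read = scan_dataset(toy_dataset, log_every=5)
```

If the scan crashes on the frame dataset but completes on the flow dataset, that points to the frame data itself, which matches the fix described above.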