LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
MIT License

Error training moco-v3 #33

Closed: creatorcao closed this issue 3 months ago

creatorcao commented 3 months ago

Hi! Sorry to post this issue here, but the official moco-v3 repo has been archived and I really need help. I have been trying to train RCG from scratch, and I get the error below when training the moco-v3 encoder. Could you take a look and tell me what causes it? In the moco-v3 scripts, all data and the model are moved to CUDA, so is the error caused by an incorrect distributed launch?

Environment: torch 2.1.1+cu121, torchvision 0.16.1+cu121

torchrun --standalone --nproc_per_node=2 --nnodes=1 --node_rank=0 \
main_moco.py \
--arch=vit_base \
--batch-size=8 --workers=2 \
--optimizer=adamw --lr=1.5e-4 --weight-decay=0.1 \
--epochs=300 --warmup-epochs=40 \
--stop-grad-conv1 --moco-m-cos --moco-t=0.2 \
rcg/data/imagenet

The error is:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
=> creating model 'vit_base'=> creating model 'vit_base'

Traceback (most recent call last):
  File ".../moco-v3/main_moco.py", line 442, in <module>
    main()
  File ".../moco-v3/main_moco.py", line 156, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File ".../moco-v3/main_moco.py", line 309, in main_worker
    train(train_loader, model, optimizer, scaler, summary_writer, epoch, args)
  File ".../moco-v3/main_moco.py", line 357, in train
    loss = model(images[0], images[1], moco_m)
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 965, in forward
    output = self.module(*inputs, **kwargs)
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "...moco-v3/moco/builder.py", line 86, in forward
    q1 = self.predictor(self.base_encoder(x1))
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "../.conda/envs/moco/lib/python3.9/site-packages/timm/models/vision_transformer.py", line 307, in forward
    x = self.forward_features(x)
  File "../.conda/envs/moco/lib/python3.9/site-packages/timm/models/vision_transformer.py", line 292, in forward_features
    x = self.patch_embed(x)
  File "../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "../.conda/envs/moco/lib/python3.9/site-packages/timm/models/layers/patch_embed.py", line 34, in forward
    x = self.proj(x).flatten(2).transpose(1, 2)
  File "../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File ".../.conda/envs/moco/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
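
For anyone reading the last line of the traceback: the two type names in the message encode both the device and the dtype of each tensor. `torch.FloatTensor` is a float32 tensor on the CPU, while `torch.cuda.HalfTensor` is a float16 tensor on a GPU. The sketch below only decodes that message with the standard library; the helper names `decode_tensor_type` and `diagnose` are made up for illustration and are not part of PyTorch or moco-v3.

```python
import re

# Legacy tensor type names as they appear in PyTorch error messages,
# mapped to the dtype each one stands for.
LEGACY_DTYPES = {
    "FloatTensor": "float32",
    "HalfTensor": "float16",
    "DoubleTensor": "float64",
    "BFloat16Tensor": "bfloat16",
}


def decode_tensor_type(name):
    """Split a name such as 'torch.cuda.HalfTensor' into (device, dtype)."""
    parts = name.split(".")
    # A 'cuda' component before the class name means the tensor lives on a GPU.
    device = "cuda" if "cuda" in parts[:-1] else "cpu"
    dtype = LEGACY_DTYPES.get(parts[-1], parts[-1])
    return device, dtype


def diagnose(message):
    """Describe the input/weight mismatch named in the message (hypothetical helper)."""
    m = re.search(r"Input type \(([\w.]+)\) and weight type \(([\w.]+)\)", message)
    if m is None:
        return None  # not the kind of error this sketch understands
    in_dev, in_dtype = decode_tensor_type(m.group(1))
    w_dev, w_dtype = decode_tensor_type(m.group(2))
    return {
        "input": (in_dev, in_dtype),
        "weight": (w_dev, w_dtype),
        "device_mismatch": in_dev != w_dev,
        "dtype_mismatch": in_dtype != w_dtype,
    }


msg = ("RuntimeError: Input type (torch.FloatTensor) and weight type "
       "(torch.cuda.HalfTensor) should be the same")
print(diagnose(msg))
```

Read this way, the message says the images reached the first convolution while still on the CPU, whereas the model weights were on the GPU. The float16 weight type is probably a side effect of mixed precision rather than a separate problem: the `scaler` argument in the traceback suggests the script trains under autocast, and CUDA autocast would not cast a CPU input. If that reading is right, the thing to check is whether the batch is actually moved to the model's GPU before `model(images[0], images[1], moco_m)` when the script is launched with `torchrun`; that is an assumption from the traceback, not something confirmed in this thread.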