facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Using quick_simclr_2node.yaml raises: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 2 (while checking arguments for cudnn_convolution) #476

Open CauchyFood opened 2 years ago

CauchyFood commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

I am only using the quick_simclr_2node config and get the following error:

  File "/mnt/cache/user/miniconda/envs/ivssl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/cache/user/miniconda/envs/ivssl/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/mnt/cache/user/miniconda/envs/ivssl/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 0 does not equal 2 (while checking arguments for cudnn_convolution)

How can I resolve this issue?
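For context, this error generally means the input batch lives on a different CUDA device than the model's weights, which in distributed training usually points to a process not being pinned to its intended GPU. A minimal sketch of the usual device-alignment pattern (this is an illustration, not VISSL code; it runs on CPU so the shapes and model here are made up):

```python
import torch
import torch.nn as nn

# Toy convolution standing in for the model trunk (hypothetical sizes).
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# The device the model's weights actually live on.
weight_device = next(model.parameters()).device

# Moving the input onto the weights' device is what avoids the
# "input ... same device as ... weight" cudnn_convolution error.
batch = torch.randn(2, 3, 16, 16).to(weight_device)
out = model(batch)
print(tuple(out.shape))  # (2, 8, 16, 16)
```

In a correctly configured multi-GPU run, each process would call `torch.cuda.set_device(local_rank)` (or equivalent) so its model and data land on the same device.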

FYI, this is the command I ran:

python3 tools/run_distributed_engines.py \
    hydra.verbose=true \
    config.DATA.TRAIN.DATASET_NAMES=[imagenet1k_folder] \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATA_PATHS=["/mnt/cache/liwei1/data/imagenet-1k/train"] \
    config=test/integration_test/quick_simclr_2node \
    config.DISTRIBUTED.RUN_ID=my_ip:port \
    config.CHECKPOINT.DIR="./checkpoints" \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=10 \
    config.DATA.NUM_DATALOADER_WORKERS=1 \
    config.HOOKS.TENSORBOARD_SETUP.USE_TENSORBOARD=true
iseessel commented 2 years ago

@CauchyFood Can you please confirm that you are running this command on 2 nodes, each with 8 GPUs per node (16 GPUs in total)? This command will fail otherwise.

If you are indeed running on 2 nodes with 8 GPUs per node, could you please post the full log.txt along with your system information:

wget -nc -q https://github.com/facebookresearch/vissl/raw/main/vissl/utils/collect_env.py && python collect_env.py
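As a quick sanity check for the GPU count (not part of the original reply, just a suggestion), each node can report how many CUDA devices it actually sees:

```python
import torch

# The 2-node config expects 8 GPUs visible per node (16 total).
# On a CPU-only machine this prints 0.
print(torch.cuda.device_count())
```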