ViTAE-Transformer / ViTDet

Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object Detection"
Apache License 2.0

Getting RuntimeError #3

Closed · snailrowen1337 closed this issue 2 years ago

snailrowen1337 commented 2 years ago

I am trying to use the script as follows:

python tools/train.py configs/ViTDet/ViTDet-ViT-Base-100e.py

This, however, crashes with:

Traceback (most recent call last):
  File "tools/train.py", line 189, in <module>
    main()
  File "tools/train.py", line 185, in main
    meta=meta)
  File "/home/ViTDet/mmdet/apis/train.py", line 180, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/.local/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/.local/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/.local/lib/python3.6/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/.local/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 256, 16, 16]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly 
detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Are there any other command line parameters that need to be set? Or am I using the script incorrectly? Thanks for open-sourcing this!
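
For reference, this class of error can be reproduced and localized outside the repo with a few lines of plain PyTorch; the shapes below mirror the traceback, but the snippet is only an illustration of the failure mode, not the repo's code.

# Standalone illustration (not the repo's code): reproducing the same class of
# in-place error and enabling anomaly detection to localize it.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # slows training; for debugging only

conv = nn.Conv2d(256, 256, kernel_size=1)
x = torch.randn(2, 256, 16, 16, requires_grad=True)
y = torch.relu(conv(x))   # ReLU saves its output for the backward pass
y += 1                    # in-place edit bumps the tensor's version counter
y.sum().backward()        # raises the same "modified by an inplace operation" error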

Annbless commented 2 years ago

Hi,

This is a strange problem that we have not encountered in our training.

This could be caused by the training environment, for example the version of PyTorch or mmcv. We used PyTorch 1.9.0 and mmcv 1.3.9 in our implementation. Perhaps you could try these two versions.
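
As a quick sanity check, the versions that the running environment actually resolves can be printed directly (assuming both packages import cleanly):

# Quick environment check; the attributes below are standard PyTorch/mmcv APIs.
import torch
import mmcv

print("torch:", torch.__version__)              # expected: 1.9.0
print("torch CUDA build:", torch.version.cuda)  # should match the mmcv-full wheel (e.g. 11.1)
print("mmcv:", mmcv.__version__)                # expected: 1.3.9
print("CUDA available:", torch.cuda.is_available())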

snailrowen1337 commented 2 years ago

Thanks for your reply! I've changed the PyTorch version and no longer see the error, but training is very slow: on a single A100 GPU I get eta: 82 days. That is not expected, right?

I've installed the package as per below. Are any changes needed for the installation?

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.3.9 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install -e .

Annbless commented 2 years ago

We used 4 A100 machines, i.e., 32 A100 GPUs, to train the ViT-B and ViTAE-B models. The training process took about 2 days. You can try smaller models, such as the ViTAE-S we provide, to reduce the training cost. We are planning to train and release tiny-size models.

snailrowen1337 commented 2 years ago

Ah got it, I thought you used 4 A100 GPUs. Then there's no discrepancy, thanks!

snailrowen1337 commented 2 years ago

Sorry to bother you again. I've increased the number of GPUs but still see a slowdown of roughly 2x compared to your results. I'm trying to troubleshoot this discrepancy. I get the following "warnings" during training:

worker-1: bucket_view.sizes() = [256, 768, 1, 1], strides() = [768, 1, 1, 1] (function operator())
worker-3: [W reducer.cpp:283] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.

Did you see the same warnings? If not, any other ideas on what the issue might be? Thanks!

Annbless commented 2 years ago

Yes, we see the same warning during training. Sometimes the number of dataloader workers is what limits training speed.
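
For context, in an MMDetection-style config the dataloader worker count lives under the data dict; the fragment below is a hypothetical illustration with made-up values, not this repo's actual configuration.

# Hypothetical MMDetection-style config fragment (values are illustrative only,
# not taken from configs/ViTDet/ViTDet-ViT-Base-100e.py).
data = dict(
    samples_per_gpu=2,   # images per GPU
    workers_per_gpu=4,   # dataloader workers per GPU; too few can starve the GPUs
)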

snailrowen1337 commented 2 years ago

Got it, that might be the issue, thanks! Also, it seems like you are not using any pre-trained weights from, e.g., MAE pretraining? Just wanted to confirm that, if possible. Thanks!

Annbless commented 2 years ago

No, we do use pretrained weights from MAE for ViT-B and ViTAE-B, and weights from supervised training for ViTAE-S. We have not tried training the models without pretrained weights. Training the models from scratch is an exciting topic, and we will explore it if we get the chance.

Pretrained weights for ViT-B can be downloaded from MAE, and the pretrained weights for ViTAE-B can be found in ViTAE. The pretrained models can be loaded simply by specifying the path, i.e., --cfg-options model.backbone.pretrained=.
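
Concretely, the override named above can either be passed on the command line or written into the config itself; the checkpoint path below is a placeholder, not a real file location.

# Hypothetical config-file fragment; the .pth path is a placeholder for wherever
# the downloaded MAE (or ViTAE) checkpoint is stored.
model = dict(
    backbone=dict(
        pretrained='path/to/mae_pretrain_vit_base.pth',
    ),
)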

snailrowen1337 commented 2 years ago

Got it, that makes sense, thanks!

When passing the config ViTDet-ViT-Base-100e.py, beyond model.backbone.pretrained=XXX, do I need to pass any additional command-line arguments? Thanks!

Annbless commented 2 years ago

No other command-line parameters are required in the current version. Parameters such as the learning rate, weight decay, and drop path rate are provided in the corresponding configuration files, and they are set consistently with the paper.
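
As an illustration of where those hyperparameters live, an MMDetection-style config typically groups them as below; the values here are made up, and the authoritative ones are in configs/ViTDet/ViTDet-ViT-Base-100e.py.

# Illustrative only -- real values are defined in configs/ViTDet/ViTDet-ViT-Base-100e.py.
optimizer = dict(
    type='AdamW',
    lr=1e-4,             # placeholder learning rate
    weight_decay=0.1,    # placeholder weight decay
)
model = dict(
    backbone=dict(
        drop_path_rate=0.1,  # placeholder drop path rate (key name assumed)
    ),
)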

snailrowen1337 commented 2 years ago

Got it, thanks for the clarifications!!

snailrowen1337 commented 2 years ago

I'm able to get the same results as you, but training is still slower than yours. The repo seems to include an apex runner, but the config file does not seem to use it. Did you use apex when measuring your training speed? Thanks!
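
For what it's worth, MMDetection-based codebases also support a native mixed-precision path enabled purely from the config; whether the apex runner or something like this was used for the reported speed is exactly the open question above, so the line below is only a sketch of the non-apex alternative.

# Sketch of MMDetection's built-in fp16 option (an alternative to apex; not
# necessarily what was used for the reported training speed).
fp16 = dict(loss_scale=512.)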