Hi,
This is a strange problem that we did not encounter in our training.
It could be caused by the training environment, for example, the version of PyTorch or mmcv. We used PyTorch 1.9.0 and mmcv 1.3.9 in our implementation; perhaps you could try these two versions.
Thanks for your reply! I've changed the PyTorch version and no longer see the error. But training appears very slow. On a single A100 GPU, I get eta: 82 days. This is not expected, right?
I've installed the package as per below. Are any changes needed for the installation?
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.3.9 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install -e .
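For reference, a quick way to sanity-check that the installed versions match the ones above (a minimal one-liner, not part of the repo's official instructions):
python -c "import torch, mmcv; print(torch.__version__, torch.version.cuda, mmcv.__version__)"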
We used 4 A100 machines, i.e., 32 A100 GPUs, to train the ViT-B and ViTAE-B models. The training took about 2 days. You can try smaller models, such as the ViTAE-S we provide, to reduce the training cost. We are also planning to train and release tiny-size models.
Ah, got it. I thought you used 4 A100 GPUs. Then there's no discrepancy, thanks!
Sorry to bother you again. I've increased the number of GPUs but still see a slowdown of roughly 2x compared to your results. I'm trying to troubleshoot this discrepancy. I get the following warnings during training:
worker-3: [W reducer.cpp:283] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
worker-1: bucket_view.sizes() = [256, 768, 1, 1], strides() = [768, 1, 1, 1] (function operator())
Did you see the same warnings? If not, any other ideas on what the issue might be? Thanks!
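For anyone debugging this warning later: it fires when a parameter's gradient strides differ from the strides DDP recorded for its bucket view. A minimal, repo-agnostic sketch to list the offending parameters after a backward pass (the helper name find_stride_mismatches is illustrative, not from this codebase):

def find_stride_mismatches(model):
    # Collect parameters whose gradient layout differs from the parameter
    # layout, which is exactly what triggers the DDP warning above.
    mismatched = []
    for name, p in model.named_parameters():
        if p.grad is not None and p.grad.stride() != p.stride():
            mismatched.append((name, p.stride(), p.grad.stride()))
    return mismatched

# usage: call find_stride_mismatches(model) right after loss.backward()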
Yes, we see the same warning during training. Sometimes the number of dataloader workers can be a bottleneck that slows training.
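In mmdetection-style configs, the dataloader workers are usually set in the data dict; a sketch with placeholder values (the right numbers depend on your machine):

data = dict(
    samples_per_gpu=2,   # batch size per GPU
    workers_per_gpu=4,   # dataloader workers per GPU; too few can starve the GPUs
)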
Got it, that might be the issue, thanks! Also, it seems like you are not using any pretrained weights, e.g. from MAE pretraining? Just wanted to confirm, if possible. Thanks!
Actually, we do use pretrained weights: MAE weights for ViT-B and ViTAE-B, and weights from supervised training for ViTAE-S. We have not tried training the models without pretrained weights. Training the models from scratch is an interesting topic, and we will explore it if we get the chance.
Pretrained weights for ViT-B can be downloaded from MAE, and the pretrained weights for ViTAE-B can be found in ViTAE. The pretrained models can simply be loaded by specifying the path, i.e., --cfg-options model.backbone.pretrained=
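For example, a typical mmdetection-style launch might look like this (the config path and checkpoint path are placeholders; tools/dist_train.sh is the standard mmdetection launcher, assuming this repo follows it):
bash tools/dist_train.sh configs/ViTDet/ViTDet-ViT-Base-100e.py 8 --cfg-options model.backbone.pretrained=/path/to/mae_pretrain_vit_base.pth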
Got it, that makes sense, thanks!
When passing the config ViTDet-ViT-Base-100e.py, beyond model.backbone.pretrained=XXX, do I need to pass any additional command-line arguments? Thanks!
Other command-line parameters are not required in the current version. Parameters such as learning rate, weight decay, and drop path rate are provided in the corresponding configuration files, and they are set consistently with the paper.
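For reference, these typically appear in the config roughly as follows (an mmdetection-style sketch with placeholder values, not the repo's exact settings):

optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.1)
model = dict(backbone=dict(drop_path_rate=0.1))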
Got it, thanks for the clarifications!!
I'm able to get the same results as you, but training is still slower than yours. The repo seems to include an apex runner, but the config file does not seem to use it. Did you use apex when measuring your training speed? Thanks!
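For readers with the same question: in mmcv/mmdetection, mixed-precision training is usually enabled in the config rather than on the command line, and whether the apex runner is actually used depends on how the config wires it up. A sketch using mmcv's built-in fp16 support (an assumption about this repo, not a confirmed setting):

fp16 = dict(loss_scale=512.)  # add to the config to enable mmcv's mixed-precision hook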
I am trying to use the script as follows:
This, however, crashes with
Are there any other command line parameters that need to be set? Or am I using the script incorrectly? Thanks for open-sourcing this!