ashkamath / mdetr

Apache License 2.0
969 stars 125 forks

Error While Pretraining #28

Open mmaaz60 opened 3 years ago

mmaaz60 commented 3 years ago

Hi,

Thank you for the great work and for providing the pre-trained models. I was trying to run pre-training following the instructions at pretrain.md, and I am getting the attached error (screenshot). My environment details are listed below. Any help would be appreciated.

  • PyTorch: 1.9.0+cu111
  • TorchVision: 0.10.0
  • Transformers: 4.5.1
  • Hardware: a single machine with 4x RTX A6000
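When reporting environments like this, a minimal sketch for collecting the version numbers programmatically can help avoid transcription errors. This assumes the standard distribution names (`torch`, `torchvision`, `transformers`); the `report_versions` helper is hypothetical, not part of MDETR.

```python
# Hypothetical helper: report installed versions of the packages listed above.
import importlib


def report_versions(packages=("torch", "torchvision", "transformers")):
    """Return {package: version string, "unknown", or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            # Most packages expose __version__; fall back to "unknown" otherwise.
            versions[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```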

alcinos commented 3 years ago

Hi @mmaaz60, thanks for your interest in MDETR.

Could you provide the following information to help debug your error?

mmaaz60 commented 3 years ago

Hi @alcinos,

Thank you for your reply. Please find the required information below.

  • Exact command line
    export CUBLAS_WORKSPACE_CONFIG=:4096:8
    python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5 --batch_size 4 --output-dir ./mdetr/pretrain_batch_4
  • Did you change anything to the dataset?

No, I didn't change anything in the dataset.

  • Have you tried running it on one gpu first?

I have tried running the same command with --nproc_per_node=1 on the same machine and got the same error. However, when I ran distributed training across several PCs, each with a single GPU and connected over LAN, the training started successfully.

  • Have you tried running on cpu first?

No, I didn't try that.
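As background for the --nproc_per_node experiments above: a launch via `torch.distributed.launch --use_env` (as in the command line given earlier) sets per-process environment variables that the script reads to discover its identity. A minimal sketch of that discovery step follows; `get_dist_env` is a hypothetical helper for illustration, not MDETR's actual code.

```python
# Hypothetical helper: read the identity variables that
# torch.distributed.launch --use_env sets for each worker process.
import os


def get_dist_env():
    """Return (rank, world_size, local_rank), defaulting to a single process."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank, world_size, local_rank


if __name__ == "__main__":
    # With --nproc_per_node=4, worker 3 would see RANK=3, WORLD_SIZE=4, LOCAL_RANK=3.
    print(get_dist_env())
```

Running with --nproc_per_node=1 makes every worker see world_size 1, which is why it is a useful first step for isolating errors that are specific to multi-process communication.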