ashkamath / mdetr

cannot do distributed training #16

Closed volkancirik closed 3 years ago

volkancirik commented 3 years ago

Hello,

Thanks for open sourcing!

I am trying to run distributed training for pretraining. Without distributed training, it works fine.

I get the error below with PyTorch versions 1.7.0, 1.7.1, and 1.8.0. Version 1.9 gives ImportError: cannot import name '_new_empty_tensor' from 'torchvision.ops' (/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/ops/__init__.py).

I tried changing this line to losses.backward(retain_graph=True), but that did not fix it. Let me know if you have any suggestions on how to address this issue.

Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/work/vcirik/mdetr/engine.py", line 100, in train_one_epoch
    losses.backward()
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 10]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
toku-n commented 3 years ago

A similar error happens to me with the command

$ python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py --dataset_config configs/pretrain.json --ema

I enabled detect_anomaly and got the error log below.
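
For reference, anomaly detection is just a context manager around the forward and backward passes. A minimal, self-contained illustration of the API (toy tensors, not mdetr code) that produces the extra forward-pass stack trace seen in the log below:

import torch

# Toy forward/backward wrapped in anomaly detection; a real model works the same way.
x = torch.randn(3, requires_grad=True)
with torch.autograd.detect_anomaly():
    y = (x * 2).sum()   # forward
    y.backward()        # if backward failed here, the offending forward op's stack would be printed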

/home/toku/mdetr/engine.py:53: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
[W python_anomaly_mode.cpp:104] Warning: Error detected in EmbeddingBackward. Traceback of forward call that caused the error:
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 558, in main
    model_ema=model_ema,
  File "/home/toku/mdetr/engine.py", line 69, in train_one_epoch
    memory_cache = model(samples, captions, encode_and_save=True)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/toku/mdetr/models/mdetr.py", line 143, in forward
    text_attention_mask=None,
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/toku/mdetr/models/transformer.py", line 121, in forward
    encoded_text = self.text_encoder(tokenized)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 833, in forward
    past_key_values_length=past_key_values_length,
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 147, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 1913, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
 (function _print_stack)
Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 558, in main
    model_ema=model_ema,
  File "/home/toku/mdetr/engine.py", line 101, in train_one_epoch
    losses.backward()
  File "/usr/local/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 14]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I tried Python 3.8 and torch versions 1.7.0, 1.8.0, and 1.9.0, but got the same error. Any help?

p.s. @volkancirik Regarding the error with torch==1.9.0, #15 might help you.
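
For context, that ImportError typically comes from a DETR-style compatibility shim (util/misc.py) whose naive version check misreads newer torchvision version strings such as 0.10 and then tries to import private helpers that no longer exist. A more robust guard (my guess at what the fix amounts to, not a quote of #15) would be:

import torchvision

# Parse "0.9.1+cu111"-style version strings into (major, minor).
_major, _minor = (int(p) for p in torchvision.__version__.split("+")[0].split(".")[:2])

if (_major, _minor) < (0, 7):
    # These private helpers only exist in old torchvision releases.
    from torchvision.ops import _new_empty_tensor
    from torchvision.ops.misc import _output_size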

volkancirik commented 3 years ago

After fixing the 1.9.0 bug, I get a new error.

Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/work/vcirik/mdetr/engine.py", line 79, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/vcirik/mdetr/models/mdetr.py", line 679, in forward
    losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
  File "/work/vcirik/mdetr/models/mdetr.py", line 655, in get_loss
    return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
  File "/work/vcirik/mdetr/models/mdetr.py", line 487, in loss_labels
    eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor31
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 64656) of binary: /work/vcirik/anaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

toku-n commented 3 years ago

I tried pretraining on a single GPU with the command below, and the error did not happen.

$ python main.py --dataset_config configs/pretrain.json --ema

If the problem is an in-place operation, why doesn't the error happen with a single GPU?

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 14]] is at version 3; expected version 2 instead.

toku-n commented 3 years ago

This may be the cause of this issue. https://github.com/pytorch/pytorch/issues/22095

The fix would be:

diff --git a/main.py b/main.py
--- a/main.py
+++ b/main.py
@@ -320,7 +320,8 @@ def main(args):
     model_ema = deepcopy(model) if args.ema else None
     model_without_ddp = model
     if args.distributed:
-        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True, broadcast_buffers=False)
         model_without_ddp = model.module
     n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
     print("number of params:", n_parameters)

The error disappeared, but I am not yet sure whether it will affect the training process.

volkancirik commented 3 years ago

@toku-n, after the broadcast_buffers=False fix I still get RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED (screenshot below).

What's your torch version where single-GPU distributed training is able to run?

[Screenshot: Screen Shot 2021-07-12 at 12 32 14 PM]

alcinos commented 3 years ago

Hi all

I have pushed a fix for the torchvision issue, let me know if that helps. For the distributed issues, I would like to know the following: whether training runs with a single process (--nproc_per_node=1), your exact environment (GPU, Python, CUDA, torch, and transformers versions), and whether downgrading transformers to 4.5.1 changes anything.

Hope this helps

toku-n commented 3 years ago

Hi @volkancirik and @alcinos, I think torch==1.9.0 causes the error RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED.

Here is my environment

TITAN RTX
python 3.8.7
cuda 11.1.1 
cudnn 8.2.1

and I installed the Python libs with

$ pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
$ pip3 install -r requirements.txt

and my pip packages are

certifi==2021.5.30
chardet==4.0.0
click==8.0.1
cloudpickle==1.6.0
cycler==0.10.0
Cython==0.29.23
filelock==3.0.12
flatbuffers==2.0
huggingface-hub==0.0.12
idna==2.10
joblib==1.0.1
kiwisolver==1.3.1
matplotlib==3.4.2
numpy==1.21.0
onnx==1.9.0
onnxruntime==1.8.1
packaging==21.0
panopticapi==0.1
Pillow==8.3.1
prettytable==2.1.0
protobuf==3.17.3
pycocotools==2.0
pyparsing==2.4.7
python-dateutil==2.8.1
PyYAML==5.4.1
regex==2021.7.6
requests==2.25.1
sacremoses==0.0.45
scipy==1.7.0
six==1.16.0
submitit==1.3.3
timm==0.4.12
tokenizers==0.10.3
torch==1.8.1+cu111
torchvision==0.9.1+cu111
tqdm==4.61.2
transformers==4.8.2
typing-extensions==3.10.0.0
urllib3==1.26.6
wcwidth==0.2.5
xmltodict==0.12.0

I changed my code according to #15 and applied the broadcast_buffers=False patch I shared in the comment above.

Does this help?

alcinos commented 3 years ago

@toku-n Your env looks fine. I wouldn't go the broadcast_buffers=False route because I can't guarantee, off the top of my head, that it won't have unwanted side effects. And we definitely didn't require it when training with PyTorch 1.8.1.

My best guess is the transformers version; try downgrading to 4.5.1 as I suggested above and see if that helps.
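
In case it is useful, the downgrade is just a pinned install in the same style as the setup commands above:

$ pip3 install transformers==4.5.1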

volkancirik commented 3 years ago

I can run with nproc_per_node=1 but cannot run when using multiple GPUs.

TITAN RTX
python 3.8.10
cuda 11.1
cudnn 8005
pytorch 1.8.0

Log

$ CUBLAS_WORKSPACE_CONFIG=:16:8 python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --dataset_config configs/pretrain.json --ema
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
| distributed init (rank 2): env://
| distributed init (rank 1): env://
| distributed init (rank 3): env://
| distributed init (rank 0): env://
git:
  sha: N/A, status: clean, branch: N/A

Namespace(GT_type='separate', aux_loss=True, backbone='resnet101', batch_size=1, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='/work/vcirik/all_data/mscoco/', combine_datasets=['flickr'], combine_datasets_val=['flickr'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/pretrain.json', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, do_qa=False, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=-1, epochs=40, eval=False, eval_skip=1, flickr_ann_path='mdetr_annotations', flickr_dataset_path='/work/vcirik/flickr30k/github_flickr30k_entities/', flickr_img_path='/work/vcirik/flickr30k/images/', fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gpu=0, gqa_ann_path='mdetr_annotations', hidden_dim=256, load='', lr=0.0001, lr_backbone=1e-05, lr_drop=35, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=5, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=1, rank=0, refexp_ann_path='mdetr_annotations', refexp_dataset_name='all', remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=False, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=5e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='/work/vcirik/all_data/gqa/images/', weight_decay=0.0001, world_size=4)
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
number of params: 185160324
loading annotations into memory...
Done (t=16.14s)
creating index...
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
index created!
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
loading annotations into memory...
Done (t=0.49s)
creating index...
index created!
Start training
Starting epoch 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1614378083779/work/aten/src/THC/generic/THCTensorMath.cu line=29 error=715 : an illegal instruction was encountered
Traceback (most recent call last):
  File "main.py", line 682, in <module>
    main(args)
  File "main.py", line 582, in main
    train_stats = train_one_epoch(
  File "/work/vcirik/mdetr/engine.py", line 68, in train_one_epoch
    memory_cache = model(samples, captions, encode_and_save=True)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/mdetr/models/mdetr.py", line 129, in forward
    features, pos = self.backbone(samples)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/mdetr/models/backbone.py", line 169, in forward
    xs = self[0](tensor_list)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/mdetr/models/backbone.py", line 74, in forward
    xs = self.body(tensor_list.tensors)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/models/_utils.py", line 63, in forward
    x = module(x)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/models/resnet.py", line 128, in forward
    out = self.conv2(out)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuda runtime error (715) : an illegal instruction was encountered at /opt/conda/conda-bld/pytorch_1614378083779/work/aten/src/THC/generic/THCTensorMath.cu:29
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal instruction was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1614378083779/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff3253662f2 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7ff32536367b in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7ff3255bf219 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff32534e3a4 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e0dda (0x7ff37c2c2dda in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e0e71 (0x7ff37c2c2e71 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55c734ddf2c6 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #7: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #8: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #9: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #10: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #11: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #12: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #13: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #14: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #15: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #16: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #17: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #18: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #19: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #20: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #21: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #22: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #23: <unknown function> + 0x1592ac (0x55c734da52ac in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #24: <unknown function> + 0x158e77 (0x55c734da4e77 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #25: <unknown function> + 0x158e60 (0x55c734da4e60 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #26: <unknown function> + 0x176057 (0x55c734dc2057 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #27: PyDict_SetItemString + 0x61 (0x55c734de33c1 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #28: PyImport_Cleanup + 0x9d (0x55c734e21aad in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #29: Py_FinalizeEx + 0x79 (0x55c734e53a49 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #30: Py_RunMain + 0x183 (0x55c734e55893 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #31: Py_BytesMain + 0x39 (0x55c734e55ca9 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #32: __libc_start_main + 0xf5 (0x7ff3c0c50c05 in /lib64/libc.so.6)
frame #33: <unknown function> + 0x1e21c7 (0x55c734e2e1c7 in /work/vcirik/anaconda3/envs/mdetr/bin/python)

Killing subprocess 17070
Killing subprocess 17071
Killing subprocess 17072
Killing subprocess 17073
Traceback (most recent call last):
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/work/vcirik/anaconda3/envs/mdetr/bin/python', '-u', 'main.py', '--dataset_config', 'configs/pretrain.json', '--ema']' died with <Signals.SIGABRT: 6>.
toku-n commented 3 years ago

Hi @alcinos, the error is gone with transformers==4.5.1! Now I don't need the broadcast_buffers=False patch any more. Thank you so much!

TopCoder2K commented 2 years ago

@alcinos, thank you for your suggestion, but the error seems to remain... I ran fine-tuning on the "all" split and got:

Traceback (most recent call last):
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "run_with_submitit.py", line 98, in __call__
    detection.main(self.args)
  File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR/mdetr/engine.py", line 73, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 679, in forward
    losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
  File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 655, in get_loss
    return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 487, in loss_labels
    eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor51

in the 10562_0_result.pkl file in the experiments/ directory. How can I fix it?

alcinos commented 2 years ago

@TopCoder2K This error is pretty uninformative. Please try debugging on CPU first (--device cpu) to see if you get more information.

TopCoder2K commented 2 years ago

@alcinos, thank you for the advice! I got the following output in job_number_0_log.err:

/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
submitit ERROR (2021-09-18 16:41:20,879) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "run_with_submitit.py", line 98, in __call__
    detection.main(self.args)
  File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR/mdetr/engine.py", line 69, in train_one_epoch
    outputs = model(samples, captions, encode_and_save=False, memory_cache=memory_cache)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 154, in forward
    hs = self.transformer(
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 168, in forward
    hs = self.decoder(
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 232, in forward
    output = layer(
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 448, in forward
    return self.forward_post(
  File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 383, in forward_post
    tgt2 = self.cross_attn_image(
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1031, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 5082, in multi_head_attention_forward
    attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 4830, in _scaled_dot_product_attention
    attn = dropout(attn, p=dropout_p)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 1168, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 28712) is killed by signal: Killed.

What could be wrong with the DataLoader? The command I run looks like: python run_with_submitit.py --dataset_config configs/gqa.json --ngpus 1 --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --nodes 1 --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu.

In the command above I changed the load parameter: originally it was --load https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth, but this threw FileNotFoundError: [Errno 2] No such file or directory: 'https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth', so I downloaded the checkpoint from https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth?download=1.

alcinos commented 2 years ago

The worker may have been killed because the host is out of RAM. I'd suggest running locally first:

python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth  --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu --num_workers 0
TopCoder2K commented 2 years ago

The worker may have been killed because the host is out of RAM. I'd suggest running locally first:

python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth  --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu --num_workers 0

Hmmm, I also came across that advice in the PyTorch issues, so I checked the RAM usage with free -m while the job was running and found that 60 of 64 GB were used. Suspicious...

I also tried to run the command with --num_workers 0 as you suggested; here is the full output:

Not using distributed mode
git:
  sha: dda257d51a9944ee3e4201e7e52e50e5f9faec60, status: has uncommited changes, branch: main

Namespace(aux_loss=False, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='', combine_datasets=['gqa'], combine_datasets_val=['gqa'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/gqa.json', dec_layers=6, device='cpu', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, do_qa=True, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=25, epochs=125, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gqa_ann_path='mdetr_annotations/', gqa_split_type='all', hidden_dim=256, load='pretrained_resnet101_checkpoint.pth', lr=0.00014, lr_backbone=1.4e-05, lr_drop=150, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=0, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=25.0, remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=True, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=7e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='images/', weight_decay=0.0001, world_size=1)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((

number of params: 185879918
loading annotations into memory...
Done (t=233.60s)
creating index...
index created!
loading annotations into memory...
Done (t=45.16s)
creating index...
index created!
Splitting the training set into {args.epoch_chunks} of size approximately  652688
loading annotations into memory...
Done (t=0.45s)
creating index...
index created!
loading from pretrained_resnet101_checkpoint.pth
Start training
Starting epoch 0, sub_epoch 0
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Epoch: [0]  [     0/163172]  eta: 23 days, 4:50:57  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 553.8007 (553.8007)  loss_ce: 2.4964 (2.4964)  loss_bbox: 0.3178 (0.3178)  loss_giou: 0.5582 (0.5582)  loss_contrastive_align: 1.5884 (1.5884)  loss_answer_type: 40.6496 (40.6496)  loss_answer_obj: 0.0000 (0.0000)  loss_answer_attr: 132.8757 (132.8757)  loss_answer_rel: 204.5815 (204.5815)  loss_answer_global: 0.0000 (0.0000)  loss_answer_cat: 170.7331 (170.7331)  loss_ce_unscaled: 2.4964 (2.4964)  loss_bbox_unscaled: 0.0636 (0.0636)  loss_giou_unscaled: 0.2791 (0.2791)  cardinality_error_unscaled: 1.0000 (1.0000)  loss_contrastive_align_unscaled: 1.5884 (1.5884)  loss_answer_type_unscaled: 1.6260 (1.6260)  accuracy_answer_type_unscaled: 0.0000 (0.0000)  loss_answer_obj_unscaled: 0.0000 (0.0000)  accuracy_answer_obj_unscaled: 1.0000 (1.0000)  loss_answer_attr_unscaled: 5.3150 (5.3150)  accuracy_answer_attr_unscaled: 0.0000 (0.0000)  loss_answer_rel_unscaled: 8.1833 (8.1833)  accuracy_answer_rel_unscaled: 0.0000 (0.0000)  loss_answer_global_unscaled: 0.0000 (0.0000)  accuracy_answer_global_unscaled: 1.0000 (1.0000)  loss_answer_cat_unscaled: 6.8293 (6.8293)  accuracy_answer_cat_unscaled: 0.0000 (0.0000)  accuracy_answer_total_unscaled: 0.0000 (0.0000)  time: 12.2855  data: 0.2388  max mem: 0
Epoch: [0]  [    10/163172]  eta: 30 days, 5:15:36  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 406.9724 (418.6352)  loss_ce: 2.1498 (2.1541)  loss_bbox: 0.2972 (0.2831)  loss_giou: 0.4867 (0.4779)  loss_contrastive_align: 1.3373 (1.4061)  loss_answer_type: 39.5703 (38.9707)  loss_answer_obj: 12.8333 (16.3485)  loss_answer_attr: 153.8699 (140.3258)  loss_answer_rel: 187.6746 (186.9401)  loss_answer_global: 0.0000 (0.0000)  loss_answer_cat: 0.0000 (31.7289)  loss_ce_unscaled: 2.1498 (2.1541)  loss_bbox_unscaled: 0.0594 (0.0566)  loss_giou_unscaled: 0.2433 (0.2390)  cardinality_error_unscaled: 1.0000 (0.9091)  loss_contrastive_align_unscaled: 1.3373 (1.4061)  loss_answer_type_unscaled: 1.5828 (1.5588)  accuracy_answer_type_unscaled: 0.2500 (0.2727)  loss_answer_obj_unscaled: 0.5133 (0.6539)  accuracy_answer_obj_unscaled: 1.0000 (0.7273)  loss_answer_attr_unscaled: 6.1548 (5.6130)  accuracy_answer_attr_unscaled: 0.0000 (0.0909)  loss_answer_rel_unscaled: 7.5070 (7.4776)  accuracy_answer_rel_unscaled: 0.0000 (0.0000)  loss_answer_global_unscaled: 0.0000 (0.0000)  accuracy_answer_global_unscaled: 1.0000 (1.0000)  loss_answer_cat_unscaled: 0.0000 (1.2692)  accuracy_answer_cat_unscaled: 1.0000 (0.8182)  accuracy_answer_total_unscaled: 0.0000 (0.0000)  time: 16.0021  data: 0.0998  max mem: 0
Epoch: [0]  [    20/163172]  eta: 28 days, 19:04:06  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 376.5505 (392.1563)  loss_ce: 2.1498 (2.3245)  loss_bbox: 0.2712 (0.2893)  loss_giou: 0.4867 (0.5187)  loss_contrastive_align: 1.4424 (1.5074)  loss_answer_type: 36.7327 (36.6190)  loss_answer_obj: 12.8333 (15.8707)  loss_answer_attr: 136.2527 (106.2132)  loss_answer_rel: 183.5641 (184.4620)  loss_answer_global: 0.0000 (11.7262)  loss_answer_cat: 0.0000 (32.6253)  loss_ce_unscaled: 2.1498 (2.3245)  loss_bbox_unscaled: 0.0542 (0.0579)  loss_giou_unscaled: 0.2433 (0.2594)  cardinality_error_unscaled: 1.0000 (0.9286)  loss_contrastive_align_unscaled: 1.4424 (1.5074)  loss_answer_type_unscaled: 1.4693 (1.4648)  accuracy_answer_type_unscaled: 0.5000 (0.3929)  loss_answer_obj_unscaled: 0.5133 (0.6348)  accuracy_answer_obj_unscaled: 1.0000 (0.6667)  loss_answer_attr_unscaled: 5.4501 (4.2485)  accuracy_answer_attr_unscaled: 0.0000 (0.2857)  loss_answer_rel_unscaled: 7.3426 (7.3785)  accuracy_answer_rel_unscaled: 0.0000 (0.0159)  loss_answer_global_unscaled: 0.0000 (0.4690)  accuracy_answer_global_unscaled: 1.0000 (0.9048)  loss_answer_cat_unscaled: 0.0000 (1.3050)  accuracy_answer_cat_unscaled: 1.0000 (0.8095)  accuracy_answer_total_unscaled: 0.0000 (0.0119)  time: 15.3968  data: 0.0852  max mem: 0
Epoch: [0]  [    40/163172]  eta: 28 days, 0:20:23  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 312.4061 (340.1148)  loss_ce: 2.3961 (2.5253)  loss_bbox: 0.3259 (0.3168)  loss_giou: 0.4626 (0.5189)  loss_contrastive_align: 1.3347 (1.5488)  loss_answer_type: 35.9053 (35.5551)  loss_answer_obj: 16.3921 (14.5897)  loss_answer_attr: 110.3442 (100.5373)  loss_answer_rel: 140.3934 (150.9077)  loss_answer_global: 0.0000 (8.8592)  loss_answer_cat: 0.0000 (24.7559)  loss_ce_unscaled: 2.3961 (2.5253)  loss_bbox_unscaled: 0.0652 (0.0634)  loss_giou_unscaled: 0.2313 (0.2595)  cardinality_error_unscaled: 1.0000 (1.1037)  loss_contrastive_align_unscaled: 1.3347 (1.5488)  loss_answer_type_unscaled: 1.4362 (1.4222)  accuracy_answer_type_unscaled: 0.5000 (0.3902)  loss_answer_obj_unscaled: 0.6557 (0.5836)  accuracy_answer_obj_unscaled: 1.0000 (0.7154)  loss_answer_attr_unscaled: 4.4138 (4.0215)  accuracy_answer_attr_unscaled: 0.5000 (0.3780)  loss_answer_rel_unscaled: 5.6157 (6.0363)  accuracy_answer_rel_unscaled: 0.0000 (0.1911)  loss_answer_global_unscaled: 0.0000 (0.3544)  accuracy_answer_global_unscaled: 1.0000 (0.9268)  loss_answer_cat_unscaled: 0.0000 (0.9902)  accuracy_answer_cat_unscaled: 1.0000 (0.8537)  accuracy_answer_total_unscaled: 0.0000 (0.0488)  time: 14.4052  data: 0.0866  max mem: 0
Killed

The process was killed! So the problem is that there isn't enough RAM? Also, this answer is interesting, but all the files can be read (ls -l shows that every file is readable), so it doesn't seem to be my case.

UPD2: I also tried python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --num_workers 1 and got (the traceback is not complete):

File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

But when I set CUBLAS_WORKSPACE_CONFIG=:4096:8, I get the old error:

RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor51
TopCoder2K commented 2 years ago

Hmmmmm, I'm really confused... It seems I have several different errors at once, which is why I get different output every time. I've just rerun the command python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu --num_workers 0 and got:

Not using distributed mode
git:
  sha: dda257d51a9944ee3e4201e7e52e50e5f9faec60, status: has uncommited changes, branch: main

Namespace(aux_loss=False, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='', combine_datasets=['gqa'], combine_datasets_val=['gqa'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/gqa.json', dec_layers=6, device='cpu', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, do_qa=True, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=25, epochs=125, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gqa_ann_path='mdetr_annotations/', gqa_split_type='all', hidden_dim=256, load='pretrained_resnet101_checkpoint.pth', lr=0.00014, lr_backbone=1.4e-05, lr_drop=150, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=0, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=25.0, remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=True, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=7e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='images/', weight_decay=0.0001, world_size=1)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  warnings.warn((
number of params: 185879918
loading annotations into memory...
Done (t=216.72s)
creating index...
index created!
loading annotations into memory...
Done (t=42.99s)
creating index...
index created!
Splitting the training set into {args.epoch_chunks} of size approximately  652688
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
loading from pretrained_resnet101_checkpoint.pth
Start training
Starting epoch 0, sub_epoch 0
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Epoch: [0]  [     0/163172]  eta: 26 days, 3:45:42  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 553.8007 (553.8007)  loss_ce: 2.4964 (2.4964)  loss_bbox: 0.3178 (0.3178)  loss_giou: 0.5582 (0.5582)  loss_contrastive_align: 1.5884 (1.5884)  loss_answer_type: 40.6496 (40.6496)  loss_answer_obj: 0.0000 (0.0000)  loss_answer_attr: 132.8757 (132.8757)  loss_answer_rel: 204.5815 (204.5815)  loss_answer_global: 0.0000 (0.0000)  loss_answer_cat: 170.7331 (170.7331)  loss_ce_unscaled: 2.4964 (2.4964)  loss_bbox_unscaled: 0.0636 (0.0636)  loss_giou_unscaled: 0.2791 (0.2791)  cardinality_error_unscaled: 1.0000 (1.0000)  loss_contrastive_align_unscaled: 1.5884 (1.5884)  loss_answer_type_unscaled: 1.6260 (1.6260)  accuracy_answer_type_unscaled: 0.0000 (0.0000)  loss_answer_obj_unscaled: 0.0000 (0.0000)  accuracy_answer_obj_unscaled: 1.0000 (1.0000)  loss_answer_attr_unscaled: 5.3150 (5.3150)  accuracy_answer_attr_unscaled: 0.0000 (0.0000)  loss_answer_rel_unscaled: 8.1833 (8.1833)  accuracy_answer_rel_unscaled: 0.0000 (0.0000)  loss_answer_global_unscaled: 0.0000 (0.0000)  accuracy_answer_global_unscaled: 1.0000 (1.0000)  loss_answer_cat_unscaled: 6.8293 (6.8293)  accuracy_answer_cat_unscaled: 0.0000 (0.0000)  accuracy_answer_total_unscaled: 0.0000 (0.0000)  time: 13.8501  data: 0.6887  max mem: 0
Epoch: [0]  [    10/163172]  eta: 29 days, 3:12:39  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 406.9724 (418.6352)  loss_ce: 2.1498 (2.1541)  loss_bbox: 0.2972 (0.2831)  loss_giou: 0.4867 (0.4779)  loss_contrastive_align: 1.3373 (1.4061)  loss_answer_type: 39.5703 (38.9707)  loss_answer_obj: 12.8333 (16.3485)  loss_answer_attr: 153.8699 (140.3258)  loss_answer_rel: 187.6746 (186.9401)  loss_answer_global: 0.0000 (0.0000)  loss_answer_cat: 0.0000 (31.7289)  loss_ce_unscaled: 2.1498 (2.1541)  loss_bbox_unscaled: 0.0594 (0.0566)  loss_giou_unscaled: 0.2433 (0.2390)  cardinality_error_unscaled: 1.0000 (0.9091)  loss_contrastive_align_unscaled: 1.3373 (1.4061)  loss_answer_type_unscaled: 1.5828 (1.5588)  accuracy_answer_type_unscaled: 0.2500 (0.2727)  loss_answer_obj_unscaled: 0.5133 (0.6539)  accuracy_answer_obj_unscaled: 1.0000 (0.7273)  loss_answer_attr_unscaled: 6.1548 (5.6130)  accuracy_answer_attr_unscaled: 0.0000 (0.0909)  loss_answer_rel_unscaled: 7.5070 (7.4776)  accuracy_answer_rel_unscaled: 0.0000 (0.0000)  loss_answer_global_unscaled: 0.0000 (0.0000)  accuracy_answer_global_unscaled: 1.0000 (1.0000)  loss_answer_cat_unscaled: 0.0000 (1.2692)  accuracy_answer_cat_unscaled: 1.0000 (0.8182)  accuracy_answer_total_unscaled: 0.0000 (0.0000)  time: 15.4274  data: 0.1400  max mem: 0
Epoch: [0]  [    20/163172]  eta: 27 days, 14:41:18  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 376.5505 (392.1563)  loss_ce: 2.1498 (2.3245)  loss_bbox: 0.2712 (0.2893)  loss_giou: 0.4867 (0.5187)  loss_contrastive_align: 1.4424 (1.5074)  loss_answer_type: 36.7327 (36.6190)  loss_answer_obj: 12.8333 (15.8707)  loss_answer_attr: 136.2527 (106.2132)  loss_answer_rel: 183.5641 (184.4620)  loss_answer_global: 0.0000 (11.7262)  loss_answer_cat: 0.0000 (32.6253)  loss_ce_unscaled: 2.1498 (2.3245)  loss_bbox_unscaled: 0.0542 (0.0579)  loss_giou_unscaled: 0.2433 (0.2594)  cardinality_error_unscaled: 1.0000 (0.9286)  loss_contrastive_align_unscaled: 1.4424 (1.5074)  loss_answer_type_unscaled: 1.4693 (1.4648)  accuracy_answer_type_unscaled: 0.5000 (0.3929)  loss_answer_obj_unscaled: 0.5133 (0.6348)  accuracy_answer_obj_unscaled: 1.0000 (0.6667)  loss_answer_attr_unscaled: 5.4501 (4.2485)  accuracy_answer_attr_unscaled: 0.0000 (0.2857)  loss_answer_rel_unscaled: 7.3426 (7.3785)  accuracy_answer_rel_unscaled: 0.0000 (0.0159)  loss_answer_global_unscaled: 0.0000 (0.4690)  accuracy_answer_global_unscaled: 1.0000 (0.9048)  loss_answer_cat_unscaled: 0.0000 (1.3050)  accuracy_answer_cat_unscaled: 1.0000 (0.8095)  accuracy_answer_total_unscaled: 0.0000 (0.0119)  time: 14.6610  data: 0.0850  max mem: 0
Epoch: [0]  [    30/163172]  eta: 27 days, 8:44:25  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 323.8377 (359.1600)  loss_ce: 2.3897 (2.5281)  loss_bbox: 0.2712 (0.2944)  loss_giou: 0.4717 (0.5140)  loss_contrastive_align: 1.5241 (1.5773)  loss_answer_type: 34.4717 (35.7402)  loss_answer_obj: 15.2606 (14.9681)  loss_answer_attr: 121.9345 (104.8293)  loss_answer_rel: 166.2059 (164.8905)  loss_answer_global: 0.0000 (11.7170)  loss_answer_cat: 0.0000 (22.1010)  loss_ce_unscaled: 2.3897 (2.5281)  loss_bbox_unscaled: 0.0542 (0.0589)  loss_giou_unscaled: 0.2358 (0.2570)  cardinality_error_unscaled: 1.0000 (1.2016)  loss_contrastive_align_unscaled: 1.5241 (1.5773)  loss_answer_type_unscaled: 1.3789 (1.4296)  accuracy_answer_type_unscaled: 0.5000 (0.4032)  loss_answer_obj_unscaled: 0.6104 (0.5987)  accuracy_answer_obj_unscaled: 1.0000 (0.6935)  loss_answer_attr_unscaled: 4.8774 (4.1932)  accuracy_answer_attr_unscaled: 0.5000 (0.3387)  loss_answer_rel_unscaled: 6.6482 (6.5956)  accuracy_answer_rel_unscaled: 0.0000 (0.1398)  loss_answer_global_unscaled: 0.0000 (0.4687)  accuracy_answer_global_unscaled: 1.0000 (0.9032)  loss_answer_cat_unscaled: 0.0000 (0.8840)  accuracy_answer_cat_unscaled: 1.0000 (0.8710)  accuracy_answer_total_unscaled: 0.0000 (0.0403)  time: 13.9777  data: 0.0829  max mem: 0
Epoch: [0]  [    40/163172]  eta: 27 days, 14:40:18  lr: 0.000140  lr_backbone: 0.000014  lr_text_encoder: 0.000000  loss: 312.4061 (340.1148)  loss_ce: 2.3961 (2.5253)  loss_bbox: 0.3259 (0.3168)  loss_giou: 0.4626 (0.5189)  loss_contrastive_align: 1.3347 (1.5488)  loss_answer_type: 35.9053 (35.5551)  loss_answer_obj: 16.3921 (14.5897)  loss_answer_attr: 110.3442 (100.5373)  loss_answer_rel: 140.3934 (150.9077)  loss_answer_global: 0.0000 (8.8592)  loss_answer_cat: 0.0000 (24.7559)  loss_ce_unscaled: 2.3961 (2.5253)  loss_bbox_unscaled: 0.0652 (0.0634)  loss_giou_unscaled: 0.2313 (0.2595)  cardinality_error_unscaled: 1.0000 (1.1037)  loss_contrastive_align_unscaled: 1.3347 (1.5488)  loss_answer_type_unscaled: 1.4362 (1.4222)  accuracy_answer_type_unscaled: 0.5000 (0.3902)  loss_answer_obj_unscaled: 0.6557 (0.5836)  accuracy_answer_obj_unscaled: 1.0000 (0.7154)  loss_answer_attr_unscaled: 4.4138 (4.0215)  accuracy_answer_attr_unscaled: 0.5000 (0.3780)  loss_answer_rel_unscaled: 5.6157 (6.0363)  accuracy_answer_rel_unscaled: 0.0000 (0.1911)  loss_answer_global_unscaled: 0.0000 (0.3544)  accuracy_answer_global_unscaled: 1.0000 (0.9268)  loss_answer_cat_unscaled: 0.0000 (0.9902)  accuracy_answer_cat_unscaled: 1.0000 (0.8537)  accuracy_answer_total_unscaled: 0.0000 (0.0488)  time: 14.6253  data: 0.0905  max mem: 0
Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR/mdetr/engine.py", line 100, in train_one_epoch
    losses.backward()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 244470272 bytes. Error code 12 (Cannot allocate memory)

And indeed the RAM ran out (I checked with free -m), so 64 GB is not enough. But why do these strange errors hit me earlier than Epoch: [0] [ 40/163172] when I use the GPU?

TopCoder2K commented 2 years ago

I also tried to run evaluation on CLEVR, which is much smaller, using the command from the guide

python main.py --batch_size 64 --dataset_config configs/clevr.json --num_queries 25 --text_encoder_type distilroberta-base --backbone resnet18  --resume https://zenodo.org/record/4721981/files/clevr_checkpoint.pth  --eval

and again got

RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor1421

But I have 64 GB of RAM and a 32 GB GPU, so the whole dataset and the models can easily be allocated. It seems this is not a resource issue... When I run on the CPU instead, it seems to work, consuming only about 11 GB of RAM.

alcinos commented 2 years ago

Try setting this line to false: https://github.com/ashkamath/mdetr/blob/0b747b99e2995c3c429f1391cb8e6104eaec7f21/main.py#L309
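
Concretely, the change would look roughly like this (a sketch; the exact guard around the call in main.py may differ):

# main.py, reproducibility setup
# was: torch.set_deterministic(True)
torch.set_deterministic(False)  # relax PyTorch's strict deterministic-algorithms check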

Also, could you paste the output of python -m torch.utils.collect_env ?

TopCoder2K commented 2 years ago

@alcinos, thank you for your support! 1) I changed the line to torch.set_deterministic(False) and the evaluation on CLEVR completed successfully!

Accumulating evaluation results...
DONE (t=106.84s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.828
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.990
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.988
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.718
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.830
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.907
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.373
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.872
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.872
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.786
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.874
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.934
{'test_clevr_loss': 43.043012584842515, 'test_clevr_loss_ce': 5.731913995213883, 'test_clevr_loss_bbox': 0.05494903914992771, 'test_clevr_loss_giou': 0.18080522204549035, 'test_clevr_loss_contrastive_align': 0.739596603851473, 'test_clevr_loss_ce_0': 6.165245272397181, 'test_clevr_loss_bbox_0': 0.06370836580744324, 'test_clevr_loss_giou_0': 0.21175039811345905, 'test_clevr_loss_contrastive_align_0': 1.6258955893866438, 'test_clevr_loss_ce_1': 5.9863008091881005, 'test_clevr_loss_bbox_1': 0.06160141016855683, 'test_clevr_loss_giou_1': 0.20656590048007592, 'test_clevr_loss_contrastive_align_1': 1.2556593710804556, 'test_clevr_loss_ce_2': 5.867624813379281, 'test_clevr_loss_bbox_2': 0.059281312062726084, 'test_clevr_loss_giou_2': 0.19940767474091095, 'test_clevr_loss_contrastive_align_2': 1.0192198974077205, 'test_clevr_loss_ce_3': 5.771324536906167, 'test_clevr_loss_bbox_3': 0.05730514128536698, 'test_clevr_loss_giou_3': 0.194069205697486, 'test_clevr_loss_contrastive_align_3': 0.817971428532039, 'test_clevr_loss_ce_4': 5.741090293833827, 'test_clevr_loss_bbox_4': 0.05582752994519147, 'test_clevr_loss_giou_4': 0.18599585877292799, 'test_clevr_loss_contrastive_align_4': 0.7558641342681423, 'test_clevr_loss_answer_type': 3.604563896833151e-06, 'test_clevr_loss_answer_binary': 0.00632767150950992, 'test_clevr_loss_answer_reg': 0.025332900922053852, 'test_clevr_loss_answer_attr': 0.0023749297052867175, 'test_clevr_loss_ce_unscaled': 5.731913995213883, 'test_clevr_loss_bbox_unscaled': 0.010989807827442681, 'test_clevr_loss_giou_unscaled': 0.09040261102274517, 'test_clevr_cardinality_error_unscaled': 0.01639825085324232, 'test_clevr_loss_contrastive_align_unscaled': 0.739596603851473, 'test_clevr_loss_ce_0_unscaled': 6.165245272397181, 'test_clevr_loss_bbox_0_unscaled': 0.012741673153303818, 'test_clevr_loss_giou_0_unscaled': 0.10587519905672953, 'test_clevr_cardinality_error_0_unscaled': 1.9568324677976732, 'test_clevr_loss_contrastive_align_0_unscaled': 1.6258955893866438, 'test_clevr_loss_ce_1_unscaled': 5.9863008091881005, 'test_clevr_loss_bbox_1_unscaled': 0.012320282033075652, 'test_clevr_loss_giou_1_unscaled': 0.10328295024003796, 'test_clevr_cardinality_error_1_unscaled': 1.115309238815267, 'test_clevr_loss_contrastive_align_1_unscaled': 1.2556593710804556, 'test_clevr_loss_ce_2_unscaled': 5.867624813379281, 'test_clevr_loss_bbox_2_unscaled': 0.011856262410399679, 'test_clevr_loss_giou_2_unscaled': 0.09970383737045548, 'test_clevr_cardinality_error_2_unscaled': 0.5807346361705162, 'test_clevr_loss_contrastive_align_2_unscaled': 1.0192198974077205, 'test_clevr_loss_ce_3_unscaled': 5.771324536906167, 'test_clevr_loss_bbox_3_unscaled': 0.01146102826025197, 'test_clevr_loss_giou_3_unscaled': 0.097034602848743, 'test_clevr_cardinality_error_3_unscaled': 0.1629644974624363, 'test_clevr_loss_contrastive_align_3_unscaled': 0.817971428532039, 'test_clevr_loss_ce_4_unscaled': 5.741090293833827, 'test_clevr_loss_bbox_4_unscaled': 0.01116550598355525, 'test_clevr_loss_giou_4_unscaled': 0.09299792938646399, 'test_clevr_cardinality_error_4_unscaled': 0.04800581955033078, 'test_clevr_loss_contrastive_align_4_unscaled': 0.7558641342681423, 'test_clevr_loss_answer_type_unscaled': 3.604563896833151e-06, 'test_clevr_accuracy_answer_type_unscaled': 1.0, 'test_clevr_loss_answer_binary_unscaled': 0.00632767150950992, 'test_clevr_accuracy_answer_binary_unscaled': 0.9983339897855964, 'test_clevr_loss_answer_reg_unscaled': 0.025332900922053852, 'test_clevr_accuracy_answer_reg_unscaled': 0.9925868332182588, 
'test_clevr_loss_answer_attr_unscaled': 0.0023749297052867175, 'test_clevr_accuracy_answer_attr_unscaled': 0.9996434087283376, 'test_clevr_accuracy_answer_total_unscaled': 0.9974136092150171, 'test_clevr_coco_eval_bbox': [0.8280577103267962, 0.9900612713060782, 0.9877238191805104, 0.7180399374430324, 0.8298110005371897, 0.9071876701127315, 0.37329648287548844, 0.8716828824051552, 0.8719843337521758, 0.7862159823816266, 0.8740682349128326, 0.9336712955657251], 'n_parameters': 111200939}

2) Here is the output of python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.8 (64-bit runtime)
Python platform: Linux-4.15.0-156-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.20.3           py38hf144106_0  
[conda] numpy-base                1.20.3           py38h74d4b33_0  
[conda] torch                     1.9.0                    pypi_0    pypi
[conda] torchvision               0.10.0                   pypi_0    pypi

I'll also try VQA2 again a bit later, but this gives me hope! The only thing that confuses me is that it runs through run_with_submitit.py, not main.py. But let's try :)

UPD: as for VQA2, this time I got an ordinary error:

submitit ERROR (2021-09-20 22:20:53,411) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "run_with_submitit.py", line 98, in __call__
    detection.main(self.args)
  File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR/mdetr/engine.py", line 54, in train_one_epoch
    for i, batch_dict in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/pchelintsev/MDETR/mdetr/util/metrics.py", line 133, in log_every
    for obj in iterable:
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

It seems 64 GB of RAM isn't enough. How much did you use?...
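
One thing worth trying before adding RAM is lowering the number of DataLoader worker processes, since every forked worker inherits a copy of the parent process's memory. A rough sketch (the variable names below are placeholders; the actual DataLoader construction in main.py may differ):

import torch

# hypothetical sketch: fewer worker processes -> fewer forked copies of the parent process
data_loader_train = torch.utils.data.DataLoader(
    dataset_train,                       # placeholder for the actual training dataset
    batch_sampler=batch_sampler_train,   # placeholder batch sampler
    collate_fn=collate_fn,               # placeholder collate function
    num_workers=2,                       # fewer workers -> lower peak host RAM
)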

4-0-4-notfound commented 2 years ago

use nn.SyncBatchNorm instead of nn.BatchNormxD in DDP

TopCoder2K commented 2 years ago

use nn.SyncBatchNorm instead of nn.BatchNormxD in DDP

Sorry, I don't understand, @4-0-4-notfound. Could you provide more information? Where should I change the BatchNorm and why?

4-0-4-notfound commented 2 years ago

use nn.SyncBatchNorm instead of nn.BatchNormxD in DDP

Sorry, I don't understand, @4-0-4-notfound. Could you provide more information? Where should I change the BatchNorm and why?

In my case it was the DDP bug with broadcast_buffers: the original BatchNorm layers have buffers that DDP broadcasts, so I had to replace the original BatchNorm with SyncBatchNorm to work around the broadcast_buffers bug in DDP. https://github.com/pytorch/pytorch/issues/22095#issuecomment-941522465
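
Roughly, the conversion is a one-line wrapper before building DDP (a sketch; model and local_rank here are placeholders for whatever your training script already defines):

import torch

# replace every BatchNorm*d in the model with SyncBatchNorm, then wrap in DDP
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank
)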

This is just a fix for the error reported by the issue author, i.e.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 10]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

It may not work in your case.