Closed volkancirik closed 3 years ago
A similar error happens to me with this command:
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py --dataset_config configs/pretrain.json --ema
I enabled detect_anomaly and got the error log below.
/home/toku/mdetr/engine.py:53: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with torch.autograd.detect_anomaly():
[W python_anomaly_mode.cpp:104] Warning: Error detected in EmbeddingBackward. Traceback of forward call that caused the error:
File "main.py", line 643, in <module>
main(args)
File "main.py", line 558, in main
model_ema=model_ema,
File "/home/toku/mdetr/engine.py", line 69, in train_one_epoch
memory_cache = model(samples, captions, encode_and_save=True)
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/toku/mdetr/models/mdetr.py", line 143, in forward
text_attention_mask=None,
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/toku/mdetr/models/transformer.py", line 121, in forward
encoded_text = self.text_encoder(tokenized)
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 833, in forward
past_key_values_length=past_key_values_length,
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 147, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 1913, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
(function _print_stack)
Traceback (most recent call last):
File "main.py", line 643, in <module>
main(args)
File "main.py", line 558, in main
model_ema=model_ema,
File "/home/toku/mdetr/engine.py", line 101, in train_one_epoch
losses.backward()
File "/usr/local/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 14]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I tried Python 3.8 with torch 1.7.0, 1.8.0, and 1.9.0 but got the same error each time. Any help?
p.s. @volkancirik About the error with torch==1.9.0, #15 might help you.
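For anyone who wants to see this RuntimeError in isolation, here is a minimal sketch. It is not MDETR code. It assumes, as the anomaly trace above suggests, that the tensor in question is an index tensor saved for EmbeddingBackward; the toy embedding sizes and the function name are made up for illustration.

```python
import torch
from torch import nn


def backward_after_inplace_edit():
    """Edit an index tensor in place between forward and backward.

    Returns (version_before, version_after, error_message).
    """
    emb = nn.Embedding(10, 4)                    # toy sizes, chosen arbitrarily
    idx = torch.zeros(2, 14, dtype=torch.long)   # same shape as in the log above
    out = emb(idx)                               # autograd saves `idx` for the backward pass
    before = idx._version
    idx.zero_()                                  # in-place write: values unchanged, version bumped
    after = idx._version
    try:
        out.sum().backward()
    except RuntimeError as err:
        return before, after, str(err)
    return before, after, ""


before, after, message = backward_after_inplace_edit()
print(f"version {before} -> {after}")
print(message.splitlines()[0] if message else "no error raised")
```

Note that the values of the index tensor never change here. Autograd only compares version counters, so any in-place write to a saved tensor between forward and backward is enough to trigger the check.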
After fixing the 1.9.0 bug, I get a new error.
Traceback (most recent call last):
File "main.py", line 643, in <module>
main(args)
File "main.py", line 546, in main
train_stats = train_one_epoch(
File "/work/vcirik/mdetr/engine.py", line 79, in train_one_epoch
loss_dict.update(criterion(outputs, targets, positive_map))
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/work/vcirik/mdetr/models/mdetr.py", line 679, in forward
losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
File "/work/vcirik/mdetr/models/mdetr.py", line 655, in get_loss
return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
File "/work/vcirik/mdetr/models/mdetr.py", line 487, in loss_labels
eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor31
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 64656) of binary: /work/vcirik/anaconda3/envs/mdetr/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
I tried pretraining on a single GPU with the command below, and the error did not happen.
$ python main.py --dataset_config configs/pretrain.json --ema
If the problem is an in-place operation, why did the error not happen with a single GPU?
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 14]] is at version 3; expected version 2 instead.
This may be the cause of this issue. https://github.com/pytorch/pytorch/issues/22095
The fix will be
diff --git a/main.py b/main.py
--- a/main.py
+++ b/main.py
@@ -320,7 +320,8 @@ def main(args):
     model_ema = deepcopy(model) if args.ema else None
     model_without_ddp = model
     if args.distributed:
-        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True, broadcast_buffers=False)
         model_without_ddp = model.module
     n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
     print("number of params:", n_parameters)
The error disappeared, but I am not yet sure whether it will affect the training process.
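To judge what broadcast_buffers=False would stop synchronizing, it may help to list the buffers a model actually registers. According to the PyTorch docs, this flag controls whether DistributedDataParallel broadcasts the module's buffers at the beginning of forward. That would also be consistent with the error not appearing in the plain single-GPU run, where the model is not wrapped in DDP, though this is a guess rather than a confirmed diagnosis. The sketch below uses a hypothetical toy model, not MDETR, and the helper name list_buffers is made up.

```python
import torch
from torch import nn


def list_buffers(module: nn.Module):
    """Return (name, shape) for every registered buffer of `module`."""
    return [(name, tuple(buf.shape)) for name, buf in module.named_buffers()]


# Hypothetical toy model: BatchNorm keeps its running statistics as buffers.
toy = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))
buffers = list_buffers(toy)
for name, shape in buffers:
    print(name, shape)
```

Running the same helper on the real model (for example on model_without_ddp in main.py) would show which tensors are no longer kept in sync across ranks once the flag is turned off.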
@toku-n, after the broadcast_buffers=False fix I still get RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED
(screenshot below)
Which torch version are you using where single-GPU distributed training is able to run?
Hi all
I have pushed a fix for the torchvision issue, let me know if that helps. For the distributed issues I would like to know the following:
Hope this helps
Hi @volkancirik and @alcinos
I think torch==1.9.0 causes the error RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED
Here is my environment
TITAN RTX
python 3.8.7
cuda 11.1.1
cudnn 8.2.1
and I installed python libs with
$ pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
$ pip3 install -r requirements.txt
and my pip libs are
certifi==2021.5.30
chardet==4.0.0
click==8.0.1
cloudpickle==1.6.0
cycler==0.10.0
Cython==0.29.23
filelock==3.0.12
flatbuffers==2.0
huggingface-hub==0.0.12
idna==2.10
joblib==1.0.1
kiwisolver==1.3.1
matplotlib==3.4.2
numpy==1.21.0
onnx==1.9.0
onnxruntime==1.8.1
packaging==21.0
panopticapi==0.1
Pillow==8.3.1
prettytable==2.1.0
protobuf==3.17.3
pycocotools==2.0
pyparsing==2.4.7
python-dateutil==2.8.1
PyYAML==5.4.1
regex==2021.7.6
requests==2.25.1
sacremoses==0.0.45
scipy==1.7.0
six==1.16.0
submitit==1.3.3
timm==0.4.12
tokenizers==0.10.3
torch==1.8.1+cu111
torchvision==0.9.1+cu111
tqdm==4.61.2
transformers==4.8.2
typing-extensions==3.10.0.0
urllib3==1.26.6
wcwidth==0.2.5
xmltodict==0.12.0
I changed my code with #15 and the broadcast_buffers=False patch I shared in the comment above.
Will this help you?
@toku-n Your env looks fine. I wouldn't go the broadcast_buffers=False route because I can't guarantee, off the top of my head, that it will not have unwanted side effects. And we definitely didn't require it when training with pytorch 1.8.1.
My best guess is the transformers version; try downgrading to 4.5.1 as I suggested above and see if this helps.
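A quick way to compare an environment against a pin like this is to diff the `pip freeze` output against the wanted versions. The following is a stdlib-only sketch; the helper names parse_freeze and mismatches are hypothetical, and the sample lines are taken from the package list posted above.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style text into {package_name: version}."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not `name==version`.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def mismatches(installed, wanted):
    """Return {name: (installed_version, wanted_version)} for every differing pin."""
    return {
        name: (installed.get(name), want)
        for name, want in wanted.items()
        if installed.get(name) != want
    }


freeze_output = """
torch==1.8.1+cu111
torchvision==0.9.1+cu111
transformers==4.8.2
"""
wanted = {"transformers": "4.5.1"}

report = mismatches(parse_freeze(freeze_output), wanted)
print(report)
```

In practice freeze_output would come from running `pip freeze` in the training environment; a package that is missing entirely shows up with None as its installed version.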
I can run with nproc_per_node=1, but not when using multiple GPUs.
TITAN RTX
python 3.8.10
cuda 11.1
cudnn 8005
pytorch 1.8.0
Log
$ CUBLAS_WORKSPACE_CONFIG=:16:8 python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --dataset_config configs/pretrain.json --ema
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
| distributed init (rank 2): env://
| distributed init (rank 1): env://| distributed init (rank 3): env://
| distributed init (rank 0): env://
git:
sha: N/A, status: clean, branch: N/A
Namespace(GT_type='separate', aux_loss=True, backbone='resnet101', batch_size=1, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='/work/vcirik/all_data/mscoco/', combine_datasets=['flickr'], combine_datasets_val=['flickr'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/pretrain.json', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, do_qa=False, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=-1, epochs=40, eval=False, eval_skip=1, flickr_ann_path='mdetr_annotations', flickr_dataset_path='/work/vcirik/flickr30k/github_flickr30k_entities/', flickr_img_path='/work/vcirik/flickr30k/images/', fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gpu=0, gqa_ann_path='mdetr_annotations', hidden_dim=256, load='', lr=0.0001, lr_backbone=1e-05, lr_drop=35, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=5, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=1, rank=0, refexp_ann_path='mdetr_annotations', refexp_dataset_name='all', remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=False, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=5e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='/work/vcirik/all_data/gqa/images/', weight_decay=0.0001, world_size=4)
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/__init__.py:421: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
number of params: 185160324
loading annotations into memory...
Done (t=16.14s)
creating index...
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
index created!
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/utils/data/dataloader.py:474: UserWarning: This DataLoader will create 5 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
loading annotations into memory...
Done (t=0.49s)
creating index...
index created!
Start training
Starting epoch 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1614378083779/work/aten/src/THC/generic/THCTensorMath.cu line=29 error=715 : an illegal instruction was encountered
Traceback (most recent call last):
File "main.py", line 682, in <module>
main(args)
File "main.py", line 582, in main
train_stats = train_one_epoch(
File "/work/vcirik/mdetr/engine.py", line 68, in train_one_epoch
memory_cache = model(samples, captions, encode_and_save=True)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/mdetr/models/mdetr.py", line 129, in forward
features, pos = self.backbone(samples)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/mdetr/models/backbone.py", line 169, in forward
xs = self[0](tensor_list)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/mdetr/models/backbone.py", line 74, in forward
xs = self.body(tensor_list.tensors)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/models/_utils.py", line 63, in forward
x = module(x)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/models/resnet.py", line 128, in forward
out = self.conv2(out)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuda runtime error (715) : an illegal instruction was encountered at /opt/conda/conda-bld/pytorch_1614378083779/work/aten/src/THC/generic/THCTensorMath.cu:29
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal instruction was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1614378083779/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff3253662f2 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7ff32536367b in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7ff3255bf219 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff32534e3a4 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e0dda (0x7ff37c2c2dda in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e0e71 (0x7ff37c2c2e71 in /work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1932c6 (0x55c734ddf2c6 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #7: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #8: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #9: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #10: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #11: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #12: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #13: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #14: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #15: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #16: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #17: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #18: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #19: <unknown function> + 0x15878b (0x55c734da478b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #20: <unknown function> + 0xe81c4 (0x55c734d341c4 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #21: <unknown function> + 0x15893b (0x55c734da493b in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #22: <unknown function> + 0x193141 (0x55c734ddf141 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #23: <unknown function> + 0x1592ac (0x55c734da52ac in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #24: <unknown function> + 0x158e77 (0x55c734da4e77 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #25: <unknown function> + 0x158e60 (0x55c734da4e60 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #26: <unknown function> + 0x176057 (0x55c734dc2057 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #27: PyDict_SetItemString + 0x61 (0x55c734de33c1 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #28: PyImport_Cleanup + 0x9d (0x55c734e21aad in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #29: Py_FinalizeEx + 0x79 (0x55c734e53a49 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #30: Py_RunMain + 0x183 (0x55c734e55893 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #31: Py_BytesMain + 0x39 (0x55c734e55ca9 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
frame #32: __libc_start_main + 0xf5 (0x7ff3c0c50c05 in /lib64/libc.so.6)
frame #33: <unknown function> + 0x1e21c7 (0x55c734e2e1c7 in /work/vcirik/anaconda3/envs/mdetr/bin/python)
Killing subprocess 17070
Killing subprocess 17071
Killing subprocess 17072
Killing subprocess 17073
Traceback (most recent call last):
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/work/vcirik/anaconda3/envs/mdetr/bin/python', '-u', 'main.py', '--dataset_config', 'configs/pretrain.json', '--ema']' died with <Signals.SIGABRT: 6>.
Hi @alcinos, the error is gone with transformers==4.5.1! Now I don't need the broadcast_buffers=False patch any more. Thank you so much!
@alcinos, thank you for your suggestion, but the error seems to have remained... I ran finetuning on the "all" split and got:
Traceback (most recent call last):
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
self._result = self.function(*self.args, **self.kwargs)
File "run_with_submitit.py", line 98, in __call__
detection.main(self.args)
File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
train_stats = train_one_epoch(
File "/home/pchelintsev/MDETR/mdetr/engine.py", line 73, in train_one_epoch
loss_dict.update(criterion(outputs, targets, positive_map))
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 679, in forward
losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 655, in get_loss
return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 487, in loss_labels
eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor51
in the 10562_0_result.pkl file in the experiments/ directory.
How can I fix it?
@TopCoder2K This error is pretty uninformative. Please try debugging on cpu first (--device cpu) to see if you get more information.
@alcinos, thank you for the advice! I got the following output in job_number_0_log.err:
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
submitit ERROR (2021-09-18 16:41:20,879) - Submitted job triggered an exception
Traceback (most recent call last):
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
process_job(args.folder)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
raise error
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
self._result = self.function(*self.args, **self.kwargs)
File "run_with_submitit.py", line 98, in __call__
detection.main(self.args)
File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
train_stats = train_one_epoch(
File "/home/pchelintsev/MDETR/mdetr/engine.py", line 69, in train_one_epoch
outputs = model(samples, captions, encode_and_save=False, memory_cache=memory_cache)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/mdetr.py", line 154, in forward
hs = self.transformer(
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 168, in forward
hs = self.decoder(
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 232, in forward
output = layer(
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 448, in forward
return self.forward_post(
File "/home/pchelintsev/MDETR/mdetr/models/transformer.py", line 383, in forward_post
tgt2 = self.cross_attn_image(
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1031, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 5082, in multi_head_attention_forward
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 4830, in _scaled_dot_product_attention
attn = dropout(attn, p=dropout_p)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 1168, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 28712) is killed by signal: Killed.
What could be wrong with the dataloader? The command I ran was:
python run_with_submitit.py --dataset_config configs/gqa.json --ngpus 1 --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --nodes 1 --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu
In the command above I changed the --load parameter. It was originally --load https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth
but that threw FileNotFoundError: [Errno 2] No such file or directory: 'https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth'
so I downloaded the checkpoint from https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth?download=1.
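The FileNotFoundError above suggests the URL was passed to a loader that expects a local file. A minimal sketch of a loader that accepts both forms, assuming PyTorch is installed; `is_remote` and `load_checkpoint` are hypothetical helper names, not functions from the MDETR repository:

```python
from urllib.parse import urlparse


def is_remote(path):
    """Return True if `path` looks like an http(s) URL rather than a local file."""
    return urlparse(path).scheme in ("http", "https")


def load_checkpoint(path):
    """Load a checkpoint from a URL or a local path onto the CPU."""
    import torch  # imported lazily so is_remote() works without PyTorch

    if is_remote(path):
        # Downloads into the torch hub cache, then loads the cached file.
        return torch.hub.load_state_dict_from_url(path, map_location="cpu")
    return torch.load(path, map_location="cpu")
```

Downloading the file by hand, as done above, sidesteps the problem in the same way.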
The worker may have been killed because the host ran out of RAM. I’d suggest running locally first:
python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu --num_workers 0
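To check whether memory really is the culprit, it can help to log available memory while the job runs. A rough sketch, assuming a Linux host (it reads /proc/meminfo); `parse_meminfo` and `available_gib` are hypothetical helpers, not part of MDETR:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kibibytes}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate in GiB (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"] / 2**20
```

Printing available_gib() every few iterations would show whether memory shrinks steadily before the process is killed.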
Hmm, I also came across that advice in the PyTorch issues, so I checked the RAM usage with free -m
while the job was running and found that 60 of 64 GB were in use. Suspicious...
I also tried running the command with --num_workers 0
as you suggested; here is the full output:
Not using distributed mode
git:
sha: dda257d51a9944ee3e4201e7e52e50e5f9faec60, status: has uncommited changes, branch: main
Namespace(aux_loss=False, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='', combine_datasets=['gqa'], combine_datasets_val=['gqa'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/gqa.json', dec_layers=6, device='cpu', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, do_qa=True, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=25, epochs=125, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gqa_ann_path='mdetr_annotations/', gqa_split_type='all', hidden_dim=256, load='pretrained_resnet101_checkpoint.pth', lr=0.00014, lr_backbone=1.4e-05, lr_drop=150, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=0, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=25.0, remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=True, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=7e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='images/', weight_decay=0.0001, world_size=1)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
number of params: 185879918
loading annotations into memory...
Done (t=233.60s)
creating index...
index created!
loading annotations into memory...
Done (t=45.16s)
creating index...
index created!
Splitting the training set into {args.epoch_chunks} of size approximately 652688
loading annotations into memory...
Done (t=0.45s)
creating index...
index created!
loading from pretrained_resnet101_checkpoint.pth
Start training
Starting epoch 0, sub_epoch 0
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Epoch: [0] [ 0/163172] eta: 23 days, 4:50:57 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 553.8007 (553.8007) loss_ce: 2.4964 (2.4964) loss_bbox: 0.3178 (0.3178) loss_giou: 0.5582 (0.5582) loss_contrastive_align: 1.5884 (1.5884) loss_answer_type: 40.6496 (40.6496) loss_answer_obj: 0.0000 (0.0000) loss_answer_attr: 132.8757 (132.8757) loss_answer_rel: 204.5815 (204.5815) loss_answer_global: 0.0000 (0.0000) loss_answer_cat: 170.7331 (170.7331) loss_ce_unscaled: 2.4964 (2.4964) loss_bbox_unscaled: 0.0636 (0.0636) loss_giou_unscaled: 0.2791 (0.2791) cardinality_error_unscaled: 1.0000 (1.0000) loss_contrastive_align_unscaled: 1.5884 (1.5884) loss_answer_type_unscaled: 1.6260 (1.6260) accuracy_answer_type_unscaled: 0.0000 (0.0000) loss_answer_obj_unscaled: 0.0000 (0.0000) accuracy_answer_obj_unscaled: 1.0000 (1.0000) loss_answer_attr_unscaled: 5.3150 (5.3150) accuracy_answer_attr_unscaled: 0.0000 (0.0000) loss_answer_rel_unscaled: 8.1833 (8.1833) accuracy_answer_rel_unscaled: 0.0000 (0.0000) loss_answer_global_unscaled: 0.0000 (0.0000) accuracy_answer_global_unscaled: 1.0000 (1.0000) loss_answer_cat_unscaled: 6.8293 (6.8293) accuracy_answer_cat_unscaled: 0.0000 (0.0000) accuracy_answer_total_unscaled: 0.0000 (0.0000) time: 12.2855 data: 0.2388 max mem: 0
Epoch: [0] [ 10/163172] eta: 30 days, 5:15:36 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 406.9724 (418.6352) loss_ce: 2.1498 (2.1541) loss_bbox: 0.2972 (0.2831) loss_giou: 0.4867 (0.4779) loss_contrastive_align: 1.3373 (1.4061) loss_answer_type: 39.5703 (38.9707) loss_answer_obj: 12.8333 (16.3485) loss_answer_attr: 153.8699 (140.3258) loss_answer_rel: 187.6746 (186.9401) loss_answer_global: 0.0000 (0.0000) loss_answer_cat: 0.0000 (31.7289) loss_ce_unscaled: 2.1498 (2.1541) loss_bbox_unscaled: 0.0594 (0.0566) loss_giou_unscaled: 0.2433 (0.2390) cardinality_error_unscaled: 1.0000 (0.9091) loss_contrastive_align_unscaled: 1.3373 (1.4061) loss_answer_type_unscaled: 1.5828 (1.5588) accuracy_answer_type_unscaled: 0.2500 (0.2727) loss_answer_obj_unscaled: 0.5133 (0.6539) accuracy_answer_obj_unscaled: 1.0000 (0.7273) loss_answer_attr_unscaled: 6.1548 (5.6130) accuracy_answer_attr_unscaled: 0.0000 (0.0909) loss_answer_rel_unscaled: 7.5070 (7.4776) accuracy_answer_rel_unscaled: 0.0000 (0.0000) loss_answer_global_unscaled: 0.0000 (0.0000) accuracy_answer_global_unscaled: 1.0000 (1.0000) loss_answer_cat_unscaled: 0.0000 (1.2692) accuracy_answer_cat_unscaled: 1.0000 (0.8182) accuracy_answer_total_unscaled: 0.0000 (0.0000) time: 16.0021 data: 0.0998 max mem: 0
Epoch: [0] [ 20/163172] eta: 28 days, 19:04:06 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 376.5505 (392.1563) loss_ce: 2.1498 (2.3245) loss_bbox: 0.2712 (0.2893) loss_giou: 0.4867 (0.5187) loss_contrastive_align: 1.4424 (1.5074) loss_answer_type: 36.7327 (36.6190) loss_answer_obj: 12.8333 (15.8707) loss_answer_attr: 136.2527 (106.2132) loss_answer_rel: 183.5641 (184.4620) loss_answer_global: 0.0000 (11.7262) loss_answer_cat: 0.0000 (32.6253) loss_ce_unscaled: 2.1498 (2.3245) loss_bbox_unscaled: 0.0542 (0.0579) loss_giou_unscaled: 0.2433 (0.2594) cardinality_error_unscaled: 1.0000 (0.9286) loss_contrastive_align_unscaled: 1.4424 (1.5074) loss_answer_type_unscaled: 1.4693 (1.4648) accuracy_answer_type_unscaled: 0.5000 (0.3929) loss_answer_obj_unscaled: 0.5133 (0.6348) accuracy_answer_obj_unscaled: 1.0000 (0.6667) loss_answer_attr_unscaled: 5.4501 (4.2485) accuracy_answer_attr_unscaled: 0.0000 (0.2857) loss_answer_rel_unscaled: 7.3426 (7.3785) accuracy_answer_rel_unscaled: 0.0000 (0.0159) loss_answer_global_unscaled: 0.0000 (0.4690) accuracy_answer_global_unscaled: 1.0000 (0.9048) loss_answer_cat_unscaled: 0.0000 (1.3050) accuracy_answer_cat_unscaled: 1.0000 (0.8095) accuracy_answer_total_unscaled: 0.0000 (0.0119) time: 15.3968 data: 0.0852 max mem: 0
Epoch: [0] [ 40/163172] eta: 28 days, 0:20:23 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 312.4061 (340.1148) loss_ce: 2.3961 (2.5253) loss_bbox: 0.3259 (0.3168) loss_giou: 0.4626 (0.5189) loss_contrastive_align: 1.3347 (1.5488) loss_answer_type: 35.9053 (35.5551) loss_answer_obj: 16.3921 (14.5897) loss_answer_attr: 110.3442 (100.5373) loss_answer_rel: 140.3934 (150.9077) loss_answer_global: 0.0000 (8.8592) loss_answer_cat: 0.0000 (24.7559) loss_ce_unscaled: 2.3961 (2.5253) loss_bbox_unscaled: 0.0652 (0.0634) loss_giou_unscaled: 0.2313 (0.2595) cardinality_error_unscaled: 1.0000 (1.1037) loss_contrastive_align_unscaled: 1.3347 (1.5488) loss_answer_type_unscaled: 1.4362 (1.4222) accuracy_answer_type_unscaled: 0.5000 (0.3902) loss_answer_obj_unscaled: 0.6557 (0.5836) accuracy_answer_obj_unscaled: 1.0000 (0.7154) loss_answer_attr_unscaled: 4.4138 (4.0215) accuracy_answer_attr_unscaled: 0.5000 (0.3780) loss_answer_rel_unscaled: 5.6157 (6.0363) accuracy_answer_rel_unscaled: 0.0000 (0.1911) loss_answer_global_unscaled: 0.0000 (0.3544) accuracy_answer_global_unscaled: 1.0000 (0.9268) loss_answer_cat_unscaled: 0.0000 (0.9902) accuracy_answer_cat_unscaled: 1.0000 (0.8537) accuracy_answer_total_unscaled: 0.0000 (0.0488) time: 14.4052 data: 0.0866 max mem: 0
Killed
The process was killed! So the problem is that there is not enough RAM?
This answer is also interesting, but all the files can be read (at least ls -l
shows that every file is readable), so it does not seem to be my case.
UPD2: also I tried python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --num_workers 1
and got (the traceback is truncated):
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
But when I set CUBLAS_WORKSPACE_CONFIG=:4096:8, I get the old error:
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor51
Hmm, I'm really confused... It seems I am hitting several different errors at once, which is why I get different output every time.
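The cuBLAS message asks for the variable to be set before the PyTorch application runs, so it is simplest to export it in the shell or set it at the very top of the entry script. A small sketch under that assumption; `ensure_cublas_workspace` is a hypothetical helper:

```python
import os


def ensure_cublas_workspace(env=None, value=":4096:8"):
    """Set CUBLAS_WORKSPACE_CONFIG unless a value was already chosen."""
    env = os.environ if env is None else env
    env.setdefault("CUBLAS_WORKSPACE_CONFIG", value)
    return env["CUBLAS_WORKSPACE_CONFIG"]


# Call this before `import torch` in the entry script.
ensure_cublas_workspace()
```

As reported above, this only silences the cuBLAS determinism complaint; the Indexing.cu assertion is a separate failure.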
I've just rerun the command python main.py --dataset_config configs/gqa.json --ema --epochs 125 --epoch_chunks 25 --do_qa --split_qa_heads --lr_drop 150 --load pretrained_resnet101_checkpoint.pth --batch_size 4 --no_aux_loss --qa_loss_coef 25 --lr 1.4e-4 --lr_backbone 1.4e-5 --text_encoder_lr 7e-5 --device cpu --num_workers 0
and got
Not using distributed mode
git:
sha: dda257d51a9944ee3e4201e7e52e50e5f9faec60, status: has uncommited changes, branch: main
Namespace(aux_loss=False, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='', combine_datasets=['gqa'], combine_datasets_val=['gqa'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/gqa.json', dec_layers=6, device='cpu', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, do_qa=True, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=25, epochs=125, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gqa_ann_path='mdetr_annotations/', gqa_split_type='all', hidden_dim=256, load='pretrained_resnet101_checkpoint.pth', lr=0.00014, lr_backbone=1.4e-05, lr_drop=150, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=0, optimizer='adam', output_dir='', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=25.0, remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=True, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=7e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='images/', weight_decay=0.0001, world_size=1)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/__init__.py:471: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
warnings.warn((
number of params: 185879918
loading annotations into memory...
Done (t=216.72s)
creating index...
index created!
loading annotations into memory...
Done (t=42.99s)
creating index...
index created!
Splitting the training set into {args.epoch_chunks} of size approximately 652688
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
loading from pretrained_resnet101_checkpoint.pth
Start training
Starting epoch 0, sub_epoch 0
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Epoch: [0] [ 0/163172] eta: 26 days, 3:45:42 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 553.8007 (553.8007) loss_ce: 2.4964 (2.4964) loss_bbox: 0.3178 (0.3178) loss_giou: 0.5582 (0.5582) loss_contrastive_align: 1.5884 (1.5884) loss_answer_type: 40.6496 (40.6496) loss_answer_obj: 0.0000 (0.0000) loss_answer_attr: 132.8757 (132.8757) loss_answer_rel: 204.5815 (204.5815) loss_answer_global: 0.0000 (0.0000) loss_answer_cat: 170.7331 (170.7331) loss_ce_unscaled: 2.4964 (2.4964) loss_bbox_unscaled: 0.0636 (0.0636) loss_giou_unscaled: 0.2791 (0.2791) cardinality_error_unscaled: 1.0000 (1.0000) loss_contrastive_align_unscaled: 1.5884 (1.5884) loss_answer_type_unscaled: 1.6260 (1.6260) accuracy_answer_type_unscaled: 0.0000 (0.0000) loss_answer_obj_unscaled: 0.0000 (0.0000) accuracy_answer_obj_unscaled: 1.0000 (1.0000) loss_answer_attr_unscaled: 5.3150 (5.3150) accuracy_answer_attr_unscaled: 0.0000 (0.0000) loss_answer_rel_unscaled: 8.1833 (8.1833) accuracy_answer_rel_unscaled: 0.0000 (0.0000) loss_answer_global_unscaled: 0.0000 (0.0000) accuracy_answer_global_unscaled: 1.0000 (1.0000) loss_answer_cat_unscaled: 6.8293 (6.8293) accuracy_answer_cat_unscaled: 0.0000 (0.0000) accuracy_answer_total_unscaled: 0.0000 (0.0000) time: 13.8501 data: 0.6887 max mem: 0
Epoch: [0] [ 10/163172] eta: 29 days, 3:12:39 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 406.9724 (418.6352) loss_ce: 2.1498 (2.1541) loss_bbox: 0.2972 (0.2831) loss_giou: 0.4867 (0.4779) loss_contrastive_align: 1.3373 (1.4061) loss_answer_type: 39.5703 (38.9707) loss_answer_obj: 12.8333 (16.3485) loss_answer_attr: 153.8699 (140.3258) loss_answer_rel: 187.6746 (186.9401) loss_answer_global: 0.0000 (0.0000) loss_answer_cat: 0.0000 (31.7289) loss_ce_unscaled: 2.1498 (2.1541) loss_bbox_unscaled: 0.0594 (0.0566) loss_giou_unscaled: 0.2433 (0.2390) cardinality_error_unscaled: 1.0000 (0.9091) loss_contrastive_align_unscaled: 1.3373 (1.4061) loss_answer_type_unscaled: 1.5828 (1.5588) accuracy_answer_type_unscaled: 0.2500 (0.2727) loss_answer_obj_unscaled: 0.5133 (0.6539) accuracy_answer_obj_unscaled: 1.0000 (0.7273) loss_answer_attr_unscaled: 6.1548 (5.6130) accuracy_answer_attr_unscaled: 0.0000 (0.0909) loss_answer_rel_unscaled: 7.5070 (7.4776) accuracy_answer_rel_unscaled: 0.0000 (0.0000) loss_answer_global_unscaled: 0.0000 (0.0000) accuracy_answer_global_unscaled: 1.0000 (1.0000) loss_answer_cat_unscaled: 0.0000 (1.2692) accuracy_answer_cat_unscaled: 1.0000 (0.8182) accuracy_answer_total_unscaled: 0.0000 (0.0000) time: 15.4274 data: 0.1400 max mem: 0
Epoch: [0] [ 20/163172] eta: 27 days, 14:41:18 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 376.5505 (392.1563) loss_ce: 2.1498 (2.3245) loss_bbox: 0.2712 (0.2893) loss_giou: 0.4867 (0.5187) loss_contrastive_align: 1.4424 (1.5074) loss_answer_type: 36.7327 (36.6190) loss_answer_obj: 12.8333 (15.8707) loss_answer_attr: 136.2527 (106.2132) loss_answer_rel: 183.5641 (184.4620) loss_answer_global: 0.0000 (11.7262) loss_answer_cat: 0.0000 (32.6253) loss_ce_unscaled: 2.1498 (2.3245) loss_bbox_unscaled: 0.0542 (0.0579) loss_giou_unscaled: 0.2433 (0.2594) cardinality_error_unscaled: 1.0000 (0.9286) loss_contrastive_align_unscaled: 1.4424 (1.5074) loss_answer_type_unscaled: 1.4693 (1.4648) accuracy_answer_type_unscaled: 0.5000 (0.3929) loss_answer_obj_unscaled: 0.5133 (0.6348) accuracy_answer_obj_unscaled: 1.0000 (0.6667) loss_answer_attr_unscaled: 5.4501 (4.2485) accuracy_answer_attr_unscaled: 0.0000 (0.2857) loss_answer_rel_unscaled: 7.3426 (7.3785) accuracy_answer_rel_unscaled: 0.0000 (0.0159) loss_answer_global_unscaled: 0.0000 (0.4690) accuracy_answer_global_unscaled: 1.0000 (0.9048) loss_answer_cat_unscaled: 0.0000 (1.3050) accuracy_answer_cat_unscaled: 1.0000 (0.8095) accuracy_answer_total_unscaled: 0.0000 (0.0119) time: 14.6610 data: 0.0850 max mem: 0
Epoch: [0] [ 30/163172] eta: 27 days, 8:44:25 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 323.8377 (359.1600) loss_ce: 2.3897 (2.5281) loss_bbox: 0.2712 (0.2944) loss_giou: 0.4717 (0.5140) loss_contrastive_align: 1.5241 (1.5773) loss_answer_type: 34.4717 (35.7402) loss_answer_obj: 15.2606 (14.9681) loss_answer_attr: 121.9345 (104.8293) loss_answer_rel: 166.2059 (164.8905) loss_answer_global: 0.0000 (11.7170) loss_answer_cat: 0.0000 (22.1010) loss_ce_unscaled: 2.3897 (2.5281) loss_bbox_unscaled: 0.0542 (0.0589) loss_giou_unscaled: 0.2358 (0.2570) cardinality_error_unscaled: 1.0000 (1.2016) loss_contrastive_align_unscaled: 1.5241 (1.5773) loss_answer_type_unscaled: 1.3789 (1.4296) accuracy_answer_type_unscaled: 0.5000 (0.4032) loss_answer_obj_unscaled: 0.6104 (0.5987) accuracy_answer_obj_unscaled: 1.0000 (0.6935) loss_answer_attr_unscaled: 4.8774 (4.1932) accuracy_answer_attr_unscaled: 0.5000 (0.3387) loss_answer_rel_unscaled: 6.6482 (6.5956) accuracy_answer_rel_unscaled: 0.0000 (0.1398) loss_answer_global_unscaled: 0.0000 (0.4687) accuracy_answer_global_unscaled: 1.0000 (0.9032) loss_answer_cat_unscaled: 0.0000 (0.8840) accuracy_answer_cat_unscaled: 1.0000 (0.8710) accuracy_answer_total_unscaled: 0.0000 (0.0403) time: 13.9777 data: 0.0829 max mem: 0
Epoch: [0] [ 40/163172] eta: 27 days, 14:40:18 lr: 0.000140 lr_backbone: 0.000014 lr_text_encoder: 0.000000 loss: 312.4061 (340.1148) loss_ce: 2.3961 (2.5253) loss_bbox: 0.3259 (0.3168) loss_giou: 0.4626 (0.5189) loss_contrastive_align: 1.3347 (1.5488) loss_answer_type: 35.9053 (35.5551) loss_answer_obj: 16.3921 (14.5897) loss_answer_attr: 110.3442 (100.5373) loss_answer_rel: 140.3934 (150.9077) loss_answer_global: 0.0000 (8.8592) loss_answer_cat: 0.0000 (24.7559) loss_ce_unscaled: 2.3961 (2.5253) loss_bbox_unscaled: 0.0652 (0.0634) loss_giou_unscaled: 0.2313 (0.2595) cardinality_error_unscaled: 1.0000 (1.1037) loss_contrastive_align_unscaled: 1.3347 (1.5488) loss_answer_type_unscaled: 1.4362 (1.4222) accuracy_answer_type_unscaled: 0.5000 (0.3902) loss_answer_obj_unscaled: 0.6557 (0.5836) accuracy_answer_obj_unscaled: 1.0000 (0.7154) loss_answer_attr_unscaled: 4.4138 (4.0215) accuracy_answer_attr_unscaled: 0.5000 (0.3780) loss_answer_rel_unscaled: 5.6157 (6.0363) accuracy_answer_rel_unscaled: 0.0000 (0.1911) loss_answer_global_unscaled: 0.0000 (0.3544) accuracy_answer_global_unscaled: 1.0000 (0.9268) loss_answer_cat_unscaled: 0.0000 (0.9902) accuracy_answer_cat_unscaled: 1.0000 (0.8537) accuracy_answer_total_unscaled: 0.0000 (0.0488) time: 14.6253 data: 0.0905 max mem: 0
Traceback (most recent call last):
File "main.py", line 643, in <module>
main(args)
File "main.py", line 546, in main
train_stats = train_one_epoch(
File "/home/pchelintsev/MDETR/mdetr/engine.py", line 100, in train_one_epoch
losses.backward()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 244470272 bytes. Error code 12 (Cannot allocate memory)
And indeed the RAM ran out (I watched it with free -m
), so 64 GB is not enough. But why do I hit these strange errors even before Epoch: [0] [ 40/163172]
when I use the GPU?
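For a sense of scale, the model state on its own is small next to 64 GB. A back-of-the-envelope estimate, assuming fp32 parameters and counting weights, gradients, two Adam moment buffers, and one EMA copy (activations and the annotations held in memory are deliberately left out):

```python
NUM_PARAMS = 185_879_918  # "number of params" from the log above
BYTES_PER_PARAM = 4       # assumption: fp32
COPIES = 5                # weights + grads + 2 Adam moment buffers + EMA copy


def model_state_gib(num_params=NUM_PARAMS, bytes_per_param=BYTES_PER_PARAM, copies=COPIES):
    """Approximate memory held by the model and optimizer state, in GiB."""
    return num_params * bytes_per_param * copies / 2**30


print(f"{model_state_gib():.2f} GiB")
```

That comes to roughly 3.5 GiB, so if these assumptions hold, most of the memory is going elsewhere, possibly to the annotation files that took several minutes to load.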
I also tried to run evaluation on CLEVR, which is much smaller, using the command from the guide
python main.py --batch_size 64 --dataset_config configs/clevr.json --num_queries 25 --text_encoder_type distilroberta-base --backbone resnet18 --resume https://zenodo.org/record/4721981/files/clevr_checkpoint.pth --eval
and again got
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor1421
But I have 64 GB of RAM and a 32 GB GPU, so the whole dataset and the models should fit easily. This does not look like a resource problem... And when I run on CPU, it seems to work, consuming only about 11 GB of RAM.
Try setting this line to false: https://github.com/ashkamath/mdetr/blob/0b747b99e2995c3c429f1391cb8e6104eaec7f21/main.py#L309
Also, could you paste the output of python -m torch.utils.collect_env ?
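The deprecation warning in the logs above names both torch.set_deterministic and torch.use_deterministic_algorithms. A version-tolerant way to flip the flag might look like the sketch below; `set_determinism` is a hypothetical helper, and the torch module is passed in as an argument so the sketch does not assume a particular PyTorch version:

```python
def set_determinism(torch_module, enabled):
    """Toggle deterministic algorithms using whichever API this build exposes."""
    if hasattr(torch_module, "use_deterministic_algorithms"):
        torch_module.use_deterministic_algorithms(enabled)
        return "use_deterministic_algorithms"
    # Older releases only expose the since-deprecated setter.
    torch_module.set_deterministic(enabled)
    return "set_deterministic"
```

Usage would be `import torch; set_determinism(torch, False)` in place of the hard-coded call in main.py.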
@alcinos, thank you for your support!
1) I changed the line to torch.set_deterministic(False)
and the evaluation on CLEVR ran successfully!
Accumulating evaluation results...
DONE (t=106.84s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.828
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.990
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.988
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.718
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.830
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.907
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.373
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.872
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.872
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.786
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.874
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.934
{'test_clevr_loss': 43.043012584842515, 'test_clevr_loss_ce': 5.731913995213883, 'test_clevr_loss_bbox': 0.05494903914992771, 'test_clevr_loss_giou': 0.18080522204549035, 'test_clevr_loss_contrastive_align': 0.739596603851473, 'test_clevr_loss_ce_0': 6.165245272397181, 'test_clevr_loss_bbox_0': 0.06370836580744324, 'test_clevr_loss_giou_0': 0.21175039811345905, 'test_clevr_loss_contrastive_align_0': 1.6258955893866438, 'test_clevr_loss_ce_1': 5.9863008091881005, 'test_clevr_loss_bbox_1': 0.06160141016855683, 'test_clevr_loss_giou_1': 0.20656590048007592, 'test_clevr_loss_contrastive_align_1': 1.2556593710804556, 'test_clevr_loss_ce_2': 5.867624813379281, 'test_clevr_loss_bbox_2': 0.059281312062726084, 'test_clevr_loss_giou_2': 0.19940767474091095, 'test_clevr_loss_contrastive_align_2': 1.0192198974077205, 'test_clevr_loss_ce_3': 5.771324536906167, 'test_clevr_loss_bbox_3': 0.05730514128536698, 'test_clevr_loss_giou_3': 0.194069205697486, 'test_clevr_loss_contrastive_align_3': 0.817971428532039, 'test_clevr_loss_ce_4': 5.741090293833827, 'test_clevr_loss_bbox_4': 0.05582752994519147, 'test_clevr_loss_giou_4': 0.18599585877292799, 'test_clevr_loss_contrastive_align_4': 0.7558641342681423, 'test_clevr_loss_answer_type': 3.604563896833151e-06, 'test_clevr_loss_answer_binary': 0.00632767150950992, 'test_clevr_loss_answer_reg': 0.025332900922053852, 'test_clevr_loss_answer_attr': 0.0023749297052867175, 'test_clevr_loss_ce_unscaled': 5.731913995213883, 'test_clevr_loss_bbox_unscaled': 0.010989807827442681, 'test_clevr_loss_giou_unscaled': 0.09040261102274517, 'test_clevr_cardinality_error_unscaled': 0.01639825085324232, 'test_clevr_loss_contrastive_align_unscaled': 0.739596603851473, 'test_clevr_loss_ce_0_unscaled': 6.165245272397181, 'test_clevr_loss_bbox_0_unscaled': 0.012741673153303818, 'test_clevr_loss_giou_0_unscaled': 0.10587519905672953, 'test_clevr_cardinality_error_0_unscaled': 1.9568324677976732, 'test_clevr_loss_contrastive_align_0_unscaled': 
1.6258955893866438, 'test_clevr_loss_ce_1_unscaled': 5.9863008091881005, 'test_clevr_loss_bbox_1_unscaled': 0.012320282033075652, 'test_clevr_loss_giou_1_unscaled': 0.10328295024003796, 'test_clevr_cardinality_error_1_unscaled': 1.115309238815267, 'test_clevr_loss_contrastive_align_1_unscaled': 1.2556593710804556, 'test_clevr_loss_ce_2_unscaled': 5.867624813379281, 'test_clevr_loss_bbox_2_unscaled': 0.011856262410399679, 'test_clevr_loss_giou_2_unscaled': 0.09970383737045548, 'test_clevr_cardinality_error_2_unscaled': 0.5807346361705162, 'test_clevr_loss_contrastive_align_2_unscaled': 1.0192198974077205, 'test_clevr_loss_ce_3_unscaled': 5.771324536906167, 'test_clevr_loss_bbox_3_unscaled': 0.01146102826025197, 'test_clevr_loss_giou_3_unscaled': 0.097034602848743, 'test_clevr_cardinality_error_3_unscaled': 0.1629644974624363, 'test_clevr_loss_contrastive_align_3_unscaled': 0.817971428532039, 'test_clevr_loss_ce_4_unscaled': 5.741090293833827, 'test_clevr_loss_bbox_4_unscaled': 0.01116550598355525, 'test_clevr_loss_giou_4_unscaled': 0.09299792938646399, 'test_clevr_cardinality_error_4_unscaled': 0.04800581955033078, 'test_clevr_loss_contrastive_align_4_unscaled': 0.7558641342681423, 'test_clevr_loss_answer_type_unscaled': 3.604563896833151e-06, 'test_clevr_accuracy_answer_type_unscaled': 1.0, 'test_clevr_loss_answer_binary_unscaled': 0.00632767150950992, 'test_clevr_accuracy_answer_binary_unscaled': 0.9983339897855964, 'test_clevr_loss_answer_reg_unscaled': 0.025332900922053852, 'test_clevr_accuracy_answer_reg_unscaled': 0.9925868332182588, 'test_clevr_loss_answer_attr_unscaled': 0.0023749297052867175, 'test_clevr_accuracy_answer_attr_unscaled': 0.9996434087283376, 'test_clevr_accuracy_answer_total_unscaled': 0.9974136092150171, 'test_clevr_coco_eval_bbox': [0.8280577103267962, 0.9900612713060782, 0.9877238191805104, 0.7180399374430324, 0.8298110005371897, 0.9071876701127315, 0.37329648287548844, 0.8716828824051552, 0.8719843337521758, 0.7862159823816266, 
0.8740682349128326, 0.9336712955657251], 'n_parameters': 111200939}
2) Here is the output of python -m torch.utils.collect_env:
Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27
Python version: 3.8 (64-bit runtime)
Python platform: Linux-4.15.0-156-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.20.3 py38hf144106_0
[conda] numpy-base 1.20.3 py38h74d4b33_0
[conda] torch 1.9.0 pypi_0 pypi
[conda] torchvision 0.10.0 pypi_0 pypi
I'll also try VQA2 again a bit later, but this gave me hope! The only thing that confuses me is running with run_with_submitit.py rather than main.py. But let's try :)
UPD: as for VQA2, I had a normal error:
submitit ERROR (2021-09-20 22:20:53,411) - Submitted job triggered an exception
Traceback (most recent call last):
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
process_job(args.folder)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
raise error
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
result = delayed.result()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
self._result = self.function(*self.args, **self.kwargs)
File "run_with_submitit.py", line 98, in __call__
detection.main(self.args)
File "/home/pchelintsev/MDETR/mdetr/main.py", line 546, in main
train_stats = train_one_epoch(
File "/home/pchelintsev/MDETR/mdetr/engine.py", line 54, in train_one_epoch
for i, batch_dict in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
File "/home/pchelintsev/MDETR/mdetr/util/metrics.py", line 133, in log_every
for obj in iterable:
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
return self._get_iterator()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
w.start()
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/pchelintsev/anaconda3/envs/mdetr_env/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
It seems 64 GB of RAM isn't enough. How much did you use?
use nn.SyncBatchNorm instead of nn.BatchNormxD in DDP
Sorry, I haven't understood, @4-0-4-notfound.(( Could you provide more information? Where should I change the BatchNorm and why?
In my case, it was the DDP bug with broadcast_buffers: the original BN layers have buffers that DDP broadcasts. So I needed to replace the original BN layers with SyncBatchNorm to work around the broadcast_buffers bug in DDP. https://github.com/pytorch/pytorch/issues/22095#issuecomment-941522465
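For anyone wondering where the change goes, here is a minimal sketch of the BatchNorm swap suggested above. It assumes the swap is done with PyTorch's built-in converter before the model is wrapped in DistributedDataParallel. The toy model is hypothetical and only stands in for the real network; it is not MDETR code.

```python
import torch.nn as nn

# Toy model standing in for the real network (hypothetical, not MDETR code).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Recursively replace every BatchNorm*D layer with SyncBatchNorm.
# This should happen before the model is wrapped in DistributedDataParallel.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Count what is left of each layer type after the conversion.
bn_layers = [m for m in sync_model.modules() if isinstance(m, nn.BatchNorm2d)]
sync_layers = [m for m in sync_model.modules() if isinstance(m, nn.SyncBatchNorm)]
print(len(bn_layers), len(sync_layers))
```

The conversion itself runs without an initialized process group, so it can be checked on a single machine. DistributedDataParallel also accepts broadcast_buffers=False, which disables the buffer broadcast entirely; whether that is appropriate for this model is untested here.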
This is only a fix for the error the original poster reported, i.e.
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 10]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
It may not work in your case.
Hello,
Thanks for open sourcing!
I am trying to run distributed training for pretraining. Without distributed training, it works fine.
With PyTorch versions 1.7.0, 1.7.1 and 1.8.0 I get the error below. Version 1.9 instead fails with ImportError: cannot import name '_new_empty_tensor' from 'torchvision.ops' (/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/ops/__init__.py)
I tried changing this line to losses.backward(retain_graph=True), but it did not fix the problem. Let me know if you have any suggestions on how to address this issue.
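A possible explanation for the ImportError on version 1.9, offered as an assumption rather than a confirmed diagnosis: PyTorch 1.9.0 is paired with torchvision 0.10.0, and DETR-derived code often gates the `_new_empty_tensor` import on a check like `float(torchvision.__version__[:3]) < 0.7`. If this repo does the same, the string slice turns "0.10.0" into 0.1, so the legacy import is attempted on a torchvision release that no longer has it. The sketch below shows the failure mode and a tuple-based comparison; `version_tuple` and `needs_legacy_import` are hypothetical helpers, not part of the repo.

```python
import re


def version_tuple(version: str) -> tuple:
    """Parse the leading 'major.minor' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_legacy_import(version: str) -> bool:
    # Hypothetical helper: True only for torchvision releases older than 0.7.
    return version_tuple(version) < (0, 7)


# The string-slice check misreads 0.10.0 as 0.1, which is "older" than 0.7.
naive_check = float("0.10.0"[:3]) < 0.7
# Comparing integer tuples treats 0.10 as newer than 0.7.
robust_check = needs_legacy_import("0.10.0+cu102")
print(naive_check, robust_check)
```

In the real code the version string would come from `torchvision.__version__`; the literals here just keep the sketch runnable without torchvision installed.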