Kyfafyd opened this issue 2 years ago (status: Open)
Sorry, but we have never encountered this error. It indicates that the predicted boxes have a negative width or height, which should not happen. The predicted (cx, cy, w, h) are passed through a sigmoid, so all w and h values should be in the range [0, 1].
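To illustrate the reasoning above, here is a minimal sketch in plain Python (not the repository's tensor code). It applies a sigmoid to a few arbitrary raw values, converts the result from (cx, cy, w, h) to corner format, and checks the same condition as the assertion in `generalized_box_iou`. The helper `is_valid_xyxy` and the sample values are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; the output is strictly between 0 and 1 for moderate inputs."""
    return 1.0 / (1.0 + math.exp(-x))


def box_cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def is_valid_xyxy(box) -> bool:
    """Same condition as the failing assertion: x1 >= x0 and y1 >= y0."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


# Arbitrary raw network outputs (illustrative values only).
raw_logits = [
    (-3.0, 0.5, 2.0, -4.0),
    (10.0, -10.0, 0.0, 0.0),
    (-30.0, 30.0, -30.0, 30.0),
]

# After the sigmoid, w and h are non-negative, so the corners are ordered.
pred_boxes = [tuple(sigmoid(v) for v in logits) for logits in raw_logits]
xyxy_boxes = [box_cxcywh_to_xyxy(b) for b in pred_boxes]
all_valid = all(is_valid_xyxy(b) for b in xyxy_boxes)
print(all_valid)
```

With finite inputs the check passes for every box, which is why a failure here points to something other than an ordinary negative size.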
I encountered the same issue and worked around it by using only 4 GPUs. It may be caused by AMP or by an internal bug in distributed training.
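Since the sigmoid rules out negative sizes, one thing that may be worth checking (an assumption, not something confirmed in this thread) is whether the predictions contain NaN or Inf, for example after the loss diverges. Any comparison with NaN is false, so such a box would fail the same assertion without having a negative width or height. Below is a minimal plain-Python sketch of a diagnostic; `find_bad_boxes` is a hypothetical helper, not part of the repository.

```python
import math


def box_cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def find_bad_boxes(boxes):
    """Return (index, reason) for each box that would fail the corner-order check.

    Hypothetical helper for debugging; it separates non-finite values from
    genuinely negative widths or heights.
    """
    bad = []
    for i, box in enumerate(boxes):
        if not all(math.isfinite(v) for v in box):
            bad.append((i, "non-finite value"))
            continue
        x0, y0, x1, y1 = box_cxcywh_to_xyxy(box)
        if not (x1 >= x0 and y1 >= y0):
            bad.append((i, "negative width or height"))
    return bad


nan = float("nan")

# NaN never compares as greater-or-equal, not even to itself.
nan_passes_check = nan >= nan

# Illustrative boxes: one normal, one all-NaN, one with a negative width.
boxes = [
    (0.5, 0.5, 0.2, 0.3),
    (nan, nan, nan, nan),
    (0.5, 0.5, -0.1, 0.2),
]
report = find_bad_boxes(boxes)
print(nan_passes_check, report)
```

Running a check like this on the model outputs just before the matcher would show whether the failure comes from non-finite predictions or from actual box geometry.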
Instructions To Reproduce the 🐛 Bug:
Command run:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path ../data/COCO2017 --output_dir output/conddetr_r50_epoch50
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='../data/COCO2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, nheads=8, num_queries=300, num_workers=2, output_dir='output/conddetr_r50_epoch50', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
number of params: 43196001
loading annotations into memory...
Done (t=20.78s)
creating index...
index created!
loading annotations into memory...
Done (t=0.56s)
creating index...
index created!
Start training
Epoch: [0] [ 0/7393] eta: 7:05:21 lr: 0.000100 class_error: 85.57 loss: 45.1821 (45.1821) loss_bbox: 3.7751 (3.7751) loss_bbox_0: 3.7823 (3.7823) loss_bbox_1: 3.7808 (3.7808) loss_bbox_2: 3.7756 (3.7756) loss_bbox_3: 3.7911 (3.7911) loss_bbox_4: 3.7856 (3.7856) loss_ce: 1.9574 (1.9574) loss_ce_0: 2.0151 (2.0151) loss_ce_1: 2.0196 (2.0196) loss_ce_2: 2.1484 (2.1484) loss_ce_3: 2.0683 (2.0683) loss_ce_4: 2.0683 (2.0683) loss_giou: 1.7011 (1.7011) loss_giou_0: 1.7000 (1.7000) loss_giou_1: 1.7040 (1.7040) loss_giou_2: 1.7059 (1.7059) loss_giou_3: 1.7022 (1.7022) loss_giou_4: 1.7012 (1.7012) cardinality_error_unscaled: 293.1250 (293.1250) cardinality_error_0_unscaled: 293.1250 (293.1250) cardinality_error_1_unscaled: 293.1250 (293.1250) cardinality_error_2_unscaled: 281.9375 (281.9375) cardinality_error_3_unscaled: 293.1250 (293.1250) cardinality_error_4_unscaled: 293.1250 (293.1250) class_error_unscaled: 85.5712 (85.5712) loss_bbox_unscaled: 0.7550 (0.7550) loss_bbox_0_unscaled: 0.7565 (0.7565) loss_bbox_1_unscaled: 0.7562 (0.7562) loss_bbox_2_unscaled: 0.7551 (0.7551) loss_bbox_3_unscaled: 0.7582 (0.7582) loss_bbox_4_unscaled: 0.7571 (0.7571) loss_ce_unscaled: 0.9787 (0.9787) loss_ce_0_unscaled: 1.0076 (1.0076) loss_ce_1_unscaled: 1.0098 (1.0098) loss_ce_2_unscaled: 1.0742 (1.0742) loss_ce_3_unscaled: 1.0341 (1.0341) loss_ce_4_unscaled: 1.0342 (1.0342) loss_giou_unscaled: 0.8506 (0.8506) loss_giou_0_unscaled: 0.8500 (0.8500) loss_giou_1_unscaled: 0.8520 (0.8520) loss_giou_2_unscaled: 0.8530 (0.8530) loss_giou_3_unscaled: 0.8511 (0.8511) loss_giou_4_unscaled: 0.8506 (0.8506) time: 3.4521 data: 0.4687 max mem: 2932
Epoch: [0] [ 100/7393] eta: 1:17:39 lr: 0.000100 class_error: 85.74 loss: 28.2629 (33.7855) loss_bbox: 1.5517 (2.3437) loss_bbox_0: 1.5566 (2.3695) loss_bbox_1: 1.5482 (2.3519) loss_bbox_2: 1.5535 (2.3396) loss_bbox_3: 1.5641 (2.3476) loss_bbox_4: 1.5637 (2.3431) loss_ce: 1.5467 (1.6584) loss_ce_0: 1.5650 (1.6414) loss_ce_1: 1.5443 (1.6461) loss_ce_2: 1.5557 (1.6477) loss_ce_3: 1.5392 (1.6545) loss_ce_4: 1.5541 (1.6667) loss_giou: 1.5534 (1.6289) loss_giou_0: 1.5514 (1.6296) loss_giou_1: 1.5541 (1.6292) loss_giou_2: 1.5695 (1.6291) loss_giou_3: 1.5526 (1.6289) loss_giou_4: 1.5519 (1.6296) cardinality_error_unscaled: 293.1875 (293.2420) cardinality_error_0_unscaled: 293.1875 (293.2420) cardinality_error_1_unscaled: 293.1875 (293.2420) cardinality_error_2_unscaled: 293.1875 (293.1312) cardinality_error_3_unscaled: 293.1875 (293.2420) cardinality_error_4_unscaled: 293.1875 (293.1658) class_error_unscaled: 75.6680 (75.4478) loss_bbox_unscaled: 0.3103 (0.4687) loss_bbox_0_unscaled: 0.3113 (0.4739) loss_bbox_1_unscaled: 0.3096 (0.4704) loss_bbox_2_unscaled: 0.3107 (0.4679) loss_bbox_3_unscaled: 0.3128 (0.4695) loss_bbox_4_unscaled: 0.3127 (0.4686) loss_ce_unscaled: 0.7733 (0.8292) loss_ce_0_unscaled: 0.7825 (0.8207) loss_ce_1_unscaled: 0.7722 (0.8231) loss_ce_2_unscaled: 0.7779 (0.8239) loss_ce_3_unscaled: 0.7696 (0.8272) loss_ce_4_unscaled: 0.7770 (0.8334) loss_giou_unscaled: 0.7767 (0.8145) loss_giou_0_unscaled: 0.7757 (0.8148) loss_giou_1_unscaled: 0.7771 (0.8146) loss_giou_2_unscaled: 0.7847 (0.8146) loss_giou_3_unscaled: 0.7763 (0.8144) loss_giou_4_unscaled: 0.7760 (0.8148) time: 0.6098 data: 0.0105 max mem: 4353
Traceback (most recent call last):
File "main.py", line 258, in <module>
main(args)
File "main.py", line 206, in main
train_stats = train_one_epoch(
File "/research/d4/gds/zwang21/ConditionalDETR/engine.py", line 41, in train_one_epoch
loss_dict = criterion(outputs, targets)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/research/d4/gds/zwang21/ConditionalDETR/models/conditional_detr.py", line 254, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/research/d4/gds/zwang21/ConditionalDETR/models/matcher.py", line 79, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/research/d4/gds/zwang21/ConditionalDETR/util/box_ops.py", line 59, in generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Traceback (most recent call last):
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/research/d4/gds/zwang21/anaconda3/bin/python', '-u', 'main.py', '--coco_path', '../data/COCO2017', '--output_dir', 'output/conddetr_r50_epoch50']' returned non-zero exit status 1.
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Killing subprocess 29668
Killing subprocess 29669
Killing subprocess 29670
Killing subprocess 29671
Killing subprocess 29672
Killing subprocess 29673
Killing subprocess 29674
Killing subprocess 29675
Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] numpydoc==1.1.0
[pip3] pytorch-ignite==0.2.0
[pip3] pytorch-metric-learning==0.9.99
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchfile==0.1.0
[pip3] torchsampler==0.1.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.22.2 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch 1.8.0 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-ignite 0.2.0 pypi_0 pypi
[conda] pytorch-metric-learning 0.9.99 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.10.0 pypi_0 pypi
[conda] torchaudio 0.8.0 py38 pytorch
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchsampler 0.1.1 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.9.0 py38_cu102 pytorch