Training Error assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Kyfafyd commented 2 years ago

Instructions To Reproduce the 🐛 Bug:

what changes you made (git diff) or what code you wrote
```
Nothing changed
```
what exact command you run: python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path ../data/COCO2017 --output_dir output/conddetr_r50_epoch50

what you observed (including full logs):


| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 4): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 1): env://
| distributed init (rank 7): env://
| distributed init (rank 6): env://
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
git:
sha: N/A, status: clean, branch: N/A

fatal: Not a git repository (or any parent up to mount point /research/d4)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='../data/COCO2017', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, nheads=8, num_queries=300, num_workers=2, output_dir='output/conddetr_r50_epoch50', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
number of params: 43196001
loading annotations into memory...
Done (t=20.78s)
creating index...
index created!
loading annotations into memory...
Done (t=0.56s)
creating index...
index created!
Start training
Epoch: [0] [ 0/7393] eta: 7:05:21 lr: 0.000100 class_error: 85.57 loss: 45.1821 (45.1821) loss_bbox: 3.7751 (3.7751) loss_bbox_0: 3.7823 (3.7823) loss_bbox_1: 3.7808 (3.7808) loss_bbox_2: 3.7756 (3.7756) loss_bbox_3: 3.7911 (3.7911) loss_bbox_4: 3.7856 (3.7856) loss_ce: 1.9574 (1.9574) loss_ce_0: 2.0151 (2.0151) loss_ce_1: 2.0196 (2.0196) loss_ce_2: 2.1484 (2.1484) loss_ce_3: 2.0683 (2.0683) loss_ce_4: 2.0683 (2.0683) loss_giou: 1.7011 (1.7011) loss_giou_0: 1.7000 (1.7000) loss_giou_1: 1.7040 (1.7040) loss_giou_2: 1.7059 (1.7059) loss_giou_3: 1.7022 (1.7022) loss_giou_4: 1.7012 (1.7012) cardinality_error_unscaled: 293.1250 (293.1250) cardinality_error_0_unscaled: 293.1250 (293.1250) cardinality_error_1_unscaled: 293.1250 (293.1250) cardinality_error_2_unscaled: 281.9375 (281.9375) cardinality_error_3_unscaled: 293.1250 (293.1250) cardinality_error_4_unscaled: 293.1250 (293.1250) class_error_unscaled: 85.5712 (85.5712) loss_bbox_unscaled: 0.7550 (0.7550) loss_bbox_0_unscaled: 0.7565 (0.7565) loss_bbox_1_unscaled: 0.7562 (0.7562) loss_bbox_2_unscaled: 0.7551 (0.7551) loss_bbox_3_unscaled: 0.7582 (0.7582) loss_bbox_4_unscaled: 0.7571 (0.7571) loss_ce_unscaled: 0.9787 (0.9787) loss_ce_0_unscaled: 1.0076 (1.0076) loss_ce_1_unscaled: 1.0098 (1.0098) loss_ce_2_unscaled: 1.0742 (1.0742) loss_ce_3_unscaled: 1.0341 (1.0341) loss_ce_4_unscaled: 1.0342 (1.0342) loss_giou_unscaled: 0.8506 (0.8506) loss_giou_0_unscaled: 0.8500 (0.8500) loss_giou_1_unscaled: 0.8520 (0.8520) loss_giou_2_unscaled: 0.8530 (0.8530) loss_giou_3_unscaled: 0.8511 (0.8511) loss_giou_4_unscaled: 0.8506 (0.8506) time: 3.4521 data: 0.4687 max mem: 2932
Epoch: [0] [ 100/7393] eta: 1:17:39 lr: 0.000100 class_error: 85.74 loss: 28.2629 (33.7855) loss_bbox: 1.5517 (2.3437) loss_bbox_0: 1.5566 (2.3695) loss_bbox_1: 1.5482 (2.3519) loss_bbox_2: 1.5535 (2.3396) loss_bbox_3: 1.5641 (2.3476) loss_bbox_4: 1.5637 (2.3431) loss_ce: 1.5467 (1.6584) loss_ce_0: 1.5650 (1.6414) loss_ce_1: 1.5443 (1.6461) loss_ce_2: 1.5557 (1.6477) loss_ce_3: 1.5392 (1.6545) loss_ce_4: 1.5541 (1.6667) loss_giou: 1.5534 (1.6289) loss_giou_0: 1.5514 (1.6296) loss_giou_1: 1.5541 (1.6292) loss_giou_2: 1.5695 (1.6291) loss_giou_3: 1.5526 (1.6289) loss_giou_4: 1.5519 (1.6296) cardinality_error_unscaled: 293.1875 (293.2420) cardinality_error_0_unscaled: 293.1875 (293.2420) cardinality_error_1_unscaled: 293.1875 (293.2420) cardinality_error_2_unscaled: 293.1875 (293.1312) cardinality_error_3_unscaled: 293.1875 (293.2420) cardinality_error_4_unscaled: 293.1875 (293.1658) class_error_unscaled: 75.6680 (75.4478) loss_bbox_unscaled: 0.3103 (0.4687) loss_bbox_0_unscaled: 0.3113 (0.4739) loss_bbox_1_unscaled: 0.3096 (0.4704) loss_bbox_2_unscaled: 0.3107 (0.4679) loss_bbox_3_unscaled: 0.3128 (0.4695) loss_bbox_4_unscaled: 0.3127 (0.4686) loss_ce_unscaled: 0.7733 (0.8292) loss_ce_0_unscaled: 0.7825 (0.8207) loss_ce_1_unscaled: 0.7722 (0.8231) loss_ce_2_unscaled: 0.7779 (0.8239) loss_ce_3_unscaled: 0.7696 (0.8272) loss_ce_4_unscaled: 0.7770 (0.8334) loss_giou_unscaled: 0.7767 (0.8145) loss_giou_0_unscaled: 0.7757 (0.8148) loss_giou_1_unscaled: 0.7771 (0.8146) loss_giou_2_unscaled: 0.7847 (0.8146) loss_giou_3_unscaled: 0.7763 (0.8144) loss_giou_4_unscaled: 0.7760 (0.8148) time: 0.6098 data: 0.0105 max mem: 4353
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main(args)
  File "main.py", line 206, in main
    train_stats = train_one_epoch(
  File "/research/d4/gds/zwang21/ConditionalDETR/engine.py", line 41, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/conditional_detr.py", line 254, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/research/d4/gds/zwang21/ConditionalDETR/models/matcher.py", line 79, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/research/d4/gds/zwang21/ConditionalDETR/util/box_ops.py", line 59, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Traceback (most recent call last):
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/research/d4/gds/zwang21/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/research/d4/gds/zwang21/anaconda3/bin/python', '-u', 'main.py', '--coco_path', '../data/COCO2017', '--output_dir', 'output/conddetr_r50_epoch50']' returned non-zero exit status 1.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Killing subprocess 29668
Killing subprocess 29669
Killing subprocess 29670
Killing subprocess 29671
Killing subprocess 29672
Killing subprocess 29673
Killing subprocess 29674
Killing subprocess 29675

4. please simplify the steps as much as possible so they do not require additional resources to
     run, such as a private dataset.

## Expected behavior:

If there is no obvious error in "what you observed" provided above,
please tell us the expected behavior.

## Environment:

Provide your environment information using the following command:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] numpydoc==1.1.0
[pip3] pytorch-ignite==0.2.0
[pip3] pytorch-metric-learning==0.9.99
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchfile==0.1.0
[pip3] torchsampler==0.1.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.22.2 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch 1.8.0 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-ignite 0.2.0 pypi_0 pypi
[conda] pytorch-metric-learning 0.9.99 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.10.0 pypi_0 pypi
[conda] torchaudio 0.8.0 py38 pytorch
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchsampler 0.1.1 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.9.0 py38_cu102 pytorch

DeppMeng commented 2 years ago

Sorry, but we have never encountered this error. It indicates that the predicted boxes have a negative width or height, which should not happen: the predicted (cx, cy, w, h) are passed through a sigmoid, so w and h should always lie in the range [0, 1].
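To make the reasoning above concrete, here is a minimal pure-Python sketch of the check, written without PyTorch so it runs anywhere. The `box_cxcywh_to_xyxy` below is a scalar stand-in named after the function in the traceback, not the repository's tensor implementation. `is_valid_xyxy` is a hypothetical helper mirroring the assertion. The sketch also illustrates one possibility that is not confirmed anywhere in this thread: if the model output contains NaN, every `>=` comparison evaluates to false, so the assertion fails even though a sigmoid can never produce a negative width or height.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def box_cxcywh_to_xyxy(box):
    # Scalar stand-in for the tensor function named in the traceback.
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def is_valid_xyxy(box) -> bool:
    # Hypothetical helper: same condition as the failing assertion,
    # i.e. bottom-right corner >= top-left corner.
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


# Finite logits pass through the sigmoid, so w and h are positive
# and the converted box satisfies the condition.
finite_box = tuple(sigmoid(v) for v in (-3.0, 0.5, 2.0, -1.0))
valid_finite = is_valid_xyxy(box_cxcywh_to_xyxy(finite_box))

# A NaN box (assumed here for illustration) fails the same condition,
# because any comparison involving NaN is false.
nan_box = (float("nan"),) * 4
valid_nan = is_valid_xyxy(box_cxcywh_to_xyxy(nan_box))

print(valid_finite, valid_nan)
```

If NaN values are the cause here, the underlying problem would lie upstream of the matcher (for example in the loss or the optimizer step) rather than in the box conversion itself.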

RicePasteM commented 4 days ago

I encountered the same issue and solved it by using only 4 GPUs. It may be caused by AMP or by an internal bug in distributed training.
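Before attributing the failure to the GPU count, it may help to see what the offending boxes actually look like. The sketch below is a hypothetical diagnostic, not part of the repository: `describe_bad_boxes` is an illustrative name, and it takes plain Python tuples in (x0, y0, x1, y1) form, so tensor outputs would need converting first (for example with `.tolist()`). It separates non-finite rows from rows with genuinely negative size.

```python
import math


def describe_bad_boxes(boxes_xyxy):
    """Return (row_index, reason) for each box that is non-finite or has negative size."""
    problems = []
    for i, (x0, y0, x1, y1) in enumerate(boxes_xyxy):
        if not all(math.isfinite(v) for v in (x0, y0, x1, y1)):
            problems.append((i, "non-finite"))
        elif x1 < x0 or y1 < y0:
            problems.append((i, "negative size"))
    return problems


# Illustrative data only; these values do not come from the logs above.
sample = [
    (0.1, 0.1, 0.5, 0.5),           # valid
    (float("nan"), 0.2, 0.4, 0.6),  # non-finite
    (0.6, 0.2, 0.4, 0.6),           # x1 < x0
]
report = describe_bad_boxes(sample)
print(report)
```

Calling such a helper just before the matcher, on the rank that crashes, would show whether the boxes are NaN or actually inverted.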

Atten4Vis / ConditionalDETR

Training Error assert (boxes1[:, 2:] >= boxes1[:, :2]).all() #17
