fundamentalvision / Deformable-DETR

Deformable DETR: Deformable Transformers for End-to-End Object Detection.
Apache License 2.0

error when changing num_classes #185

Open varagantis opened 1 year ago

varagantis commented 1 year ago

Hello, I am facing the following error when I try to train the model on a custom dataset that has 5 classes. I know the error below is most likely caused by the change in num_classes, but I am not sure what the correct fix is:

```
Traceback (most recent call last):
  File "main.py", line 326, in <module>
    main(args)
  File "main.py", line 275, in main
    train_stats = train_one_epoch(
  File "/home/vsrikar/engine.py", line 43, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vsrikar/models/deformable_detr.py", line 342, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vsrikar/models/matcher.py", line 87, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox),
  File "/home/vsrikar/util/box_ops.py", line 59, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f89f1df01ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x26e61 (0x7f89f1e6ae61 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f89f1e6fdb7 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x466858 (0x7f89f641c858 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f89f1dd77a5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x362735 (0x7f89f6318735 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x67c6c8 (0x7f89f66326c8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f89f6632a95 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x5d1908]
frame #9: python() [0x5a978d]
frame #10: python() [0x5ecd90]
frame #11: python() [0x5447b8]
frame #12: python() [0x54480a]
frame #13: python() [0x54480a]

frame #19: __libc_start_main + 0xf3 (0x7f89fa857083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
./configs/r50_deformable_detr.sh: line 10:   997 Aborted (core dumped) python -u main.py --output_dir ${EXP_DIR} ${PY_ARGS}
Traceback (most recent call last):
  File "./tools/launch.py", line 192, in <module>
    main()
  File "./tools/launch.py", line 187, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['./configs/r50_deformable_detr.sh']' returned non-zero exit status 134.
```

I changed the following code snippet in deformable_detr.py:

```python
def build(args):
    num_classes = 5 if args.dataset_file != 'coco' else 91
```

When I change num_classes back to 20, training works fine. Please suggest how to handle this issue.
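For context, num_classes in this codebase has to be strictly greater than the largest label id the dataset emits (COCO uses 91 because its category ids run up to 90). Below is a minimal sketch of the build change, assuming a custom dataset whose annotations keep COCO-style ids 1..5; the dataset flag name is hypothetical:

```python
# models/deformable_detr.py -- sketch only, assuming COCO-style ids 1..5.
# num_classes sizes the classifier output, so every label id must satisfy
# 0 <= id < num_classes. With ids 1..5, num_classes = 5 likely makes label 5
# index out of bounds and can trigger the device-side assert seen above.
def build(args):
    num_classes = 20 if args.dataset_file != 'coco' else 91
    if args.dataset_file == 'custom_5class':  # hypothetical dataset flag
        num_classes = 6  # max id (5) + 1; index 0 simply stays unused
    # ... rest of build() unchanged ...
```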
ilmaster commented 1 year ago

It's a problem with the COCO format used by pycocotools.

COCO assigns class IDs from 1 to N.

Deformable DETR, however, expects class IDs from 0 to N-1, which is why the mismatch shows up once you change num_classes.
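In other words, if the annotations keep 1-based COCO ids, one option is to remap them to a contiguous 0-based range once, when the targets are built. A minimal sketch follows; the helper name and the place it would be called from are illustrative, not part of the repo:

```python
# Sketch: remap COCO-style category ids (e.g. 1..N) to the 0..N-1 range
# Deformable DETR expects. This would typically run in the dataset's target
# preparation step; the function below is a hypothetical helper.
import torch

def remap_labels(target, cat_ids):
    """Map dataset category ids (e.g. [1, 2, 3, 4, 5]) to contiguous 0-based labels."""
    id_to_label = {cat_id: i for i, cat_id in enumerate(sorted(cat_ids))}
    target["labels"] = torch.as_tensor(
        [id_to_label[int(c)] for c in target["labels"]], dtype=torch.int64
    )
    return target

# Example: a 5-class dataset annotated with ids 1..5
target = {"labels": torch.tensor([1, 3, 5])}
target = remap_labels(target, cat_ids=[1, 2, 3, 4, 5])
print(target["labels"])  # tensor([0, 2, 4]) -- now valid for num_classes = 5
```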

mc-lgt commented 1 year ago

You can try adding `tgt_ids = torch.sub(tgt_ids, 1, alpha=1, out=None)` in matcher.py to shift the labels down by one.
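If you go this route, the subtraction would sit right after tgt_ids is assembled in the matcher's forward pass. A sketch, assuming the target labels are concatenated the way DETR's HungarianMatcher does it (`tgt_ids - 1` is equivalent to the torch.sub call above):

```python
# models/matcher.py, inside HungarianMatcher.forward -- sketch, assuming the
# matcher concatenates per-image target labels and boxes as in DETR.
tgt_ids = torch.cat([v["labels"] for v in targets])
tgt_ids = torch.sub(tgt_ids, 1)  # shift COCO-style ids 1..N down to 0..N-1
tgt_bbox = torch.cat([v["boxes"] for v in targets])
# the classification cost is then computed with the shifted ids
```

Note this only shifts the ids the matcher sees; if the dataset still emits 1-based labels, the classification loss in SetCriterion would need the same shift, so remapping the ids once in the dataset (as in the earlier sketch) is usually the cleaner fix.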