SwinTransformer / Swin-Transformer-Object-Detection

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.
https://arxiv.org/abs/2103.14030
Apache License 2.0

coco.py bug for only one class #46

Open lingcong-k opened 3 years ago

lingcong-k commented 3 years ago

Hi, I am training with just one class. In coco.py I set

CLASSES = ('person')

but later, when the number of classes is checked for consistency with

assert module.num_classes == len(dataset.CLASSES)

the check fails, because len(dataset.CLASSES) = len('person') = 6

With more than one class it works fine, because dataset.CLASSES is then a tuple.

bfialkoff commented 3 years ago

You need a comma: CLASSES = ('person', ). Otherwise it's interpreted as a string, and therefore the length is 6 (the number of letters in "person").
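
For anyone hitting the same thing, here is a minimal sketch of the difference. It is plain Python with nothing mmdet-specific, and the variable names are only for illustration:

```python
# Parentheses alone do not make a tuple; the trailing comma does.
as_string = ('person')   # just the string 'person'
as_tuple = ('person',)   # a one-element tuple

print(type(as_string).__name__, len(as_string))  # str 6
print(type(as_tuple).__name__, len(as_tuple))    # tuple 1

# Iterating shows why the class count goes wrong with the string form:
# the string yields one item per letter, the tuple yields one class name.
print(list(as_string))  # ['p', 'e', 'r', 's', 'o', 'n']
print(list(as_tuple))   # ['person']
```

So with the string form, a check like the one quoted above compares num_classes against 6 instead of 1.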

lingcong-k commented 3 years ago

@bfialkoff Thanks for your reply. Weird: yesterday I also tried ('person',), and it led to another error, shown below.

2021-05-31 14:52:20,658 - mmdet - INFO - workflow: [('train', 1)], max: 36 epochs
INFO:mmdet:workflow: [('train', 1)], max: 36 epochs
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-05-31 14:52:23,657 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
INFO:mmcv:Reducer buckets have been rebuilt in this iteration.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [116,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [117,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [118,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [119,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [120,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [121,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "tools/train.py", line 187, in <module>
    main()
  File "tools/train.py", line 183, in main
    meta=meta)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/apis/train.py", line 185, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 51, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/base.py", line 247, in train_step
    losses = self(**data)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 124, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/base.py", line 181, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/two_stage.py", line 156, in forward_train
    proposal_cfg=proposal_cfg)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/base_dense_head.py", line 54, in forward_train
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/rpn_head.py", line 78, in loss
    gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 209, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 466, in loss
    label_channels=label_channels)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 348, in get_targets
    unmap_outputs=unmap_outputs)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/core/utils/misc.py", line 29, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 225, in _get_targets_single
    gt_bboxes)
  File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/core/bbox/samplers/base_sampler.py", line 97, in sample
    neg_inds = neg_inds.unique()
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/tensor.py", line 511, in unique
    return torch.unique(self, sorted=sorted, return_inverse=return_inverse, return_counts=return_counts, dim=dim)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/_jit_internal.py", line 365, in fn
    return if_false(*args, **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/_jit_internal.py", line 365, in fn
    return if_false(*args, **kwargs)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/functional.py", line 831, in _return_output
    output, _, _ = _unique_impl(input, sorted, return_inverse, return_counts, dim)
  File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/functional.py", line 749, in _unique_impl
    return_counts=return_counts,
RuntimeError: transform: failed to synchronize: cudaErrorAssert: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554800319/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f3b518972f2 in /home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f3b5189467b in /home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/lib/libc10.so)

Do you by chance know why? Thanks :)

bfialkoff commented 3 years ago

I'm working on the same issue. In the meantime I have an ugly workaround. Alternatively, you can train in distributed mode using the train.sh file with 1 GPU (assuming you only have one GPU).

On Mon, May 31, 2021, 15:32 lingcong-k @.***> wrote:

Weird: yesterday I also tried ('person',), and it led to another error even before the assert line (some transform error, I don't remember exactly). I tried it again now and it seems OK, but I still get this error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

which seems related to distributed training (I am training without distributed training).

lingcong-k commented 3 years ago

I tried train.sh with one GPU too, but I got the same error.

But I solved it by installing an mmcv wheel mentioned in another issue (I can't find it now). I also upgraded torch to the newest version.

bfialkoff commented 3 years ago

It would be nice if you could track it down.

lingcong-k commented 3 years ago

I uninstalled mmcv-full and installed a newer version, following https://github.com/open-mmlab/mmdetection/issues/2627. More specifically, I used this command: "pip install mmcv-full==1.2.4 -i https://pypi.tuna.tsinghua.edu.cn/simple"

and then it worked!

nikhil031294 commented 3 years ago

I wrote the classes like this and it worked: classes = tuple(['hands'])
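
One caveat with this form: tuple() iterates over its argument, so the list wrapper matters. Below is a rough sketch of the difference. Note that normalize_classes is a hypothetical helper written for this example, not part of mmdet:

```python
# tuple() on a list keeps each list element as one item.
from_list = tuple(['hands'])
print(from_list)  # ('hands',)

# tuple() on a bare string splits it into characters.
from_string = tuple('hands')
print(from_string)  # ('h', 'a', 'n', 'd', 's')


def normalize_classes(classes):
    """Hypothetical helper: always return class names as a tuple.

    A bare string is treated as a single class name instead of being
    split into characters.
    """
    if isinstance(classes, str):
        return (classes,)
    return tuple(classes)


print(normalize_classes('hands'))              # ('hands',)
print(normalize_classes(['hands', 'person']))  # ('hands', 'person')
```

Something like this could guard against the original mistake, assuming you control the code that defines the class list.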