Open lingcong-k opened 3 years ago
You need a trailing comma:
CLASSES = ('person', )
Otherwise it's interpreted as a string, so the length is 6 (the number of letters in 'person').
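To make the difference concrete, here is a minimal plain-Python sketch (no mmdet needed). The variable names are only for illustration and are not taken from coco.py:

```python
# Parentheses alone do not create a tuple; the trailing comma does.
classes_str = ('person')      # still just the string 'person'
classes_tuple = ('person',)   # a one-element tuple

print(type(classes_str).__name__, len(classes_str))      # str 6
print(type(classes_tuple).__name__, len(classes_tuple))  # tuple 1
```

Any check that calls len() on the first form counts letters, not classes.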
@bfialkoff Thanks for your reply. Weird: yesterday I also tried ('person',), and it led to another error, shown below.
2021-05-31 14:52:20,658 - mmdet - INFO - workflow: [('train', 1)], max: 36 epochs
INFO:mmdet:workflow: [('train', 1)], max: 36 epochs
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
2021-05-31 14:52:23,657 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
INFO:mmcv:Reducer buckets have been rebuilt in this iteration.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [116,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [117,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [118,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [119,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [120,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [121,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "tools/train.py", line 187, in <module>
main()
File "tools/train.py", line 183, in main
meta=meta)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/apis/train.py", line 185, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 51, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/base.py", line 247, in train_step
losses = self(**data)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 124, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/base.py", line 181, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/detectors/two_stage.py", line 156, in forward_train
proposal_cfg=proposal_cfg)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/base_dense_head.py", line 54, in forward_train
losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/rpn_head.py", line 78, in loss
gt_bboxes_ignore=gt_bboxes_ignore)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 209, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 466, in loss
label_channels=label_channels)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 348, in get_targets
unmap_outputs=unmap_outputs)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/core/utils/misc.py", line 29, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/models/dense_heads/anchor_head.py", line 225, in _get_targets_single
gt_bboxes)
File "/home/ling/my_projects/panoptic_seg/code/swin2/mmdet/core/bbox/samplers/base_sampler.py", line 97, in sample
neg_inds = neg_inds.unique()
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/tensor.py", line 511, in unique
return torch.unique(self, sorted=sorted, return_inverse=return_inverse, return_counts=return_counts, dim=dim)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/_jit_internal.py", line 365, in fn
return if_false(*args, **kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/_jit_internal.py", line 365, in fn
return if_false(*args, **kwargs)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/functional.py", line 831, in _return_output
output, _, _ = _unique_impl(input, sorted, return_inverse, return_counts, dim)
File "/home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/functional.py", line 749, in _unique_impl
return_counts=return_counts,
RuntimeError: transform: failed to synchronize: cudaErrorAssert: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554800319/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f3b518972f2 in /home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f3b5189467b in /home/ling/anaconda3/envs/swin_de/lib/python3.7/site-packages/torch/lib/libc10.so)
Do you by chance know why? Thanks :)
I'm working on the same issue. In the meantime I have an ugly workaround; alternatively, you can train distributed using the train.sh file and set 1 GPU (assuming you only have 1 GPU).
On Mon, May 31, 2021, 15:32 lingcong-k wrote:
Weird: yesterday I also tried ('person',) and it led to another error even before the assert line (some transform error, I don't remember exactly). I tried it again now and that part seems OK, but it still fails with this error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
This seems related to distributed training (I am training without distributed training).
I tried train.sh with one GPU too, but it's the same error.
I solved it by installing an mmcv wheel mentioned in another issue (can't find it now). I also upgraded torch to the newest version.
It would be nice if you could track it down.
I uninstalled mmcv-full and installed the newer version as described here: https://github.com/open-mmlab/mmdetection/issues/2627 More specifically, with this command: "pip install mmcv-full==1.2.4 -i https://pypi.tuna.tsinghua.edu.cn/simple"
And then it worked!
classes = tuple(['hands'])
I wrote the classes like this and it worked.
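One caveat with this form, shown as a small sketch: the list brackets matter. tuple() iterates over its argument, so passing the bare string splits it into characters. The variable names below are illustrative only:

```python
# tuple() iterates over whatever it is given.
from_list = tuple(['hands'])  # iterates the list: one element, 'hands'
from_str = tuple('hands')     # iterates the string: one element per letter

print(from_list)  # ('hands',)
print(from_str)   # ('h', 'a', 'n', 'd', 's')
```

So tuple(['hands']) and ('hands',) are equivalent, while tuple('hands') would give five "classes".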
Hi, I am training with just one class, so in coco.py I set
CLASSES = ('person')
but later, when the number of classes is checked for consistency,
len(dataset.CLASSES) = len('person') = 6
With more than one class it's fine, because dataset.CLASSES is then a tuple.
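A defensive way to catch this early is to normalise the class list before comparing lengths. The sketch below is an assumption for illustration: as_class_tuple and check_num_classes are hypothetical helpers, not functions from mmdet or mmcv.

```python
def as_class_tuple(classes):
    """Hypothetical helper (not part of mmdet): always return a tuple of names."""
    if isinstance(classes, str):
        # A bare string means a single class, not a sequence of letters.
        return (classes,)
    return tuple(classes)


def check_num_classes(classes, num_classes):
    """Hypothetical check: fail loudly if the head size and class list disagree."""
    classes = as_class_tuple(classes)
    if len(classes) != num_classes:
        raise ValueError(
            f"expected {num_classes} classes, got {len(classes)}: {classes}"
        )
    return classes


print(check_num_classes('person', 1))             # ('person',)
print(check_num_classes(('person', 'hands'), 2))  # ('person', 'hands')
```

With a check like this, CLASSES = ('person') would be treated as one class instead of six.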