FZJ-INM1-BDA / celldetection

Scalable Instance Segmentation using PyTorch & PyTorch Lightning.
https://docs.celldetection.org
Apache License 2.0

Potential Bug in CPN? #3

Closed ppriyank closed 3 years ago

ppriyank commented 3 years ago

https://github.com/FZJ-INM1-BDA/celldetection/blob/main/demos/Cell%20Detection%20with%20Contour%20Proposal%20Networks.ipynb

In the above tutorial, when I set cpn='CpnU22' in conf = cd.Config(....), the first epoch of training completes, but during the second epoch I get the following error:

Epoch 2/100 - loss 12.061:  56%|███████████████████████████████████████████████████▍                                        | 286/512 [02:23<01:54,  1.98it/s]
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [55,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
...
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [85,0,0], thread: [94,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [85,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
...
...

Traceback (most recent call last):
  File "train2.py", line 231, in <module>
    train_epoch(model, train_loader, conf.device, optimizer, f'Epoch {epoch}/{conf.epochs}', scaler, scheduler)
  File "train2.py", line 212, in train_epoch
    outputs: dict = model(batch['inputs'], targets=batch)
  File "/home/ppriyank/anaconda3/envs/pathak/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ppriyank/covid_cell/celldetection/celldetection/models/cpn.py", line 441, in forward
    buckets = resolve_refinement_buckets(sampling, self.core.refinement_buckets)
  File "/home/ppriyank/covid_cell/celldetection/celldetection/ops/cpn.py", line 203, in resolve_refinement_buckets
    (a % num_buckets, refinement_bucket_weight(a, base_index)),
  File "/home/ppriyank/covid_cell/celldetection/celldetection/ops/cpn.py", line 193, in refinement_bucket_weight
    dist[sel] = 0
RuntimeError: CUDA error: device-side assert triggered
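
The assertion in the log above is the per-dimension bounds check applied during indexing: `index >= -sizes[i] && index < sizes[i]`. As a minimal sketch of that rule only, the pure-Python snippet below lists which indices would violate it. The helper `out_of_bounds` and the sample values are hypothetical and are not part of celldetection or PyTorch.

```python
def out_of_bounds(indices, size):
    """Return the indices that violate -size <= index < size.

    Hypothetical helper: mirrors the condition quoted in the assertion
    message, for a single dimension of length `size`.
    """
    return [i for i in indices if not (-size <= i < size)]


# Made-up values for illustration: a dimension of length 6 accepts -6..5.
bad = out_of_bounds([0, 5, -6, 6, -7], size=6)
print(bad)  # [6, -7]
```

Because CUDA kernels run asynchronously, the Python line shown in the traceback is not necessarily the line that produced the bad index. Re-running with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on the CPU, usually gives a more precise error location.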

ericup commented 3 years ago

Thank you for your feedback! I suspect this is a duplicate of #1. Would you please confirm this by testing the suggested workaround?

ppriyank commented 3 years ago

Oh yes, this is the same issue. Apologies.