Haochen-Wang409 / U2PL

[CVPR'22 & IJCV'24] Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels & Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation
Apache License 2.0

CUDA error: device-side assert triggered #141

Open Mantee0810 opened 1 year ago

Mantee0810 commented 1 year ago

Hi author, I have used this code to train on the VOC dataset with very good results. However, when I try to train on the Cityscapes dataset, I get the following error. Do you have any thoughts on this? Looking forward to your reply.

/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [190,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [190,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "../../../../train_semi.py", line 679, in <module>
    main()
  File "../../../../train_semi.py", line 187, in main
    train(
  File "../../../../train_semi.py", line 361, in train
    sup_loss = sup_loss_fn([pred_l_large, aux], label_l.clone())
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mtmtmt/Projects/U2PL-main/u2pl/utils/loss_helper.py", line 371, in forward
    loss1 = self._criterion1(main_pred, target)
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mtmtmt/Projects/U2PL-main/u2pl/utils/loss_helper.py", line 535, in forward
    mask_prob = prob[target, torch.arange(len(target), dtype=torch.long)]
RuntimeError: CUDA error: device-side assert triggered
/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 72903) of binary: /home/mtmtmt/anaconda3/envs/u2pl/bin/python
Traceback (most recent call last):
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mtmtmt/anaconda3/envs/u2pl/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Haochen-Wang409 commented 1 year ago

Could you share the details of your configuration? For example, your config.yaml and the number of GPUs you used.

yuan7021 commented 1 year ago

Hello, I'm having the same problem. I'm only running the Cityscapes dataset. How can I fix this, please?
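
A possible first check, offered as a sketch and not a confirmed diagnosis: the assertion in the traceback fires at `prob[target, torch.arange(len(target), dtype=torch.long)]`, so some value in `target` falls outside the class dimension of `prob`. With Cityscapes, one common cause of this kind of error is label maps that still contain raw label IDs instead of train IDs. The helper below is hypothetical and not part of U2PL; `num_classes=19` and `ignore_index=255` are assumptions based on the usual Cityscapes train-ID convention and should be matched to your config.

```python
from typing import Iterable, List


def find_invalid_label_ids(
    label_ids: Iterable[int], num_classes: int, ignore_index: int = 255
) -> List[int]:
    """Return the sorted, de-duplicated IDs that are neither a valid class
    in [0, num_classes) nor the ignore value."""
    invalid = set()
    for value in label_ids:
        value = int(value)
        if value != ignore_index and not 0 <= value < num_classes:
            invalid.add(value)
    return sorted(invalid)


# Hypothetical usage in the training loop, just before the supervised loss
# (label_l is the label tensor named in the traceback; 19 is an assumption):
#   bad = find_invalid_label_ids(label_l.unique().tolist(), num_classes=19)
#   assert not bad, f"unexpected label IDs: {bad}"

print(find_invalid_label_ids([0, 7, 18, 255], num_classes=19))      # []
print(find_invalid_label_ids([0, 7, 26, 33, 255], num_classes=19))  # [26, 33]
```

If the check reports IDs such as 26 or 33, the label files or the ID mapping in the data pipeline are the likely place to look. If it reports nothing, the cause is probably elsewhere, and the configuration details requested above would help narrow it down.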