SegmentationBLWX / sssegmentation

SSSegmentation: An Open Source Supervised Semantic Segmentation Toolbox Based on PyTorch.
https://sssegmentation.readthedocs.io/en/latest/
Apache License 2.0

error during training #47

Closed umarjibrilmohd closed 7 months ago

umarjibrilmohd commented 8 months ago

(venv) mohammed@c24032:~/sssegmentation$ bash scripts/dist_train.sh 4 ssseg/configs/annnet/annnet_resnet50os16_ade20k.py


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2024-01-10 08:33:59 WARNING ngpus_per_node is not equal to nproc_per_node, force ngpus_per_node = nproc_per_node by default
Traceback (most recent call last):
  File "ssseg/train.py", line 252, in <module>
    main()
  File "ssseg/train.py", line 247, in main
    client.start()
  File "ssseg/train.py", line 70, in start
    torch.cuda.set_device(cmd_args.local_rank)
  File "/home/mohammed/model/new ss/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)

umarjibrilmohd commented 8 months ago

I also got this dependency conflict while installing PyTorch.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.8.1 requires torch==1.7.1, but you have torch 1.8.0 which is incompatible.
Successfully installed torch-1.8.0 torchaudio-0.8.0 torchvision-0.9.0

umarjibrilmohd commented 8 months ago

I'm using Python 3.8 with the following details:

"/home/mohammed/model/new ss/venv/bin/python" /home/mohammed/sssegmentation/ssseg/set.py CUDA Available: True CUDA Version: 10.2 PyTorch Version: 1.8.0

Process finished with exit code 0

CharlesPikachu commented 8 months ago

Please refer to the official documentation to install PyTorch.

umarjibrilmohd commented 8 months ago

mohammed@c24032:~/sssegmentation$ bash scripts/dist_train.sh 4 ssseg/configs/annnet/annnet_resnet50os16_ade20k.py


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2024-01-11 10:43:57 WARNING ngpus_per_node is not equal to nproc_per_node, force ngpus_per_node = nproc_per_node by default
Traceback (most recent call last):
  File "ssseg/train.py", line 253, in <module>
    main()
  File "ssseg/train.py", line 248, in main
    client.start()
  File "ssseg/train.py", line 71, in start
    torch.cuda.set_device(cmd_args.local_rank)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "ssseg/train.py", line 253, in <module>
    main()
  File "ssseg/train.py", line 248, in main
    client.start()
  File "ssseg/train.py", line 71, in start
    torch.cuda.set_device(cmd_args.local_rank)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Killing subprocess 1287381
Killing subprocess 1287382
Killing subprocess 1287383
Killing subprocess 1287384
Traceback (most recent call last):
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mohammed/miniconda3/envs/myenv/bin/python', '-u', 'ssseg/train.py', '--local_rank=3', '--nproc_per_node', '4', '--cfgfilepath', 'ssseg/configs/annnet/annnet_resnet50os16_ade20k.py']' returned non-zero exit status 1.

umarjibrilmohd commented 8 months ago

Still happening during training.

umarjibrilmohd commented 8 months ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   46C    P8    21W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   48C    P8    20W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I have 2 GPUs on my remote server.

CharlesPikachu commented 8 months ago

Please set num_gpus to 2, like: bash scripts/dist_train.sh 2 xxx.config
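
As a quick sanity check (not part of the repo, just a generic PyTorch snippet), you can confirm how many GPUs PyTorch actually sees before choosing the first argument of dist_train.sh:

```python
# Generic PyTorch check: list the CUDA devices visible to this environment,
# so the GPU count passed to scripts/dist_train.sh matches what is available.
import torch

if torch.cuda.is_available():
    count = torch.cuda.device_count()
    print(f"{count} CUDA device(s) visible")
    for idx in range(count):
        print(f"  cuda:{idx} -> {torch.cuda.get_device_name(idx)}")
else:
    print("CUDA is not available in this environment")
```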

umarjibrilmohd commented 8 months ago

File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing) RuntimeError: CUDA out of memory. Tried to allocate 1.17 GiB (GPU 0; 10.76 GiB total capacity; 7.37 GiB already allocated; 525.44 MiB free; 9.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1351945) of binary: /home/mohammed/miniconda3/envs/myenv/bin/python Traceback (most recent call last): File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in main() File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main launch(args) File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch run(args) File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run elastic_launch( File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

ssseg/train.py FAILED

Failures:
  [1]:
    time       : 2024-01-11_13:39:07
    host       : c24032
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 1351946)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2024-01-11_13:39:07
    host       : c24032
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 1351945)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

umarjibrilmohd commented 8 months ago

This could be related to a memory issue. Where can I find the batch size so I can reduce it?

umarjibrilmohd commented 8 months ago

After solving the memory problem, I would like to use my own single-class dataset containing only images and annotations. How do I handle the txt files (objectInfo150.txt and sceneCategories.txt)?

umarjibrilmohd commented 8 months ago

I have 2 questions above:

  1. the memory-related one (where the batch size is located, so I can reduce it)
  2. how do I modify the config to fit my own dataset?

umarjibrilmohd commented 8 months ago

(attached image: 20200602_0005g_label_5)

CharlesPikachu commented 8 months ago

https://sssegmentation.readthedocs.io/en/latest/Tutorials.html#customize-datasets
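
The tutorial above covers registering a custom dataset, so the ADE20k-specific objectInfo150.txt and sceneCategories.txt do not have to be recreated; what the loss ultimately needs is annotation images whose pixel values are class indices. As a rough, hypothetical preprocessing sketch (the directory names and the 128 threshold are assumptions, not anything from sssegmentation):

```python
# remap_masks.py: convert binary annotation images (e.g. 0/255 foreground masks)
# into index maps suitable for cross-entropy training: 0 = background, 1 = object.
# Paths and the 128 threshold are illustrative assumptions.
import glob
import os

import numpy as np
from PIL import Image

src_dir = "annotations_raw"
dst_dir = "annotations"
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.png")):
    mask = np.array(Image.open(path).convert("L"))
    index_map = (mask > 128).astype(np.uint8)  # 1 where the object is, 0 elsewhere
    Image.fromarray(index_map, mode="L").save(os.path.join(dst_dir, os.path.basename(path)))
```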

umarjibrilmohd commented 8 months ago

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 10.76 GiB total capacity; 2.12 GiB already allocated; 38.44 MiB free; 2.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1484169 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1484168) of binary: /home/mohammed/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

ssseg/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
  [0]:
    time       : 2024-01-15_05:01:02
    host       : c24032
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 1484168)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
mohammed@c24032:~/sssegmentation$

umarjibrilmohd commented 8 months ago

Please, what could be the cause of this error, and how do I address it?

umarjibrilmohd commented 8 months ago

I'm on a remote server with a large memory capacity.

umarjibrilmohd commented 8 months ago

Usually I resolve this issue by reducing the batch size, which I could not find here.

CharlesPikachu commented 8 months ago

You can modify the batch size like this:

SEGMENTOR_CFG['dataloader']['expected_total_train_bs_for_assert'] = 2

Note that you should adjust the learning rate accordingly if you change the total batch size.
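
Put together, a sketch of that advice as it would look inside the config file (SEGMENTOR_CFG is the dict the configs already define; the assumed original total batch size of 16 and the linear scaling rule are illustrative, not taken from this repo):

```python
# Sketch (inside the config file, after SEGMENTOR_CFG is defined):
# shrink the total train batch size and scale the learning rate linearly.
assumed_default_total_bs = 16   # assumption: a typical ADE20k default, not verified here
base_lr = 0.01                  # the SGD lr shown in this config's scheduler section

new_total_bs = 2
SEGMENTOR_CFG['dataloader']['expected_total_train_bs_for_assert'] = new_total_bs

# linear scaling heuristic: lr scales with the total batch size
SEGMENTOR_CFG['scheduler']['optimizer']['lr'] = base_lr * new_total_bs / assumed_default_total_bs
```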

umarjibrilmohd commented 8 months ago

I modified the batch size in the default dataloader config and the LR in the base scheduler, but the above error didn't change.

umarjibrilmohd commented 8 months ago

'RandomCrop': {'crop_size': (256, 256)}
'Padding': {'output_size': (256, 256)}

The error has gone after adjusting to this, thanks for your guidance. I will still get back to you.
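
For context, those two entries live in the train data_pipelines of the config; the sketch below mirrors the pipeline printed later in this thread, with only the crop and padding sizes reduced to (256, 256):

```python
# Sketch: the train-time data pipeline from this config, with the crop and padding
# sizes reduced to (256, 256). Other entries are copied from the config dump that
# appears later in this thread.
SEGMENTOR_CFG['dataset']['train']['data_pipelines'] = [
    ('Resize', {'output_size': (2048, 512), 'keep_ratio': True, 'scale_range': (0.5, 2.0)}),
    ('RandomCrop', {'crop_size': (256, 256), 'one_category_max_ratio': 0.75}),
    ('RandomFlip', {'flip_prob': 0.5}),
    ('PhotoMetricDistortion', {}),
    ('Normalize', {'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375]}),
    ('ToTensor', {}),
    ('Padding', {'output_size': (256, 256), 'data_type': 'tensor'}),
]
```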

umarjibrilmohd commented 7 months ago

2024-01-21 13:29:20 INFO Config file path: /home/mohammed/sssegmentation/ssseg/configs/annnet/annnet_resnet50os16_ade20k.py
2024-01-21 13:29:20 INFO Config details: {
    'type': 'ANNNet', 'num_classes': 1, 'benchmark': True, 'align_corners': False, 'backend': 'nccl',
    'work_dir': 'annnet_resnet50os16_ade20k',
    'logfilepath': 'annnet_resnet50os16_ade20k/umar_annnet_resnet50os16_ade20k.log',
    'log_interval_iterations': 50, 'eval_interval_epochs': 10, 'save_interval_epochs': 1,
    'resultsavepath': 'annnet_resnet50os16_ade20k/umar_annnet_resnet50os16_ade20k_results.pkl',
    'norm_cfg': {'type': 'SyncBatchNorm'},
    'act_cfg': {'type': 'ReLU', 'inplace': True},
    'backbone': {'type': 'ResNet', 'depth': 50, 'structure_type': 'resnet50conv3x3stem', 'pretrained': False, 'outstride': 16, 'use_conv3x3_stem': True, 'selected_indices': (2, 3)},
    'head': {'in_channels_list': [1024, 2048], 'transform_channels': 256, 'query_scales': (1,), 'feats_channels': 512, 'key_pool_scales': (1, 3, 6, 8), 'dropout': 0.1},
    'auxiliary': {'in_channels': 1024, 'out_channels': 512, 'dropout': 0.1},
    'losses': {'loss_aux': {'type': 'CrossEntropyLoss', 'scale_factor': 0.4, 'ignore_index': 255, 'reduction': 'mean'},
               'loss_cls': {'type': 'CrossEntropyLoss', 'scale_factor': 1.0, 'ignore_index': 255, 'reduction': 'mean'}},
    'inference': {'mode': 'whole', 'opts': {}, 'tricks': {'multiscale': [1], 'flip': False, 'use_probs_before_resize': False}},
    'scheduler': {'type': 'PolyScheduler', 'max_epochs': 130, 'power': 0.9,
                  'optimizer': {'type': 'SGD', 'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0005, 'params_rules': {}}},
    'dataset': {'type': 'ADE20kDataset', 'rootdir': '/home/mohammed/sssegmentation/ADE20k',
                'train': {'set': 'train', 'data_pipelines': [('Resize', {'output_size': (2048, 512), 'keep_ratio': True, 'scale_range': (0.5, 2.0)}), ('RandomCrop', {'crop_size': (256, 256), 'one_category_max_ratio': 0.75}), ('RandomFlip', {'flip_prob': 0.5}), ('PhotoMetricDistortion', {}), ('Normalize', {'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375]}), ('ToTensor', {}), ('Padding', {'output_size': (256, 256), 'data_type': 'tensor'})]},
                'test': {'set': 'val', 'data_pipelines': [('Resize', {'output_size': (2048, 512), 'keep_ratio': True, 'scale_range': None}), ('Normalize', {'mean': [123.675, 116.28, 103.53], 'std': [58.395, 57.12, 57.375]}), ('ToTensor', {})]}},
    'dataloader': {'expected_total_train_bs_for_assert': 2, 'auto_adapt_to_expected_train_bs': True,
                   'train': {'batch_size_per_gpu': 2, 'num_workers_per_gpu': 2, 'shuffle': True, 'pin_memory': True, 'drop_last': True},
                   'test': {'batch_size_per_gpu': 1, 'num_workers_per_gpu': 2, 'shuffle': False, 'pin_memory': True, 'drop_last': False}}}
2024-01-21 13:29:20 INFO Resume from:
/pytorch/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: nll_loss2d_forward_kernel: block: [0,0,0], thread: [992,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: nll_loss2d_forward_kernel: block: [0,0,0], thread: [993,0,0] Assertion `t >= 0 && t < n_classes` failed.
[... the same assertion repeated for many more thread indices ...]
Traceback (most recent call last):
  File "ssseg/train.py", line 261, in <module>
    main()
  File "ssseg/train.py", line 256, in main
    client.start()
  File "ssseg/train.py", line 154, in start
    loss, losses_log_dict = segmentor(images, targets, **forward_kwargs)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mohammed/sssegmentation/ssseg/modules/models/segmentors/annnet/annnet.py", line 63, in forward
    loss, losses_log_dict = self.customizepredsandlosses(
  File "/home/mohammed/sssegmentation/ssseg/modules/models/segmentors/base/base.py", line 52, in customizepredsandlosses
    return self.calculatelosses(predictions=outputs_dict, targets=targets, losses_cfg=losses_cfg, map_preds_to_tgts_dict=map_preds_to_tgts_dict)
  File "/home/mohammed/sssegmentation/ssseg/modules/models/segmentors/base/base.py", line 188, in calculatelosses
    losses_log_dict[loss_key] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fec82b80d62 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7fec82de34d3 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fec82de3ee2 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fec82b6a314 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29a469 (0x7fecdf53d469 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae0341 (0x7fecdfd83341 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7fecdfd83642 in /home/mohammed/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4e0970]
frame #8: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1828]
frame #9: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #10: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #11: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #12: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #13: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #14: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #15: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #16: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4f1811]
frame #17: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x4c9310]
frame #18: PyDict_SetItemString + 0x52 (0x581a82 in /home/mohammed/miniconda3/envs/myenv/bin/python)
frame #19: PyImport_Cleanup + 0x93 (0x5a6cb3 in /home/mohammed/miniconda3/envs/myenv/bin/python)
frame #20: Py_FinalizeEx + 0x71 (0x5a5de1 in /home/mohammed/miniconda3/envs/myenv/bin/python)
frame #21: Py_RunMain + 0x112 (0x5a1ab2 in /home/mohammed/miniconda3/envs/myenv/bin/python)
frame #22: Py_BytesMain + 0x39 (0x579e89 in /home/mohammed/miniconda3/envs/myenv/bin/python)
frame #23: __libc_start_main + 0xf2 (0x7fece1fb3cb2 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: /home/mohammed/miniconda3/envs/myenv/bin/python() [0x579d3d]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1592929 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 1592930) of binary: /home/mohammed/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):

umarjibrilmohd commented 7 months ago

This is the error I'm getting when using my custom dataset with the ADE20k config. Can you help me address it?

CharlesPikachu commented 7 months ago

Please set num_classes to 2, since you are using cross-entropy loss.
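
For background: the device-side assert above (`t >= 0 && t < n_classes`) fires when a label value falls outside [0, num_classes), so with num_classes = 1 any pixel labeled 1 already triggers it. A minimal sketch of the change, assuming the annotation PNGs use 0 for background, 1 for the object, and 255 for ignored pixels:

```python
# Sketch: treat a single-object dataset as a two-class problem for CrossEntropyLoss.
# num_classes is the key shown in the config dump above; the label convention in
# the comments (0 = background, 1 = object, 255 = ignore) is an assumption about
# how the custom annotations are encoded.
SEGMENTOR_CFG['num_classes'] = 2  # class 0 = background, class 1 = the object

# Both losses in this config already set ignore_index=255, so pixels labeled 255
# are excluded from the loss.
```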