facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

CUDA error: device-side assert triggered error when I run clusterfit #434

Closed doulemint closed 2 years ago

doulemint commented 2 years ago

I'm trying to rerun ClusterFit on the ImageNet-1k dataset.

I directly use the pretrain config file. Here is my command line:

!python tools/run_distributed_engines.py \
  config=pretrain/clusterfit/cluster_features_resnet_8gpu_imagenet \
  config.CHECKPOINT.DIR="./checkpoints_feature" \
  config.DISTRIBUTED.NUM_NODES=1 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
  config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME="" \
  config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch \
  config.CLUSTERFIT.OUTPUT_DIR="./output" \
  config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk.base_model._feature_blocks." \
  config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/converted_vissl_rn50_rotnet_16kclusters_in1k_ep105.torch"

But I got this error:

INFO 2021-10-05 03:49:34,097 state_update_hooks.py: 113: Starting phase 0 [train]
INFO 2021-10-05 03:49:34,592 log_hooks.py: 77: ========= Memory Summary at on_forward =======
===========================================================================
PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 129960 KB 2619 MB 20332 MB 20205 MB
from large pool 111872 KB 2601 MB 20314 MB 20205 MB
from small pool 18088 KB 17 MB 17 MB 0 MB
---------------------------------------------------------------------------
Active memory 129960 KB 2619 MB 20332 MB 20205 MB
from large pool 111872 KB 2601 MB 20314 MB 20205 MB
from small pool 18088 KB 17 MB 17 MB 0 MB
---------------------------------------------------------------------------
GPU reserved memory 2838 MB 3956 MB 11826 MB 8988 MB
from large pool 2818 MB 3936 MB 11806 MB 8988 MB
from small pool 20 MB 20 MB 20 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 33879 KB 1443 MB 12990 MB 12957 MB
from large pool 31488 KB 1441 MB 12973 MB 12942 MB
from small pool 2391 KB 2 MB 17 MB 15 MB
---------------------------------------------------------------------------
Allocations 327 331 513 186
from large pool 19 24 163 144
from small pool 308 308 350 42
---------------------------------------------------------------------------
Active allocs 327 331 513 186
from large pool 19 24 163 144
from small pool 308 308 350 42
---------------------------------------------------------------------------
GPU reserved segments 19 21 29 10
from large pool 9 11 19 10
from small pool 10 10 10 0
---------------------------------------------------------------------------
Non-releasable allocs 10 14 86 76
from large pool 7 11 74 67
from small pool 3 5 12 9
===========================================================================

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [23,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('CUDA error: device-side assert triggered'),)
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('CUDA error: device-side assert triggered'),)
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
    raise e
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
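As an aside, the secondary "TypeError: not all arguments converted during string formatting" is unrelated to the CUDA failure. It comes from the `logging.error("Wrapping up, caught exception: ", e)` call visible in the call stack, which passes the exception as an extra positional argument without a format placeholder. A minimal illustration with the standard library (only to explain this extra noise in the log, not a patch to VISSL):

```python
import logging

err = RuntimeError("CUDA error: device-side assert triggered")

# What the call stack shows: no %s placeholder for the extra argument,
# so the handler reports "not all arguments converted during string formatting".
logging.error("Wrapping up, caught exception: ", err)

# Either of these logs the exception cleanly instead:
logging.error("Wrapping up, caught exception: %s", err)
logging.exception("Wrapping up, caught exception")  # inside an except block, includes the traceback
```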

I know this error has something to do with some dimension I set in the config file, but I have no clue which configuration setting I should change.
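For context, the `t >= 0 && t < n_classes` assertion in ClassNLLCriterion fires when at least one target label passed to the cross-entropy/NLL loss lies outside the range of classes the prediction head outputs. A minimal sketch in plain PyTorch (not VISSL code; the batch size, class count, and label value are only chosen to mirror the logs in this issue) that triggers the same failure:

```python
import torch
import torch.nn.functional as F

num_classes = 16000                                     # e.g. the 16k ClusterFit clusters
logits = torch.randn(32, num_classes)                   # model output: [batch_size, num_classes]
targets = torch.full((32,), 34200, dtype=torch.long)    # a label >= num_classes, like the one in the logs

# On CPU this raises "IndexError: Target 34200 is out of bounds."
# On GPU the same call fires the device-side assert shown above.
loss = F.cross_entropy(logits, targets)
```

Because CUDA kernels run asynchronously, the GPU error only surfaces at a later synchronization point (here the `task.last_batch.loss.data.cpu()` call in standard_train_step.py), which is why the traceback does not point at the loss itself.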

doulemint commented 2 years ago

Probably because I used ImageNet-mini, which doesn't have enough images. Since ClusterFit needs 16k clusters, maybe the pretrained model cannot fit such a small dataset?
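One way to test this hypothesis is to compare the labels the dataloader actually produces against the number of classes the prediction head is configured with. A rough sketch (not VISSL code; the .npy path is a placeholder for wherever your cluster assignments or labels are stored, and 16000 stands in for the head's configured class count):

```python
import numpy as np

# Placeholder path: point this at your actual cluster-assignment / label file.
labels = np.load("./output/cluster_assignments.npy")

num_classes = 16000  # the class count the prediction head is configured with
print("label range:", labels.min(), "to", labels.max())
print("unique labels:", len(np.unique(labels)))

# Every label must satisfy 0 <= label < num_classes, otherwise the NLL loss
# hits the device-side assert / IndexError seen in this issue.
assert labels.min() >= 0 and labels.max() < num_classes
```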

doulemint commented 2 years ago

I tried to run on CPU to see what caused this issue, and it gave me a new error. Hope this one will help:

INFO 2021-10-05 05:09:39,189 trainer_main.py: 323: Phase advanced. Rank: 0
INFO 2021-10-05 05:09:39,190 state_update_hooks.py: 113: Starting phase 0 [train]
task lossssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss: torch.Size([32, 2048]) torch.Size([32])
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 159, in standard_train_step
    local_loss = task.loss(model_output, target)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 962, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 34200 is out of bounds.
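This CPU run makes the mismatch explicit: the model output printed just above has shape [32, 2048], while at least one target label is 34200, far outside the [0, num_classes) range that cross-entropy accepts, which suggests the labels reaching the loss are not the cluster IDs the head was sized for. A small, hypothetical helper (the function name and the demo call are made up; the shapes mirror the log) that one could call right before `task.loss(model_output, target)` to get a readable error instead of the device-side assert:

```python
import torch

def check_targets(model_output: torch.Tensor, target: torch.Tensor) -> None:
    """Raise a readable error if any label falls outside the head's class range."""
    num_classes = model_output.shape[1]
    bad = (target < 0) | (target >= num_classes)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} targets outside [0, {num_classes}): "
            f"min={int(target.min())}, max={int(target.max())}"
        )

# Mirroring the shapes in the log: a 2048-way output and a label of 34200.
try:
    check_targets(torch.randn(32, 2048), torch.full((32,), 34200, dtype=torch.long))
except ValueError as e:
    print(e)  # 32 targets outside [0, 2048): min=34200, max=34200
```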

PanHaulin commented 2 years ago

I also met this problem. Can you tell me how you solved it? @doulemint