facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

CUDA error: device-side assert triggered error when I run clusterfit #434

Closed doulemint closed 2 years ago

doulemint commented 2 years ago

I'm trying to rerun ClusterFit on the ImageNet-1k dataset.

I directly use the pretrain config file. Here is my command line:

!python tools/run_distributed_engines.py \
  config=pretrain/clusterfit/cluster_features_resnet_8gpu_imagenet \
  config.CHECKPOINT.DIR="./checkpoints_feature" \
  config.DISTRIBUTED.NUM_NODES=1 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
  config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME="" \
  config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch \
  config.CLUSTERFIT.OUTPUT_DIR="./output" \
  config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk.base_model._feature_blocks." \
  config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/converted_vissl_rn50_rotnet_16kclusters_in1k_ep105.torch"

But I got this error:

INFO 2021-10-05 03:49:34,097 state_update_hooks.py: 113: Starting phase 0 [train]
INFO 2021-10-05 03:49:34,592 log_hooks.py: 77: ========= Memory Summary at on_forward =======
===========================================================================
PyTorch CUDA memory summary, device ID 0
CUDA OOMs: 0 cudaMalloc retries: 0
===========================================================================
Metric Cur Usage Peak Usage Tot Alloc Tot Freed
---------------------------------------------------------------------------
Allocated memory 129960 KB 2619 MB 20332 MB 20205 MB
from large pool 111872 KB 2601 MB 20314 MB 20205 MB
from small pool 18088 KB 17 MB 17 MB 0 MB
---------------------------------------------------------------------------
Active memory 129960 KB 2619 MB 20332 MB 20205 MB
from large pool 111872 KB 2601 MB 20314 MB 20205 MB
from small pool 18088 KB 17 MB 17 MB 0 MB
---------------------------------------------------------------------------
GPU reserved memory 2838 MB 3956 MB 11826 MB 8988 MB
from large pool 2818 MB 3936 MB 11806 MB 8988 MB
from small pool 20 MB 20 MB 20 MB 0 MB
---------------------------------------------------------------------------
Non-releasable memory 33879 KB 1443 MB 12990 MB 12957 MB
from large pool 31488 KB 1441 MB 12973 MB 12942 MB
from small pool 2391 KB 2 MB 17 MB 15 MB
---------------------------------------------------------------------------
Allocations 327 331 513 186
from large pool 19 24 163 144
from small pool 308 308 350 42
---------------------------------------------------------------------------
Active allocs 327 331 513 186
from large pool 19 24 163 144
from small pool 308 308 350 42
---------------------------------------------------------------------------
GPU reserved segments 19 21 29 10
from large pool 9 11 19 10
from small pool 10 10 10 0
---------------------------------------------------------------------------
Non-releasable allocs 10 14 86 76
from large pool 7 11 74 67
from small pool 3 5 12 9
===========================================================================

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [23,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('CUDA error: device-side assert triggered'),)
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('CUDA error: device-side assert triggered'),)
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
    raise e
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
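As an aside, the secondary "TypeError: not all arguments converted during string formatting" is unrelated to the CUDA failure. It comes from the `logging.error("Wrapping up, caught exception: ", e)` call visible in the call stack, which passes the exception as an extra positional argument without a format placeholder. A minimal illustration with the standard library (only to explain this extra noise in the log, not a patch to VISSL):

```python
import logging

err = RuntimeError("CUDA error: device-side assert triggered")

# What the call stack shows: no %s placeholder for the extra argument,
# so the handler reports "not all arguments converted during string formatting".
logging.error("Wrapping up, caught exception: ", err)

# Either of these logs the exception cleanly instead:
logging.error("Wrapping up, caught exception: %s", err)
logging.exception("Wrapping up, caught exception")  # inside an except block, includes the traceback
```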

I know this error has something to do with some dimension I set in the config file, but I have no clue which configuration setting I should change.
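For context, the `t >= 0 && t < n_classes` assertion in ClassNLLCriterion fires when at least one target label passed to the cross-entropy/NLL loss lies outside the range of classes the prediction head outputs. A minimal sketch in plain PyTorch (not VISSL code; the batch size, class count, and label value are only chosen to mirror the logs in this issue) that triggers the same failure:

```python
import torch
import torch.nn.functional as F

num_classes = 16000                                     # e.g. the 16k ClusterFit clusters
logits = torch.randn(32, num_classes)                   # model output: [batch_size, num_classes]
targets = torch.full((32,), 34200, dtype=torch.long)    # a label >= num_classes, like the one in the logs

# On CPU this raises "IndexError: Target 34200 is out of bounds."
# On GPU the same call fires the device-side assert shown above.
loss = F.cross_entropy(logits, targets)
```

Because CUDA kernels run asynchronously, the GPU error only surfaces at a later synchronization point (here the `task.last_batch.loss.data.cpu()` call in standard_train_step.py), which is why the traceback does not point at the loss itself.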

doulemint commented 2 years ago

Probably because I used ImageNet-mini, which doesn't have enough images. Since ClusterFit needs 16k clusters, maybe the pretrained model cannot fit such a small dataset?
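One way to test this hypothesis is to compare the labels the dataloader actually produces against the number of classes the prediction head is configured with. A rough sketch (not VISSL code; the .npy path is a placeholder for wherever your cluster assignments or labels are stored, and 16000 stands in for the head's configured class count):

```python
import numpy as np

# Placeholder path: point this at your actual cluster-assignment / label file.
labels = np.load("./output/cluster_assignments.npy")

num_classes = 16000  # the class count the prediction head is configured with
print("label range:", labels.min(), "to", labels.max())
print("unique labels:", len(np.unique(labels)))

# Every label must satisfy 0 <= label < num_classes, otherwise the NLL loss
# hits the device-side assert / IndexError seen in this issue.
assert labels.min() >= 0 and labels.max() < num_classes
```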

doulemint commented 2 years ago

I tried to run on CPU to see what caused this issue, and it gave me a new error. Hope this one will help:

INFO 2021-10-05 05:09:39,189 trainer_main.py: 323: Phase advanced. Rank: 0
INFO 2021-10-05 05:09:39,190 state_update_hooks.py: 113: Starting phase 0 [train]
task lossssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss: torch.Size([32, 2048]) torch.Size([32])
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 159, in standard_train_step
    local_loss = task.loss(model_output, target)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 962, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 34200 is out of bounds.
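This CPU run makes the mismatch explicit: the model output printed just above has shape [32, 2048], while at least one target label is 34200, far outside the [0, num_classes) range that cross-entropy accepts, which suggests the labels reaching the loss are not the cluster IDs the head was sized for. A small, hypothetical helper (the function name and the demo call are made up; the shapes mirror the log) that one could call right before `task.loss(model_output, target)` to get a readable error instead of the device-side assert:

```python
import torch

def check_targets(model_output: torch.Tensor, target: torch.Tensor) -> None:
    """Raise a readable error if any label falls outside the head's class range."""
    num_classes = model_output.shape[1]
    bad = (target < 0) | (target >= num_classes)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} targets outside [0, {num_classes}): "
            f"min={int(target.min())}, max={int(target.max())}"
        )

# Mirroring the shapes in the log: a 2048-way output and a label of 34200.
try:
    check_targets(torch.randn(32, 2048), torch.full((32,), 34200, dtype=torch.long))
except ValueError as e:
    print(e)  # 32 targets outside [0, 2048): min=34200, max=34200
```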

PanHaulin commented 2 years ago

I also met this problem. Can you tell me how you solved it? @doulemint