Closed: doulemint closed this issue 2 years ago.
Probably because I used imagenet-mini, which doesn't have enough images. Since ClusterFit needs 16k clusters, maybe the pretrained model can't fit such a small dataset?
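A quick way to test this hypothesis is to check whether every cluster label actually lies in the range the loss expects. A minimal sketch; the label-file path is a placeholder, adjust it to wherever your cluster assignments were written:

import torch

# Hypothetical sanity check: the GPU assert `t >= 0 && t < n_classes` fires
# when a target label falls outside the classifier head's output range.
# Load whatever cluster-assignment file the feature/clustering step produced
# (the path below is a placeholder) and compare its range to the head size.
n_classes = 16000                              # ClusterFit head size (16k clusters)
labels = torch.load("cluster_assignments.pt")  # placeholder path
print(f"labels span [{labels.min().item()}, {labels.max().item()}], "
      f"loss expects [0, {n_classes})")
assert labels.min() >= 0 and labels.max() < n_classes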
I tried running on CPU to see what caused this issue, and it gave me a new error. Hope this one helps:
INFO 2021-10-05 05:09:39,189 trainer_main.py: 323: Phase advanced. Rank: 0
INFO 2021-10-05 05:09:39,190 state_update_hooks.py: 113: Starting phase 0 [train]
task lossssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss: torch.Size([32, 2048]) torch.Size([32])
Traceback (most recent call last):
File "tools/run_distributed_engines.py", line 58, in
I also hit this problem. Can you tell me how you solved it? @doulemint
I'm trying to rerun ClusterFit on the ImageNet-1k dataset.
I directly use the pretrain config file. Here is my command line:
!python tools/run_distributed_engines.py \
  config=pretrain/clusterfit/cluster_features_resnet_8gpu_imagenet \
  config.CHECKPOINT.DIR="./checkpoints_feature" \
  config.DISTRIBUTED.NUM_NODES=1 \
  config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
  config.MODEL.WEIGHTS_INIT.STATE_DICT_KEY_NAME="" \
  config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch \
  config.CLUSTERFIT.OUTPUT_DIR="./output" \
  config.MODEL.WEIGHTS_INIT.APPEND_PREFIX="trunk.base_model._feature_blocks." \
  config.MODEL.WEIGHTS_INIT.PARAMS_FILE="/content/converted_vissl_rn50_rotnet_16kclusters_in1k_ep105.torch"
But I got this error:
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
(the same assertion failure is repeated for the remaining threads of the block, up to thread [31,0,0])
--- Logging error ---
Traceback (most recent call last):
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 162, in launch_distributed
    logging.error("Wrapping up, caught exception: ", e)
Message: 'Wrapping up, caught exception: '
Arguments: (RuntimeError('CUDA error: device-side assert triggered'),)
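(Side note: the "--- Logging error ---" block above is a separate, cosmetic bug in the error handler, not the root cause. The call `logging.error("Wrapping up, caught exception: ", e)` visible in the call stack passes the exception as a lazy-formatting argument without a `%s` placeholder. A minimal sketch of the bug and two standard fixes:)

import logging

try:
    raise RuntimeError("CUDA error: device-side assert triggered")
except RuntimeError as e:
    # Buggy pattern from the log: `e` becomes a %-formatting argument, but
    # the message has no placeholder, hence "TypeError: not all arguments
    # converted during string formatting".
    # logging.error("Wrapping up, caught exception: ", e)

    logging.error("Wrapping up, caught exception: %s", e)  # interpolate the exception
    logging.exception("Wrapping up, caught exception")     # or log the full traceback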
Traceback (most recent call last):
  File "tools/run_distributed_engines.py", line 58, in <module>
    hydra_main(overrides=overrides)
  File "tools/run_distributed_engines.py", line 46, in hydra_main
    hook_generator=default_hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 164, in launch_distributed
    raise e
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/content/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/content/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/content/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/content/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/content/vissl/vissl/trainer/train_steps/standard_train_step.py", line 165, in standard_train_step
    task.losses.append(task.last_batch.loss.data.cpu().item() * target.size(0))
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
I know this error has something to do with some dimensions I set in the config file, but I have no clue which configuration setting I should change.
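In case it helps: `t >= 0 && t < n_classes` means some target label handed to the classification loss is outside the head's output range, so the mismatch is between the number of clusters in your label file and the output dimension of the model head. Notably, the debug print earlier in the thread shows logits of shape [32, 2048] against targets of shape [32]; if those targets are 16k-cluster labels, any label >= 2048 trips the assert. A minimal sketch of the same failure reproduced on CPU, where PyTorch raises a readable error instead of the opaque kernel assert (the sizes here are made up):

import torch
import torch.nn as nn

# Made-up sizes: a head that outputs 10 classes, but one target label is 12.
logits = torch.randn(4, 10)             # batch of 4, 10 output classes
targets = torch.tensor([1, 3, 9, 12])   # 12 >= 10: out of range
loss = nn.CrossEntropyLoss()(logits, targets)
# On CPU this raises a clear out-of-bounds IndexError -- the same condition
# the GPU reports as `Assertion t >= 0 && t < n_classes failed`.

So the thing to reconcile is the head's output dimension in the config versus the number of clusters used when the labels were generated; the two must match.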