Closed: yvokeller closed this issue 4 months ago
Insights:
Current error when running multi-GPU training on 1 node with 2 GPUs (2 data loader workers):
```
Everything is up to date.
Stage 'download-satellite-data' didn't change, skipping
'data/ZH_2019_LW_NUTZUNGSFLAECHEN_F.dbf.dvc' didn't change, skipping
'data/ZH_2019_LW_NUTZUNGSFLAECHEN_F.prj.dvc' didn't change, skipping
'data/ZH_2019_LW_NUTZUNGSFLAECHEN_F.shp.dvc' didn't change, skipping
'data/ZH_2019_LW_NUTZUNGSFLAECHEN_F.shx.dvc' didn't change, skipping
'data/TG_2019_Nutzungsflaechen_2019.gpkg.dvc' didn't change, skipping
'data/georef-switzerland-kanton@public.geojson.dvc' didn't change, skipping
'data/labels_hierarchy.csv.dvc' didn't change, skipping
Stage 'rasterize' didn't change, skipping
Stage 'prepare-dataset' didn't change, skipping
Stage 'calculate-dataset-stats' didn't change, skipping
Stage 'download-prithvi-base-model' didn't change, skipping
Data and pipelines are up to date.
[NbConvertApp] Converting notebook model_training.ipynb to script
[NbConvertApp] Writing 2781 bytes to model_training.py
Warning: Unexpected keys in the state dict: ['mask_token', 'decoder_pos_embed', 'decoder_embed.weight', 'decoder_embed.bias', 'decoder_blocks.0.norm1.weight', 'decoder_blocks.0.norm1.bias', 'decoder_blocks.0.attn.qkv.weight', 'decoder_blocks.0.attn.qkv.bias', 'decoder_blocks.0.attn.proj.weight', 'decoder_blocks.0.attn.proj.bias', 'decoder_blocks.0.norm2.weight', 'decoder_blocks.0.norm2.bias', 'decoder_blocks.0.mlp.fc1.weight', 'decoder_blocks.0.mlp.fc1.bias', 'decoder_blocks.0.mlp.fc2.weight', 'decoder_blocks.0.mlp.fc2.bias', 'decoder_blocks.1.norm1.weight', 'decoder_blocks.1.norm1.bias', 'decoder_blocks.1.attn.qkv.weight', 'decoder_blocks.1.attn.qkv.bias', 'decoder_blocks.1.attn.proj.weight', 'decoder_blocks.1.attn.proj.bias', 'decoder_blocks.1.norm2.weight', 'decoder_blocks.1.norm2.bias', 'decoder_blocks.1.mlp.fc1.weight', 'decoder_blocks.1.mlp.fc1.bias', 'decoder_blocks.1.mlp.fc2.weight', 'decoder_blocks.1.mlp.fc2.bias', 'decoder_blocks.2.norm1.weight', 'decoder_blocks.2.norm1.bias', 'decoder_blocks.2.attn.qkv.weight', 'decoder_blocks.2.attn.qkv.bias', 'decoder_blocks.2.attn.proj.weight', 'decoder_blocks.2.attn.proj.bias', 'decoder_blocks.2.norm2.weight', 'decoder_blocks.2.norm2.bias', 'decoder_blocks.2.mlp.fc1.weight', 'decoder_blocks.2.mlp.fc1.bias', 'decoder_blocks.2.mlp.fc2.weight', 'decoder_blocks.2.mlp.fc2.bias', 'decoder_blocks.3.norm1.weight', 'decoder_blocks.3.norm1.bias', 'decoder_blocks.3.attn.qkv.weight', 'decoder_blocks.3.attn.qkv.bias', 'decoder_blocks.3.attn.proj.weight', 'decoder_blocks.3.attn.proj.bias', 'decoder_blocks.3.norm2.weight', 'decoder_blocks.3.norm2.bias', 'decoder_blocks.3.mlp.fc1.weight', 'decoder_blocks.3.mlp.fc1.bias', 'decoder_blocks.3.mlp.fc2.weight', 'decoder_blocks.3.mlp.fc2.bias', 'decoder_blocks.4.norm1.weight', 'decoder_blocks.4.norm1.bias', 'decoder_blocks.4.attn.qkv.weight', 'decoder_blocks.4.attn.qkv.bias', 'decoder_blocks.4.attn.proj.weight', 'decoder_blocks.4.attn.proj.bias', 'decoder_blocks.4.norm2.weight', 'decoder_blocks.4.norm2.bias', 'decoder_blocks.4.mlp.fc1.weight', 'decoder_blocks.4.mlp.fc1.bias', 'decoder_blocks.4.mlp.fc2.weight', 'decoder_blocks.4.mlp.fc2.bias', 'decoder_blocks.5.norm1.weight', 'decoder_blocks.5.norm1.bias', 'decoder_blocks.5.attn.qkv.weight', 'decoder_blocks.5.attn.qkv.bias', 'decoder_blocks.5.attn.proj.weight', 'decoder_blocks.5.attn.proj.bias', 'decoder_blocks.5.norm2.weight', 'decoder_blocks.5.norm2.bias', 'decoder_blocks.5.mlp.fc1.weight', 'decoder_blocks.5.mlp.fc1.bias', 'decoder_blocks.5.mlp.fc2.weight', 'decoder_blocks.5.mlp.fc2.bias', 'decoder_blocks.6.norm1.weight', 'decoder_blocks.6.norm1.bias', 'decoder_blocks.6.attn.qkv.weight', 'decoder_blocks.6.attn.qkv.bias', 'decoder_blocks.6.attn.proj.weight', 'decoder_blocks.6.attn.proj.bias', 'decoder_blocks.6.norm2.weight', 'decoder_blocks.6.norm2.bias', 'decoder_blocks.6.mlp.fc1.weight', 'decoder_blocks.6.mlp.fc1.bias', 'decoder_blocks.6.mlp.fc2.weight', 'decoder_blocks.6.mlp.fc2.bias', 'decoder_blocks.7.norm1.weight', 'decoder_blocks.7.norm1.bias', 'decoder_blocks.7.attn.qkv.weight', 'decoder_blocks.7.attn.qkv.bias', 'decoder_blocks.7.attn.proj.weight', 'decoder_blocks.7.attn.proj.bias', 'decoder_blocks.7.norm2.weight', 'decoder_blocks.7.norm2.bias', 'decoder_blocks.7.mlp.fc1.weight', 'decoder_blocks.7.mlp.fc1.bias', 'decoder_blocks.7.mlp.fc2.weight', 'decoder_blocks.7.mlp.fc2.bias', 'decoder_norm.weight', 'decoder_norm.bias', 'decoder_pred.weight', 'decoder_pred.bias']
Loaded pretrained weights from './prithvi/models/Prithvi_100M.pt' with partial matching.
Warning: Unexpected keys in the state dict: ['mask_token', 'decoder_pos_embed', 'decoder_embed.weight', 'decoder_embed.bias', 'decoder_blocks.0.norm1.weight', 'decoder_blocks.0.norm1.bias', 'decoder_blocks.0.attn.qkv.weight', 'decoder_blocks.0.attn.qkv.bias', 'decoder_blocks.0.attn.proj.weight', 'decoder_blocks.0.attn.proj.bias', 'decoder_blocks.0.norm2.weight', 'decoder_blocks.0.norm2.bias', 'decoder_blocks.0.mlp.fc1.weight', 'decoder_blocks.0.mlp.fc1.bias', 'decoder_blocks.0.mlp.fc2.weight', 'decoder_blocks.0.mlp.fc2.bias', 'decoder_blocks.1.norm1.weight', 'decoder_blocks.1.norm1.bias', 'decoder_blocks.1.attn.qkv.weight', 'decoder_blocks.1.attn.qkv.bias', 'decoder_blocks.1.attn.proj.weight', 'decoder_blocks.1.attn.proj.bias', 'decoder_blocks.1.norm2.weight', 'decoder_blocks.1.norm2.bias', 'decoder_blocks.1.mlp.fc1.weight', 'decoder_blocks.1.mlp.fc1.bias', 'decoder_blocks.1.mlp.fc2.weight', 'decoder_blocks.1.mlp.fc2.bias', 'decoder_blocks.2.norm1.weight', 'decoder_blocks.2.norm1.bias', 'decoder_blocks.2.attn.qkv.weight', 'decoder_blocks.2.attn.qkv.bias', 'decoder_blocks.2.attn.proj.weight', 'decoder_blocks.2.attn.proj.bias', 'decoder_blocks.2.norm2.weight', 'decoder_blocks.2.norm2.bias', 'decoder_blocks.2.mlp.fc1.weight', 'decoder_blocks.2.mlp.fc1.bias', 'decoder_blocks.2.mlp.fc2.weight', 'decoder_blocks.2.mlp.fc2.bias', 'decoder_blocks.3.norm1.weight', 'decoder_blocks.3.norm1.bias', 'decoder_blocks.3.attn.qkv.weight', 'decoder_blocks.3.attn.qkv.bias', 'decoder_blocks.3.attn.proj.weight', 'decoder_blocks.3.attn.proj.bias', 'decoder_blocks.3.norm2.weight', 'decoder_blocks.3.norm2.bias', 'decoder_blocks.3.mlp.fc1.weight', 'decoder_blocks.3.mlp.fc1.bias', 'decoder_blocks.3.mlp.fc2.weight', 'decoder_blocks.3.mlp.fc2.bias', 'decoder_blocks.4.norm1.weight', 'decoder_blocks.4.norm1.bias', 'decoder_blocks.4.attn.qkv.weight', 'decoder_blocks.4.attn.qkv.bias', 'decoder_blocks.4.attn.proj.weight', 'decoder_blocks.4.attn.proj.bias', 'decoder_blocks.4.norm2.weight', 'decoder_blocks.4.norm2.bias', 'decoder_blocks.4.mlp.fc1.weight', 'decoder_blocks.4.mlp.fc1.bias', 'decoder_blocks.4.mlp.fc2.weight', 'decoder_blocks.4.mlp.fc2.bias', 'decoder_blocks.5.norm1.weight', 'decoder_blocks.5.norm1.bias', 'decoder_blocks.5.attn.qkv.weight', 'decoder_blocks.5.attn.qkv.bias', 'decoder_blocks.5.attn.proj.weight', 'decoder_blocks.5.attn.proj.bias', 'decoder_blocks.5.norm2.weight', 'decoder_blocks.5.norm2.bias', 'decoder_blocks.5.mlp.fc1.weight', 'decoder_blocks.5.mlp.fc1.bias', 'decoder_blocks.5.mlp.fc2.weight', 'decoder_blocks.5.mlp.fc2.bias', 'decoder_blocks.6.norm1.weight', 'decoder_blocks.6.norm1.bias', 'decoder_blocks.6.attn.qkv.weight', 'decoder_blocks.6.attn.qkv.bias', 'decoder_blocks.6.attn.proj.weight', 'decoder_blocks.6.attn.proj.bias', 'decoder_blocks.6.norm2.weight', 'decoder_blocks.6.norm2.bias', 'decoder_blocks.6.mlp.fc1.weight', 'decoder_blocks.6.mlp.fc1.bias', 'decoder_blocks.6.mlp.fc2.weight', 'decoder_blocks.6.mlp.fc2.bias', 'decoder_blocks.7.norm1.weight', 'decoder_blocks.7.norm1.bias', 'decoder_blocks.7.attn.qkv.weight', 'decoder_blocks.7.attn.qkv.bias', 'decoder_blocks.7.attn.proj.weight', 'decoder_blocks.7.attn.proj.bias', 'decoder_blocks.7.norm2.weight', 'decoder_blocks.7.norm2.bias', 'decoder_blocks.7.mlp.fc1.weight', 'decoder_blocks.7.mlp.fc1.bias', 'decoder_blocks.7.mlp.fc2.weight', 'decoder_blocks.7.mlp.fc2.bias', 'decoder_norm.weight', 'decoder_norm.bias', 'decoder_pred.weight', 'decoder_pred.bias']
Loaded pretrained weights from './prithvi/models/Prithvi_100M.pt' with partial matching.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3080') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
wandb: Currently logged in as: crop-classification. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.17.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in ./wandb/run-20240531_131825-gdyk07d2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run faithful-bee-232
wandb: ⭐️ View project at https://wandb.ai/crop-classification/messis
wandb: 🚀 View run at https://wandb.ai/crop-classification/messis/runs/gdyk07d2
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: 4351562648022572566.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75b39977a897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x75b34d27b1b2 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x75b34d27ffd0 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x75b34d28131c in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x75b398cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x75b39c494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x75b39c526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75b39977a897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x75b34d27b1b2 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x75b34d27ffd0 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x75b34d28131c in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x75b398cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x75b39c494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x75b39c526850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75b39977a897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x75b34cf03e33 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x75b398cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x75b39c494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x75b39c526850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7292902cf897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x729243e7b1b2 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x729243e7ffd0 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x729243e8131c in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x72928f8dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x729292e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x729292f26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7292902cf897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x729243e7b1b2 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x729243e7ffd0 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x729243e8131c in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x72928f8dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x729292e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x729292f26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7292902cf897 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x729243b03e33 in /mnt/nas05/clusterdata01/home2/yvo/.cache/pypoetry/virtualenvs/messis-sGO7Z3bZ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x72928f8dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x729292e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x729292f26850 in /lib/x86_64-linux-gnu/libc.so.6)
srun: error: gpu22a: task 0: Aborted (core dumped)
srun: error: gpu22a: task 1: Aborted (core dumped)
```
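Per the log, the hang happens right after the process group is set up (rank 1 times out on its very first BROADCAST, rank 0 on an early ALLREDUCE), i.e. before any training step runs. For context, here is a minimal, self-contained sketch of the kind of trainer configuration the log corresponds to (1 node, 2 GPUs, DDP over NCCL, 16-bit AMP, 2 data loader workers). It assumes PyTorch Lightning 2.x; the toy module, random data, and class name are placeholders, not the project's actual `model_training.py`.

```python
# Sketch only: reproduces the launch configuration implied by the log above,
# not the project's real model or data pipeline.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.loggers import WandbLogger


class ToyModule(pl.LightningModule):
    """Placeholder LightningModule standing in for the real Prithvi-based model."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    train_loader = DataLoader(dataset, batch_size=16, num_workers=2)  # 2 data loader workers

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,               # 1 node, 2 GPUs -> 2 DDP processes over NCCL
        num_nodes=1,
        strategy="ddp",
        precision="16-mixed",    # matches "Using 16bit Automatic Mixed Precision (AMP)"
        max_epochs=1,
        logger=WandbLogger(project="messis", offline=True),  # offline so the sketch runs without a wandb login
    )
    trainer.fit(ToyModule(), train_loader)
```

With this setup, Lightning spawns one process per GPU and initializes the NCCL process group exactly as shown in the "Initializing distributed" lines of the log.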