I don't think DeepSpeed is parallelizing training across the 3 GPUs I have on a single node. The following is what I see with just 1 GPU:
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 77843
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type          | Params | Mode
------------------------------------------
0 | model | AlphaFold     | 93.2 M | train
1 | loss  | AlphaFoldLoss | 0      | train
------------------------------------------
93.2 M    Trainable params
0         Non-trainable params
93.2 M    Total params
372.895   Total estimated model params size (MB)

[2024-09-07 19:12:31,188] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Trainer device info:
  Number of devices: 1
  Number of nodes: 1
  Parallel devices: <pytorch_lightning.strategies.deepspeed.DeepSpeedStrategy object at 0x14eb6ee26b30>
Epoch 0: 100%|██████████| 5/5 [04:02<00:00, 0.02it/s, v_num=nkwx, train/loss=45.60]
```
The following is what I see with 3 GPUs:

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 77843
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name  | Type          | Params | Mode
------------------------------------------
0 | model | AlphaFold     | 93.2 M | train
1 | loss  | AlphaFoldLoss | 0      | train
------------------------------------------
93.2 M    Trainable params
0         Non-trainable params
93.2 M    Total params
372.895   Total estimated model params size (MB)

[2024-09-07 19:20:58,134] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Trainer device info:
  Number of devices: 3
  Number of nodes: 1
  Parallel devices: <pytorch_lightning.strategies.deepspeed.DeepSpeedStrategy object at 0x1467e5e0ab60>
Epoch 0: 100%|██████████| 5/5 [04:03<00:00, 0.02it/s, v_num=00a4, train/loss=45.60]
```
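Note that in both runs DeepSpeed initializes only a single member (`GLOBAL_RANK: 0, MEMBER: 1/1`), and the per-step losses (61.80, 47.70, 232.0, 52.10, 45.60) and the ~4-minute epoch time are identical, so the extra two GPUs are clearly idle. As a sanity check, here is a minimal sketch (my own addition, assuming the script has reached a point where Lightning has initialized torch.distributed) that prints how many processes actually joined the job:

```python
import torch
import torch.distributed as dist

# Sanity check (a sketch): call this after the Trainer/strategy is set up.
# With genuine 3-GPU data parallelism it should print world_size=3, once
# from each rank; in the runs above only one process exists.
def report_world():
    if dist.is_available() and dist.is_initialized():
        print(f"rank={dist.get_rank()} "
              f"device=cuda:{torch.cuda.current_device()} "
              f"world_size={dist.get_world_size()}")
    else:
        print("torch.distributed not initialized -> single-process run")

report_world()
```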
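(As an aside, unrelated to the scaling question: the Tensor Core warning that appears in both logs can be addressed by setting the float32 matmul precision once near the top of the training script, e.g.:)

```python
import torch

# Trades a little float32 matmul precision for Tensor Core throughput
# on the A100s, as the Lightning warning above suggests.
torch.set_float32_matmul_precision("high")
```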
The following are my input arguments for 3-GPU training:

```bash
python3 train_openfold.py $DATA_DIR/mmcif_dir/train $DATA_DIR/alignment_dir/train \
    $TEMPLATE_MMCIF_DIR/mmcif_files $DATA_DIR/output_dir 2021-09-30 \
    --config_preset model_1_multimer_v3 \
    --template_release_dates_cache_path $CACHE_DIR/template_mmcif_cache.json \
    --seed 77843 \
    --obsolete_pdbs_file_path $TEMPLATE_MMCIF_DIR/obsolete.dat \
    --num_nodes 1 \
    --resume_from_jax_params $CHECKPOINT_PATH \
    --resume_model_weights_only False \
    --train_mmcif_data_cache_path $CACHE_DIR/train_mmcif_cache.json \
    --val_mmcif_data_cache_path $CACHE_DIR/val_mmcif_cache.json \
    --val_data_dir $DATA_DIR/mmcif_dir/val \
    --val_alignment_dir $DATA_DIR/alignment_dir/val \
    --gpus 3 \
    --train_epoch_len 5 \
    --max_epochs 1 \
    --checkpoint_every_epoch \
    --precision 32 \
    --num_sanity_val_steps 0 \
    --log_performance False \
    --wandb \
    --log_every_n_steps 1 \
    --log_lr \
    --wandb_project openfold_training \
    --experiment_name full_train \
    --mpi_plugin \
    --deepspeed_config_path ./deepspeed_config.json
```
Is there a fix for this? Am I doing something wrong?
Update: I was able to fix this by launching with torchrun and setting the number of processes per node equal to the number of GPUs present, something like this:

```bash
torchrun --nproc_per_node=3 train_openfold.py .........
```
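My understanding of why this works (not verified against the OpenFold launcher code, so treat it as an assumption): a plain python3 invocation starts a single process, and with the --mpi_plugin cluster environment Lightning apparently expects an external launcher to create the worker processes, which matches the `MEMBER: 1/1` line above. torchrun starts one process per GPU and exports the rendezvous environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that Lightning and DeepSpeed read during distributed initialization, so each of the three processes binds to its own device.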