Wangt-CN / DisCo

[CVPR2024] DisCo: Referring Human Dance Generation in Real World
https://disco-dance.github.io/
Apache License 2.0

Question about the multi-gpu running: 'mpirun -np ...' #77

Closed. CHNxindong closed this issue 10 months ago.

CHNxindong commented 11 months ago

Thanks for your great work! I ran into a problem with multi-GPU training: with a single GPU the code runs normally, but when launching with multiple GPUs via 'mpirun -np 2 python finetune_sdm_yaml.py ...', all of the processes end up on GPU 0 (see the attached screenshot).

This looks like the same problem as #48, but the method suggested in that issue did not solve it for me.

P.S. I noticed that the code goes into the following branch of utils/dist.py (see the attached screenshot):
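
A quick way to check whether the launcher is exporting per-process rank information at all is to print OpenMPI's rank variables. This is only a sketch, assuming that utils/dist.py derives the device from these variables (e.g. via torch.cuda.set_device); if every process reports local rank 0, or the variables are empty, all ranks will fall back to cuda:0.

# each process should print a distinct local rank; empty output means OpenMPI
# is not exporting the variables a dist setup of this kind would rely on
mpirun -np 2 bash -c 'echo "rank=$OMPI_COMM_WORLD_RANK local_rank=$OMPI_COMM_WORLD_LOCAL_RANK"'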

MingfuYAN commented 11 months ago

You can try installing OpenMPI with sudo apt install openmpi-bin openmpi-common libopenmpi-dev, then run ompi_info --parsable --all | grep cuda to check whether the build has CUDA support.
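
For reference, a CUDA-aware OpenMPI build should report the cuda-support parameter as true; illustrative output below (the exact set of parameter lines varies between OpenMPI versions):

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# expected on a CUDA-aware build:
# mca:mpi:base:param:mpi_built_with_cuda_support:value:true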

CHNxindong commented 11 months ago

@MingfuYAN Thanks for your reply! I will try the method you mentioned.

On the other hand, I also tried another method:

  1. Add parser.add_argument('--local_rank', default=-1) to args.py
  2. Use the DDP startup command instead of mpirun (see the attached screenshot)

With these steps the code runs on multiple GPUs, but I am not sure whether this approach is correct; I am not very familiar with using mpirun together with DDP.
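
For context, a DDP-style launch of this kind might look roughly like the sketch below (the exact command in the screenshot is not visible, so the port and trailing flags are placeholders). torch.distributed.launch passes --local_rank to every worker process, which is why the extra argparse entry is needed, whereas torchrun exports LOCAL_RANK as an environment variable instead.

# illustrative only; port and remaining flags are placeholders
python -m torch.distributed.launch --nproc_per_node=2 --master_port 10090 \
  finetune_sdm_yaml.py \
  --cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
  --do_train ...   # remaining flags as in the single-GPU run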

wyyfffff commented 11 months ago

Hello @CHNxindong, I just use:

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 \
deepspeed --include localhost:0,1,2,3 --master_port 10090 \
finetune_sdm_yaml.py \
--cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
--do_train \
--root_dir /home/yifan/workspace/diffusion/DisCo/disco_pretrain_weight \
--local_train_batch_size 256 \
--local_eval_batch_size 128 \
--log_dir exp/tiktok_finuetune/10-29 \
--epochs 90 --deepspeed \
--eval_step 5000 \
--save_step 5000 \
--gradient_accumulate_steps 1 \
--learning_rate 2e-4 \
--fix_dist_seed \
--loss_target "noise" \
--train_yaml disco_datasets/TSV_dataset/composite_offset/train_TiktokDance-poses-masks.yaml \
--val_yaml disco_datasets/TSV_dataset/composite_offset/new10val_TiktokDance-poses-masks.yaml \
--unet_unfreeze_type "all" \
--refer_sdvae \
--ref_null_caption False \
--combine_clip_local \
--combine_use_mask \
--conds "poses" "masks" \
--stage1_pretrain_path /home/data/yifan/disco_pretrain_weight/human_attribute_pretrained_model.pt

Maybe you can try it, or use torchrun instead.
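
An equivalent torchrun launch might look like the untested sketch below; the flags simply mirror the deepspeed command above, and torchrun supplies RANK/LOCAL_RANK/WORLD_SIZE to each worker through environment variables:

# illustrative torchrun launch mirroring the deepspeed command above
AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 \
torchrun --nproc_per_node=4 --master_port 10090 \
  finetune_sdm_yaml.py \
  --cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
  --do_train ...   # remaining flags unchanged from the command above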

CHNxindong commented 11 months ago

Thanks for your reply! I will try the method you mentioned!

CHNxindong commented 11 months ago

Hi @wyyfffff, I tried the method you mentioned, but hit the following problem when using more than 2 GPUs (see the attached screenshot): RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

Note that with one or two GPUs it runs normally; the error only appears with more than two GPUs.

Could you please kindly give some suggestions? Thanks.
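
Not from this thread, but two generic first steps for asynchronous CUDA/cuBLAS errors like this one are to rerun with synchronous kernel launches, so the traceback points at the call that actually fails, and to inspect how the GPUs are interconnected:

# surface the error at the failing call instead of a later asynchronous point
CUDA_LAUNCH_BLOCKING=1 deepspeed --include localhost:0,1,2 --master_port 10090 finetune_sdm_yaml.py ...
# show the GPU-to-GPU interconnect topology (NVLink, PIX, PHB, ...)
nvidia-smi topo -m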

wyyfffff commented 10 months ago

@CHNxindong Sorry, I have not run into this problem :(

CHNxindong commented 10 months ago

Thanks anyway!