Closed: CHNxindong closed this issue 10 months ago.
You can try this command:
sudo apt install openmpi-bin openmpi-common libopenmpi-dev
and then run
ompi_info --parsable --all | grep cuda
to see if it works.
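For reference, here is a minimal Python sketch of the same check, assuming ompi_info is on PATH; the exact parameter name containing "cuda_support" varies between Open MPI releases, so the string match below is an assumption:

import subprocess

def openmpi_has_cuda_support() -> bool:
    # Run the same ompi_info query as above and scan its output.
    out = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Expect a line roughly like "...built_with_cuda_support:value:true"
    # (the exact prefix differs between Open MPI versions -- assumption).
    return any("cuda_support:value:true" in line for line in out.splitlines())

if __name__ == "__main__":
    print("CUDA-aware Open MPI:", openmpi_has_cuda_support())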
@mingfuyan Thanks for your reply! I will try the method you mentioned.
On the other hand, I also tried another method:
With the above steps, the code can run on multiple GPUs, but I do not know whether this method is correct. Actually, I am not familiar with using mpirun with DDP.
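For context on mpirun with DDP: each Open MPI rank has to bind itself to its own GPU before doing any CUDA work, otherwise all ranks land on GPU 0. Below is a minimal sketch of that setup, assuming a single-node run and Open MPI's OMPI_COMM_WORLD_* environment variables; it is only an illustration, not the repo's utils/dist.py:

import os
import torch
import torch.distributed as dist

# Launched e.g. with: mpirun -np 4 python train.py  (train.py is a placeholder name)
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# Bind this process to its own GPU before creating any CUDA tensors;
# without this every rank defaults to cuda:0.
torch.cuda.set_device(local_rank)

# torch.distributed still needs a rendezvous point; mpirun does not set
# MASTER_ADDR/MASTER_PORT, so placeholder values are provided here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
print(f"rank {rank}/{world_size} running on cuda:{torch.cuda.current_device()}")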
Hello @CHNxindong, I just use:
AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 \
deepspeed --include localhost:0,1,2,3 --master_port 10090 \
finetune_sdm_yaml.py \
--cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
--do_train \
--root_dir /home/yifan/workspace/diffusion/DisCo/disco_pretrain_weight \
--local_train_batch_size 256 \
--local_eval_batch_size 128 \
--log_dir exp/tiktok_finuetune/10-29 \
--epochs 90 --deepspeed \
--eval_step 5000 \
--save_step 5000 \
--gradient_accumulate_steps 1 \
--learning_rate 2e-4 \
--fix_dist_seed \
--loss_target "noise" \
--train_yaml disco_datasets/TSV_dataset/composite_offset/train_TiktokDance-poses-masks.yaml \
--val_yaml disco_datasets/TSV_dataset/composite_offset/new10val_TiktokDance-poses-masks.yaml \
--unet_unfreeze_type "all" \
--refer_sdvae \
--ref_null_caption False \
--combine_clip_local \
--combine_use_mask \
--conds "poses" "masks" \
--stage1_pretrain_path /home/data/yifan/disco_pretrain_weight/human_attribute_pretrained_model.pt
Maybe you can try it, or use torchrun instead.
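If you go the torchrun route, note that torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the MASTER_* variables itself. Here is a minimal sketch of the script side, assuming the training script reads these standard variables (which may differ from how finetune_sdm_yaml.py actually handles them):

import os
import torch
import torch.distributed as dist

# Launched e.g. with: torchrun --nproc_per_node=4 --master_port=10090 train.py
# (train.py is a placeholder name for the entry script.)
local_rank = int(os.environ["LOCAL_RANK"])  # set per process by torchrun
torch.cuda.set_device(local_rank)

# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already in the
# environment, so no explicit rank/world_size arguments are needed.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()}/{dist.get_world_size()} on cuda:{local_rank}")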
Thanks for your reply! I will try the method you mentioned!
Hi @wyyfffff, I tried the method you mentioned but ran into the following problem when using more than 2 GPUs:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
Note that with one or two GPUs it runs normally; the problem only appears with more than 2 GPUs.
Could you please give some suggestions? Thanks.
@CHNxindong Sorry, I have not met this problem :(
Thanks anyway!
Thanks for your great work! When I run the code with multiple GPUs, I met a problem: with a single GPU the code runs normally, but when running with multiple GPUs using the command 'mpirun -np 2 python finetune_sdm_yaml.py ...', all processes end up on GPU 0.
I found that the problem I met is the same as #48. However, I tried the method mentioned in that issue, but it did not solve the problem.
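One way to narrow down the "all processes on GPU 0" symptom is a tiny diagnostic launched the same way (e.g. mpirun -np 2 python check_ranks.py); check_ranks.py is a hypothetical helper name, not part of the DisCo repo:

# check_ranks.py -- hypothetical helper, not part of the DisCo repo.
import os
import torch

rank = os.environ.get("OMPI_COMM_WORLD_RANK", "unset")
local_rank = os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "unset")
print(
    f"pid={os.getpid()} rank={rank} local_rank={local_rank} "
    f"visible_gpus={torch.cuda.device_count()} "
    f"current_device={torch.cuda.current_device()}"
)

If every process reports current_device=0, or rank/local_rank come back as "unset", the launcher's rank information is either not reaching the processes or not being used to select a device, which would match the symptom above.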
PS: I noticed that the code runs into the following part in utils/dist.py: