TencentARC / T2I-Adapter


torch.nn.parallel.DistributedDataParallel hang on #90

Open Crd1140234468 opened 1 year ago

Crd1140234468 commented 1 year ago

I encountered a 'torch.nn.parallel.DistributedDataParallel hangs' problem when I run train_depth.py. I found that the program never gets past the statement "dist._verify_model_across_ranks" (screenshot attached). How can I solve this problem?
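For reference, here is a minimal sketch of the point where the constructor blocks, assuming the standard torch.distributed.launch setup; the model below is just a stand-in, not the repo's actual training code:

```python
# Minimal sketch of where DDP construction can hang; the model is a
# stand-in, not the actual T2I-Adapter code.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

model = nn.Linear(16, 16).cuda(args.local_rank)  # stand-in for the real model

# The DDP constructor synchronizes across all ranks (in older PyTorch this is
# where dist._verify_model_across_ranks is called). If any rank never reaches
# this line, or the ranks disagree about the parameters, every rank blocks here.
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
```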

Crd1140234468 commented 1 year ago

dist._verify_model_across_ranks is a function inside torch, not something defined in this repo.

Crd1140234468 commented 1 year ago

Also, here's the problem I'm having with multiple GPUs

MC-E commented 1 year ago

what's the command you run?

Crd1140234468 commented 1 year ago

> what's the command you run?

CUDA_VISIBLE_DEVICES=1,3 python -m torch.distributed.launch --nproc_per_node=2 --master_port 8888 test11.py --bsize=8
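To isolate the issue, a minimal all_reduce script launched the same way (a hypothetical ddp_sanity.py, not part of the repo) can confirm whether the two visible GPUs can rendezvous at all:

```python
# Hypothetical ddp_sanity.py: launch it with the same command as above to
# check that the two visible GPUs can communicate before involving the model.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# If this all_reduce also hangs, the problem is NCCL / inter-GPU communication
# (e.g. try running with NCCL_P2P_DISABLE=1), not DistributedDataParallel itself.
t = torch.ones(1, device=f"cuda:{args.local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")  # expect 2.0 with 2 processes
```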

Crd1140234468 commented 1 year ago

> what's the command you run?

Currently, model_ad can be wrapped in torch.nn.parallel.DistributedDataParallel, but when the model loaded from sd-v1-4.ckpt is wrapped in torch.nn.parallel.DistributedDataParallel, it gets stuck there.
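One workaround that may apply here, assuming the SD weights stay frozen and only the adapter is trained (an assumption about this training setup, not something confirmed in the thread): wrap only the adapter in DDP and keep the frozen SD model as a plain module on each rank, so its checkpoint never goes through the DDP constructor at all. load_sd_model below is a hypothetical stand-in for however train_depth.py actually builds the model from the checkpoint:

```python
import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Frozen Stable Diffusion model: loaded on every rank, never wrapped in DDP.
# load_sd_model is a hypothetical stand-in for the script's actual loading code.
sd_model = load_sd_model("sd-v1-4.ckpt")
sd_model.requires_grad_(False)
sd_model.eval()
sd_model = sd_model.cuda(local_rank)

# Trainable adapter (model_ad, built elsewhere in the script): the only part
# that needs gradient synchronization, so it is the only part wrapped in DDP.
model_ad = model_ad.cuda(local_rank)
model_ad = DDP(model_ad, device_ids=[local_rank], output_device=local_rank)
```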