TMElyralab / MuseV

MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising

Multi-GPU training fails #102

Closed wly-ai-bj closed 2 months ago

wly-ai-bj commented 2 months ago

Command:

accelerate launch --num_processes=8 train.py

Default argument: --config ./configs/train/musev_referencenet_train_template.yaml

Error message:

Loaded scheduler as PNDMScheduler from `scheduler` subfolder of ./checkpoints/t2i/sd1.5/fantasticmix_v10.

Loading pipeline components...: 100%|████████████████████| 5/5 [00:00<00:00, 97.66it/s]

self.controlnet=None of type <class 'NoneType'> cannot be saved.

[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=31, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805686 milliseconds before timing out.

Traceback (most recent call last):

File "MuseV_train/train.py", line 2136, in <module>

main(**config)

File "MuseV_train/train.py", line 1407, in main

log_validation(

File "MuseV_train/train.py", line 266, in log_validation

sd_predictor = DiffusersPipelinePredictor(

File "MuseV_train/musev/pipelines/pipeline_controlnet_predictor.py", line 165, in __init__

controlnet, controlnet_processor, processor_params = load_controlnet_model(

File "MuseV/MMCM/mmcm/vision/feature_extractor/controlnet.py", line 856, in load_controlnet_model

if need_controlnet_processor:

File "MMCM/mmcm/vision/feature_extractor/controlnet.py", line 71, in __init__

det_ckpt = "checkpoints/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth"

File "controlnet_aux/src/controlnet_aux/dwpose/__init__.py", line 141, in __init__

self.pose_estimation = Wholebody(

File "controlnet_aux/src/controlnet_aux/dwpose/wholebody.py", line 57, in __init__

self.pose_estimator = init_pose_estimator(

File "miniconda3/envs/musev/lib/python3.10/site-packages/mmpose/apis/inference.py", line 110, in init_model

ckpt = load_checkpoint(model, checkpoint, map_location='cpu')

File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 636, in load_checkpoint

checkpoint = _load_checkpoint(filename, map_location, logger)

File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 548, in _load_checkpoint

return CheckpointLoader.load_checkpoint(filename, map_location, logger)

File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 330, in load_checkpoint

return checkpoint_loader(filename, map_location)

File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 377, in load_from_http

torch.distributed.barrier()

File "miniconda3/envs/musev/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3335, in barrier

work.wait()

KeyboardInterrupt

Steps: : 0it [30:32, ?it/s]
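
The traceback ends in `torch.distributed.barrier()` inside mmengine's `load_from_http`, reached from `log_validation`, and the watchdog reports a collective that timed out after 30 minutes. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that only some ranks entered the validation path, so the barrier never saw all participants. The sketch below does not use torch or NCCL. It uses Python's standard `threading.Barrier` purely to illustrate that pattern, and `run_ranks` is a hypothetical helper name.

```python
import threading


def run_ranks(world_size, ranks_reaching_barrier, timeout_s=0.5):
    """Simulate each rank as a thread and report what happened at the barrier.

    Returns a dict mapping rank -> "passed", "timed out", or "skipped".
    This only mimics the waiting behaviour; it is not how NCCL works internally.
    """
    barrier = threading.Barrier(world_size)  # expects every rank to arrive
    results = {}
    lock = threading.Lock()

    def worker(rank):
        if rank not in ranks_reaching_barrier:
            # This rank never takes the code path that contains the barrier.
            outcome = "skipped"
        else:
            try:
                barrier.wait(timeout=timeout_s)
                outcome = "passed"
            except threading.BrokenBarrierError:
                # Raised when the wait times out or the barrier is broken.
                outcome = "timed out"
        with lock:
            results[rank] = outcome

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Only "rank 0" reaches the barrier: it waits until the timeout expires.
    print(run_ranks(4, {0}))
    # Every rank reaches the barrier: all of them pass.
    print(run_ranks(4, {0, 1, 2, 3}))
```

If this reading applies, the place to check would be whether `log_validation` runs on every process or only on the main one, and whether the checkpoint referenced at line 71 of `controlnet.py` is available locally so that no download path with a barrier is taken.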