Closed wly-ai-bj closed 2 months ago
Command run:
accelerate launch --num_processes=8 train.py
Default arguments: --config ./configs/train/musev_referencenet_train_template.yaml
Error message:
Loaded scheduler as PNDMScheduler from `scheduler` subfolder of ./checkpoints/t2i/sd1.5/fantasticmix_v10.
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 97.66it/s]
self.controlnet=None of type <class 'NoneType'> cannot be saved.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=31, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805686 milliseconds before timing out.
Traceback (most recent call last):
  File "MuseV_train/train.py", line 2136, in <module>
    main(**config)
  File "MuseV_train/train.py", line 1407, in main
    log_validation(
  File "MuseV_train/train.py", line 266, in log_validation
    sd_predictor = DiffusersPipelinePredictor(
  File "MuseV_train/musev/pipelines/pipeline_controlnet_predictor.py", line 165, in __init__
    controlnet, controlnet_processor, processor_params = load_controlnet_model(
  File "MuseV/MMCM/mmcm/vision/feature_extractor/controlnet.py", line 856, in load_controlnet_model
    if need_controlnet_processor:
  File "MMCM/mmcm/vision/feature_extractor/controlnet.py", line 71, in __init__
    det_ckpt = "checkpoints/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth"
  File "controlnet_aux/src/controlnet_aux/dwpose/__init__.py", line 141, in __init__
    self.pose_estimation = Wholebody(
  File "controlnet_aux/src/controlnet_aux/dwpose/wholebody.py", line 57, in __init__
    self.pose_estimator = init_pose_estimator(
  File "miniconda3/envs/musev/lib/python3.10/site-packages/mmpose/apis/inference.py", line 110, in init_model
    ckpt = load_checkpoint(model, checkpoint, map_location='cpu')
  File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 636, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 548, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 330, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "miniconda3/envs/musev/lib/python3.10/site-packages/mmengine/runner/checkpoint.py", line 377, in load_from_http
    torch.distributed.barrier()
  File "miniconda3/envs/musev/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3335, in barrier
    work.wait()
KeyboardInterrupt
Steps: : 0it [30:32, ?it/s]
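
The traceback ends in mmengine's `load_from_http`, which calls `torch.distributed.barrier()`, and it is reached from `log_validation`. One possible reading, which the log alone does not confirm, is that validation runs on a single rank, so the barrier waits for ranks that never arrive until the 30-minute NCCL timeout fires. If that is the cause, a workaround might be to download the pose checkpoint ahead of time and pass a local path, so the HTTP loader is never used. Below is a minimal sketch of such a path substitution. The function name `resolve_local_checkpoint` and the `cache_dir` default are hypothetical and not part of MuseV, MMCM, or mmengine.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_local_checkpoint(ckpt: str, cache_dir: str = "./checkpoints") -> str:
    """Hypothetical helper: prefer a pre-downloaded copy of a checkpoint URL.

    If `ckpt` is an http(s) URL and a file with the same base name already
    exists in `cache_dir`, return that local path. Otherwise return `ckpt`
    unchanged, so the caller's behaviour stays the same.
    """
    parsed = urlparse(ckpt)
    if parsed.scheme not in ("http", "https"):
        # Already a local path (or some other scheme): leave it alone.
        return ckpt
    local = Path(cache_dir) / Path(parsed.path).name
    return str(local) if local.is_file() else ckpt
```

The returned string would then be passed wherever the pose-estimator checkpoint is configured. Where that setting lives in the MuseV/MMCM code is not shown in the log above, so it would need to be located first.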