TMElyralab / MusePose

MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation

Low inference speed: stage-2 inference is abnormal, stuck for a long time with no progress #6

Closed wing158 closed 1 month ago

wing158 commented 1 month ago

During inference I was prompted to put config.json and the model into MusePose/pretrained_weights/sd-image-variations-diffusers/unet. After moving them under unet, the run gets stuck for a long time with no progress:

```
root@153a7e76ceb5:~/MusePose# python test_stage_2.py --config ./configs/test_stage_2.yaml
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.get(instance, owner)()
handle=== ./assets/images/ref.png ./assets/poses/align/img_ref_video_dance.mp4
pose video has 288 frames, with 24 fps
processing length: 144
fps 12
/root/MusePose/musepose/pipelines/pipeline_pose2vid_long.py:406: FutureWarning: Accessing config attribute `in_channels` directly via 'UNet3DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet3DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
  num_channels_latents = self.denoising_unet.in_channels
  0%|          | 0/20 [00:00<?, ?it/s]
```

wing158 commented 1 month ago

```
num_channels_latents = self.denoising_unet.in_channels
 35%|█████████████████████████████ | 7/20 [1:43:35<3:13:43, 894.13s/it]
```

This is the speed on a 3090 with 24 GB. Can it be optimized, like MuseTalk?
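For scale, the tqdm rate above implies a multi-hour run for a single clip. A quick projection, using the numbers taken directly from the progress bar:

```python
# Projected total sampling time from the rate shown in the progress bar:
# 894.13 s/it over 20 DDIM steps.
seconds_per_step = 894.13
steps = 20
total_hours = seconds_per_step * steps / 3600
print(f"{total_hours:.1f} hours")  # about 5 hours for all 20 steps
```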

mikeyimer commented 1 month ago

I am also in this situation with MusePose.

czk32611 commented 1 month ago
(screenshot)

This is my inference speed on V100

wing158 commented 1 month ago

My guess is that this is related to the 24 GB of VRAM being fully used and the GPU sitting at 100% utilization; when both resources are saturated, the program becomes extremely slow.

TZYSJTU commented 1 month ago

Hello, guys! Thanks for your support. Here is a suggestion that may help with the low inference speed: try reducing the running resolution from 768x768x48 to 512x512x48.

- 512x512x48 takes 16 GB of VRAM.
- 768x768x48 takes 28 GB of VRAM.

(I don't know why it runs at 768x768x48 at all on a 3090 with 24 GB. It may be using a space-time tradeoff, which leads to low speed.)
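A sketch of that change, assuming the resolution is read from `./configs/test_stage_2.yaml` and that the key names mirror the values the script prints at startup (the exact schema may differ; check the actual file):

```yaml
# Hypothetical key names -- verify against your test_stage_2.yaml.
width: 512    # was 768
height: 512   # was 768
```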

Besides, we also report our running speed for a 10-second video:

- 512x512x48 on V100: 5 minutes
- 768x768x48 on V100: 16 minutes
- 512x512x48 on H800: 1 minute
- 768x768x48 on H800: 3 minutes
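A quick consistency check on the V100 numbers above: dropping from 768x768 to 512x512 buys roughly a 3x speedup, somewhat more than the (768/512)^2 = 2.25x reduction in latent pixels, which is plausible since attention cost grows faster than linearly in token count.

```python
# Reported V100 times (minutes) for a 10-second video at each resolution.
v100_minutes = {"768x768x48": 16, "512x512x48": 5}
speedup = v100_minutes["768x768x48"] / v100_minutes["512x512x48"]
pixel_ratio = (768 / 512) ** 2  # ratio of latent pixels per frame
print(speedup, pixel_ratio)  # 3.2 2.25
```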

pointave commented 1 month ago

I was getting 360s/it on a 4090 with default size.

Dawgmastah commented 1 month ago

Are there any parameters that could be changed to optimize VRAM a bit so 768x768 fits on 24GB?
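One back-of-envelope reason 768x768x48 overflows 24 GB: a single full-resolution spatial self-attention map across all frames is already several gigabytes. This is a rough sketch; fp16 storage, a single attention head, and 8x VAE downsampling are all assumed figures, not measurements of MusePose itself.

```python
# Rough size of one fp16 spatial self-attention map across 48 frames,
# assuming 8x VAE downsampling (768 -> 96, so 96*96 = 9216 tokens/frame).
frames = 48
tokens = (768 // 8) ** 2   # 9216 latent tokens per frame
bytes_fp16 = 2
attn_gib = frames * tokens**2 * bytes_fp16 / 1024**3
print(f"{attn_gib:.1f} GiB")  # ~7.6 GiB for a single attention head
```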

wing158 commented 1 month ago

Setting the resolution parameters to 512 works fine.