How do I run on multiple GPUs? I can't find a parameter for it; image generation runs only on the first GPU and is very slow

Tencent / HunyuanDiT

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

https://dit.hunyuan.tencent.com/

How do I run on multiple GPUs? I can't find a parameter for it; image generation runs only on the first GPU and is very slow #36

Open gavid0124 opened 4 months ago

gavid0124 commented 4 months ago

As the title says. Selecting multiple cards through CUDA_VISIBLE_DEVICES has no effect. The command I run is: CUDA_VISIBLE_DEVICES=1,2 python app/hydit_app.py --no-enhance
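For context on why this command ends up on a single card: `CUDA_VISIBLE_DEVICES` only controls which physical GPUs a process can see and renumbers them from 0. It does not distribute work across them. A minimal sketch of that renumbering in plain Python follows; the helper `visible_device_map` is hypothetical, written only for illustration, and handles integer indices only (not GPU UUIDs).

```python
from typing import Dict


def visible_device_map(cuda_visible_devices: str) -> Dict[int, int]:
    """Map logical CUDA index -> physical GPU index for a comma-separated list."""
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return {logical: phys for logical, phys in enumerate(physical)}


mapping = visible_device_map("1,2")
print(mapping)  # {0: 1, 1: 2}
```

With `CUDA_VISIBLE_DEVICES=1,2`, a script that places its model on the default device `cuda:0` runs on physical GPU 1, and physical GPU 2 stays idle unless the code explicitly uses `cuda:1`.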

Jiangfeng-Xiong commented 4 months ago

Hello, generating a single image does not currently support splitting the model across multiple cards for inference.
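Since a single image cannot be split across cards, a common workaround when several images are needed is to run one independent process per GPU, each pinned with its own `CUDA_VISIBLE_DEVICES`. The sketch below is an illustration under assumptions, not something the maintainers describe: `shard_prompts`, `launch_per_gpu`, and the stand-in worker command are hypothetical, and the real generation command and its flags must be taken from the repository's documentation. This improves throughput over many images, not the latency of one image.

```python
import os
import subprocess
import sys
from typing import Dict, List, Sequence


def shard_prompts(prompts: Sequence[str], gpu_ids: Sequence[int]) -> Dict[int, List[str]]:
    """Assign prompts to GPUs round-robin."""
    shards: Dict[int, List[str]] = {gpu: [] for gpu in gpu_ids}
    for i, prompt in enumerate(prompts):
        shards[gpu_ids[i % len(gpu_ids)]].append(prompt)
    return shards


def launch_per_gpu(base_cmd: Sequence[str], shards: Dict[int, List[str]]) -> List[int]:
    """Start one worker process per GPU, wait for all, and return exit codes.

    Each worker sees exactly one GPU (as its cuda:0) and receives its prompts
    as trailing command-line arguments. How the real script accepts prompts
    is an assumption here.
    """
    procs = []
    for gpu, prompts in shards.items():
        if not prompts:
            continue  # nothing to do on this GPU
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen([*base_cmd, *prompts], env=env))
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Stand-in worker that only prints what it was given; replace with the real command.
    worker = [
        sys.executable,
        "-c",
        "import os, sys; print(os.environ['CUDA_VISIBLE_DEVICES'], sys.argv[1:])",
    ]
    shards = shard_prompts(["a cat", "a dog", "a boat"], gpu_ids=[1, 2])
    print(launch_per_gpu(worker, shards))
```

Each worker loads its own copy of the model, so every GPU used this way needs enough memory for the full model.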

flysssss commented 4 months ago

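@Jiangfeng-Xiong Hello. On a V100, 1024x1024 currently takes roughly 1 s per step, so 100 steps take about 100 s. Are there other ways to speed this up, or will a quantized model be released later? The V100 does not support FlashAttention acceleration; if I switch to an H100, could a 1024x1024 image at 100 steps be generated in around 10 s?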

Jiangfeng-Xiong commented 4 months ago

Distilled and TensorRT (TRT) accelerated versions of the model will be released soon; please watch for updates. Specific speed figures will be published with that update.

gavid0124 commented 4 months ago

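OK, thanks. Is it technically impossible to use multiple GPUs for generating a single image, or is it just not supported yet and could be implemented later?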

zml-ai commented 4 months ago

From experience, distilling down to 50 steps and converting to TensorRT (TRT) keeps inference time at around 5 seconds on an H800. We will open-source this soon.
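The figures quoted in this thread can be compared on a per-step basis. The sketch below assumes total time scales linearly with the number of steps and ignores fixed costs such as text encoding and image decoding; the inputs are the numbers reported above, not new measurements, and `seconds_per_step` is a helper named here for illustration.

```python
def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average time per denoising step, assuming no fixed overhead."""
    return total_seconds / steps


v100 = seconds_per_step(100.0, 100)        # reported: ~100 s for 100 steps on a V100
h800_trt = seconds_per_step(5.0, 50)       # reported: ~5 s for 50 distilled steps with TRT on an H800
h100_target = seconds_per_step(10.0, 100)  # the 10 s / 100 steps asked about for an H100

print(v100, h800_trt, h100_target)  # 1.0 0.1 0.1
print(v100 / h800_trt)              # 10.0 (per-step speedup)
print(100.0 / 5.0)                  # 20.0 (end-to-end speedup)
```

On these numbers, the 5 s figure combines two effects: half as many steps and a roughly 10x lower per-step time. The per-step rate matches what the 10 s target would require, but it was quoted for an H800 with TRT, so it does not directly answer the H100 question.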