Closed felixslu closed 6 months ago
1. Question: On my Nvidia GeForce RTX 3090, running your w4a8 SD v1.4 model (sd_w4a8_ckpt-001.pth) gives an inference speed of 6 it/s. However, running an fp16 SD v1.5 model with PyTorch 2.0.1 gives 20 it/s.
2. My running command:
python txt2img.py --prompt "elon musk wearing a suit" --plms --cond --ptq --weight_bit 4 --quant_mode qdiff --no_grad_ckpt --split --n_samples 1 --outdir ./data/ --cali_ckpt ../sd_w4a8_ckpt-001.pth
3. My profile logs:
sd_w4a8_ckpt-001.pth
fp16 sd-v1.5 by pytorch2.0.1
Quantization mostly doesn't speed things up; it just reduces memory usage, imo.
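To expand on this: most PTQ research code uses simulated ("fake") quantization, where weights are rounded to the int4 grid but dequantized back to floating point before every matmul. The arithmetic still runs in fp32/fp16, plus extra quantize/dequantize overhead, so it can easily be slower than a plain fp16 run. A minimal NumPy sketch of the idea (function names are illustrative, not taken from the repo's code):

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    # Symmetric per-tensor quantization: snap weights onto a signed
    # int grid, then immediately dequantize back to float.
    qmax = 2 ** (n_bits - 1) - 1            # 7 for 4-bit
    scale = np.abs(w).max() / qmax
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (w_int * scale).astype(np.float32)  # float values used for compute

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((1, 64)).astype(np.float32)

w_dq = fake_quantize(w)
y = x @ w_dq  # the matmul still executes in float32:
              # only the stored weights shrink, so there is no speedup
```

Getting a real speedup from w4a8 would need dedicated low-bit GPU kernels, which this kind of research checkpoint doesn't ship with.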
Could you share your fp16 inference code? I'd like to run some experiments based on it.