Xiuyu-Li / q-diffusion

[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
https://xiuyuli.com/qdiffusion/
MIT License

Why does this quantized model need more than 24GB of GPU memory, far more than the ideal ~500MB? #14

Open felixslu opened 11 months ago

felixslu commented 11 months ago

1. Question

As we know, SD v1.5 has about 1 billion params, and its peak GPU memory is about 4GB at fp32 precision. So the weight memory at int4 precision (sd_w4a8_chpt.pth) should be about 4GB / 8 = 500MB. However, when I load and run your w4a8 quantized model, the consumed GPU memory is more than 24GB, and we finally get an OOM!
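
For reference, a rough back-of-envelope calculation (weights only; activations, the VAE/text encoder, and PyTorch/CUDA overhead are ignored, and ~1e9 parameters is assumed) matches that ~500MB expectation:

```python
# Back-of-envelope weight footprint, assuming ~1e9 parameters (weights only).
params = 1e9
print(f"fp32 weights: {params * 4 / 2**30:.1f} GiB")    # ~3.7 GiB
print(f"int4 weights: {params * 0.5 / 2**30:.2f} GiB")  # ~0.47 GiB, roughly 500 MB
```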

2. My command:

python txt2img.py --prompt "a puppet wearing a hat" --plms --cond --ptq --weight_bit 4 --quant_mode qdiff --no_grad_ckpt --split --n_samples 5 --quant_act --act_bit 8 --sm_abit 16 --outdir ./data/ --cali_ckpt ../sd_w4a8_ckpt-001.pth

3. Error log:

07/31/2023 11:16:03 - INFO - root - Loading model from models/ldm/stable-diffusion-v1/model.ckpt
07/31/2023 11:16:04 - INFO - root - Global Step: 470000
07/31/2023 11:16:04 - INFO - torch.distributed.nn.jit.instantiator - Created a temporary directory at /tmp/tmpmwfx988m
07/31/2023 11:16:04 - INFO - torch.distributed.nn.jit.instantiator - Writing /tmp/tmpmwfx988m/_remote_module_non_scriptable.py
LatentDiffusion: Running in eps-prediction mode
07/31/2023 11:16:07 - INFO - ldm.util - DiffusionWrapper has 859.52 M params.
07/31/2023 11:16:07 - INFO - ldm.modules.diffusionmodules.model - making attention of type 'vanilla' with 512 in_channels
07/31/2023 11:16:07 - INFO - ldm.modules.diffusionmodules.model - Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
07/31/2023 11:16:07 - INFO - ldm.modules.diffusionmodules.model - making attention of type 'vanilla' with 512 in_channels
07/31/2023 11:16:12 - INFO - main - Not use gradient checkpointing for transformer blocks
Loading quantized model checkpoint
Initializing weight quantization parameters
07/31/2023 11:16:27 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:28 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:28 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:29 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:32 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:34 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:37 - INFO - qdiff.quant_layer - split at 1280!
07/31/2023 11:16:38 - INFO - qdiff.quant_layer - split at 640!
07/31/2023 11:16:39 - INFO - qdiff.quant_layer - split at 640!
07/31/2023 11:16:40 - INFO - qdiff.quant_layer - split at 640!
07/31/2023 11:16:41 - INFO - qdiff.quant_layer - split at 320!
07/31/2023 11:16:42 - INFO - qdiff.quant_layer - split at 320!
Initializing act quantization parameters
Traceback (most recent call last):
  File "txt2img.py", line 444, in <module>
    main()
  File "txt2img.py", line 340, in main
    resume_cali_model(qnn, opt.cali_ckpt, cali_data, opt.quant_act, "qdiff", cond=opt.cond)
  File "/home/xx/car/bigmodel/q-diffusion/qdiff/utils.py", line 86, in resume_cali_model
    _ = qnn(cali_xs.cuda(), cali_ts.cuda(), cali_cs.cuda())
  ... ...
  File "/root/miniconda3/envs/qdiff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xx/car/bigmodel/q-diffusion/qdiff/adaptive_rounding.py", line 59, in forward
    x_float_q = (x_quant - self.zero_point) * self.delta
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.69 GiB total capacity; 23.21 GiB already allocated; 11.69 MiB free; 23.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
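
As a side note, the allocator hint at the end of the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF, although this only mitigates fragmentation and does not help when the model genuinely exceeds GPU capacity; the 128MB value below is just an arbitrary starting point, not something from this repo:

```python
import os

# Fragmentation mitigation suggested by the error message; must be set before
# the first CUDA allocation, and the 128 MB split size is an arbitrary guess.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var so the allocator picks it up
```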


tsa18 commented 10 months ago

A 32GB V100 also gets OOM.

arman-kazemi commented 9 months ago

A 40GB A100 also hits OOM. This is likely because of the fake-quantization operations, which require their own intermediate tensors to be allocated. You can reduce "n_samples" to counter this; for example, n_samples=1 only needs about 20GB.
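
To make that more concrete, here is a minimal sketch (not the repo's actual implementation) of why simulated, or "fake", quantization tends to increase rather than reduce memory: the original fp32 weight stays resident, and extra fp32-sized intermediates are materialized on each forward pass.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Simulated quantization: quantize, then immediately dequantize in fp32."""
    qmax = 2 ** n_bits - 1
    delta = (w.max() - w.min()) / qmax                                  # scale
    zero_point = torch.round(-w.min() / delta)
    w_int = torch.clamp(torch.round(w / delta) + zero_point, 0, qmax)   # fp32-sized intermediate
    return (w_int - zero_point) * delta                                 # dequantized output, same size as w

device = "cuda" if torch.cuda.is_available() else "cpu"
w = torch.randn(1280, 1280, device=device)   # original fp32 weight stays resident
w_q = fake_quantize(w)                       # plus intermediates and the fp32 output
```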

cvv-student commented 7 months ago

For me, this is not effective. Even with n_samples=1 on a 32GB V100, it still leads to OOM. This is the launch command: python scripts/txt2img.py --prompt "a puppet wearing a hat" --plms --cond --ptq --weight_bit 4 --quant_mode qdiff --no_grad_ckpt --split --n_samples 1 --quant_act --act_bit 8 --sm_abit 16 --outdir ./data/ --cali_ckpt models/sd_w4a8.pth --resume

Yheechou commented 1 month ago

> For me, this is not effective. Even with n_samples=1 on a 32GB V100, it still leads to OOM. This is the launch command: python scripts/txt2img.py --prompt "a puppet wearing a hat" --plms --cond --ptq --weight_bit 4 --quant_mode qdiff --no_grad_ckpt --split --n_samples 1 --quant_act --act_bit 8 --sm_abit 16 --outdir ./data/ --cali_ckpt models/sd_w4a8.pth --resume

I also encountered this problem. May I ask how you finally solved it?