Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

Could the author share the code for calculating the model parameters (Param.) and the computational complexity (MACs) of the pipeline? #53

Closed. StormArcher closed this issue 6 months ago.

StormArcher commented 6 months ago

Could the author share the code for calculating the model parameters (Param.) and the computational complexity (MACs) of the pipeline? Thank you very much!

bokyeong1015 commented 6 months ago

Hi, we've added the code. Please run:

pip install thop==0.1.1.post2209072238
python src/count_macs_params.py
Results:

== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 338.7G = 338749194240
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1585.4G = 1585374608384
  [Total] Params: 1032.1M = 1032071623
== nota-ai/bk-sdm-base | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 223.8G = 223755632640
  [U-Net] Params: 579.4M = 579384964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1470.4G = 1470381046784
  [Total] Params: 751.9M = 751935623
== nota-ai/bk-sdm-small | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 217.7G = 217727959040
  [U-Net] Params: 482.3M = 482346884
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1464.4G = 1464353373184
  [Total] Params: 654.9M = 654897543
== nota-ai/bk-sdm-tiny | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 205.0G = 205035274240
  [U-Net] Params: 323.4M = 323384964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1451.7G = 1451660688384
  [Total] Params: 495.9M = 495935623
== runwayml/stable-diffusion-v1-5 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 338.7G = 338749194240
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1585.4G = 1585374608384
  [Total] Params: 1032.1M = 1032071623
== stabilityai/stable-diffusion-2-1-base | 512x512 img generation ==
  [Text Enc] MACs: 22.3G = 22299160576
  [Text Enc] Params: 340.4M = 340387840
  [U-Net] MACs: 339.2G = 339241205760
  [U-Net] Params: 865.9M = 865910724
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1601.6G = 1601619898368
  [Total] Params: 1255.8M = 1255788743
== stabilityai/stable-diffusion-2-1 | 768x768 img generation ==
  [Text Enc] MACs: 22.3G = 22299160576
  [Text Enc] Params: 340.4M = 340387840
  [U-Net] MACs: 760.8G = 760797839360
  [U-Net] Params: 865.9M = 865910724
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 2023.2G = 2023176531968
  [Total] Params: 1255.8M = 1255788743
StormArcher commented 6 months ago

We followed the author's code for testing, and the MACs in our results differ greatly from those in the paper. Is there a problem with how we ran the code?

== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 0.2G = 232980480
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1.0G = 981467136
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 7.8G = 7760329728
  [Total] Params: 1032.1M = 1032071623

bokyeong1015 commented 6 months ago

We obtained the results below with the refactored and uploaded code, which are identical to those presented in our paper.

== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 338.7G = 338749194240
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1585.4G = 1585374608384
  [Total] Params: 1032.1M = 1032071623

Could you share the output of pip show thop so we can check the version and ensure that you installed via pip install thop==0.1.1.post2209072238? If you could share the exact procedure or code that you ran, it would help us reproduce your issue.
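As an alternative to pip show, here is a quick convenience snippet (not part of our script, and assuming Python 3.8+) to print the installed versions of both packages:

from importlib.metadata import version

# Print the installed versions of thop and diffusers to compare environments
for pkg in ("thop", "diffusers"):
    print(pkg, version(pkg))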

StormArcher commented 6 months ago
1. I ran your code directly, just for "CompVis/stable-diffusion-v1-4", as follows:

   get_macs_params(model_id="CompVis/stable-diffusion-v1-4", img_size=512, txt_emb_size=768, device=device)

2. The thop version is the same as yours: import thop; print(thop.__version__) gives 0.1.1.

3. Why does the profiling run only a single step?

   batch_size = 1
   dummy_timesteps = torch.zeros(batch_size).to(device)

   How do we control the number of sampling steps?

The code is as follows:


# ------------------------------------------------------------------------------------
# Copyright 2024. Nota Inc. All Rights Reserved.
# ------------------------------------------------------------------------------------
import torch
from diffusers import StableDiffusionPipeline
from thop import profile


def count_params(model):
    return sum(p.numel() for p in model.parameters())


def get_macs_params(model_id, img_size=512, txt_emb_size=768, device="cuda", batch_size=1):
    pipeline = StableDiffusionPipeline.from_pretrained(model_id).to(device)
    text_encoder = pipeline.text_encoder
    unet = pipeline.unet
    vae_decoder = pipeline.vae.decoder

    # text encoder
    dummy_input_ids = torch.zeros(batch_size, 77).long().to(device)  # (1, 77)
    macs_txt_enc, _ = profile(text_encoder, inputs=(dummy_input_ids,))
    macs_txt_enc = macs_txt_enc / batch_size
    params_txt_enc = count_params(text_encoder)

    # unet
    dummy_noisy_latents = torch.zeros(batch_size, 4, int(img_size/8), int(img_size/8)).to(device)  # (1, 4, 512/8, 512/8) = (1, 4, 64, 64)
    dummy_timesteps = torch.zeros(batch_size).to(device)  # single timestep index
    dummy_text_emb = torch.zeros(batch_size, 77, txt_emb_size).to(device)  # (1, 77, 768)
    # key input shapes: (1, 4, 64, 64), (1,), (1, 77, 768)
    macs_unet, _ = profile(unet, inputs=(dummy_noisy_latents, dummy_timesteps, dummy_text_emb))
    macs_unet = macs_unet / batch_size
    params_unet = count_params(unet)

    # image decoder
    dummy_latents = torch.zeros(batch_size, 4, 64, 64).to(device)  # (1, 4, 64, 64)
    macs_img_dec, _ = profile(vae_decoder, inputs=(dummy_latents,))
    macs_img_dec = macs_img_dec / batch_size
    params_img_dec = count_params(vae_decoder)

    # total
    macs_total = macs_txt_enc + macs_unet + macs_img_dec
    params_total = params_txt_enc + params_unet + params_img_dec

    # print
    print(f"== {model_id} | {img_size}x{img_size} img generation ==")
    print(f"  [Text Enc] MACs: {(macs_txt_enc/1e9):.1f}G = {int(macs_txt_enc)}")
    print(f"  [Text Enc] Params: {(params_txt_enc/1e6):.1f}M = {int(params_txt_enc)}")
    print(f"  [U-Net] MACs: {(macs_unet/1e9):.1f}G = {int(macs_unet)}")
    print(f"  [U-Net] Params: {(params_unet/1e6):.1f}M = {int(params_unet)}")
    print(f"  [Img Dec] MACs: {(macs_img_dec/1e9):.1f}G = {int(macs_img_dec)}")
    print(f"  [Img Dec] Params: {(params_img_dec/1e6):.1f}M = {int(params_img_dec)}")
    print(f"  [Total] MACs: {(macs_total/1e9):.1f}G = {int(macs_total)}")
    print(f"  [Total] Params: {(params_total/1e6):.1f}M = {int(params_total)}")


if __name__ == "__main__":
    device = "cuda:0"
    get_macs_params(model_id="CompVis/stable-diffusion-v1-4", img_size=512, txt_emb_size=768, device=device)
    get_macs_params(model_id="nota-ai/bk-sdm-base", img_size=512, txt_emb_size=768, device=device)
    # get_macs_params(model_id="nota-ai/bk-sdm-small", img_size=512, txt_emb_size=768, device=device)
    # get_macs_params(model_id="nota-ai/bk-sdm-tiny", img_size=512, txt_emb_size=768, device=device)
    # get_macs_params(model_id="runwayml/stable-diffusion-v1-5", img_size=512, txt_emb_size=768, device=device)
    # get_macs_params(model_id="stabilityai/stable-diffusion-2-1-base", img_size=512, txt_emb_size=1024, device=device)
    # get_macs_params(model_id="stabilityai/stable-diffusion-2-1", img_size=768, txt_emb_size=1024, device=device)
bokyeong1015 commented 6 months ago

1 - Thanks for checking.

3 - Do you mean how to control the number of denoising steps? We calculated the MACs for a single denoising step and then multiplied them by the total number of steps, which is 25. dummy_timesteps is a single scalar value for the timestep index.
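To make that arithmetic explicit, here is an illustrative sketch (using the SD-v1.4 numbers above and assuming 25 denoising steps, with the text encoder and image decoder each run once per image) of how per-image MACs would follow from the per-step numbers:

macs_txt_enc = 6545882112          # one text-encoder pass
macs_unet_per_step = 338749194240  # one U-Net pass = one denoising step
macs_img_dec = 1240079532032       # one image-decoder pass
num_steps = 25                     # number of denoising steps assumed in our setting

macs_per_image = macs_txt_enc + num_steps * macs_unet_per_step + macs_img_dec
print(f"{macs_per_image/1e9:.1f} GMACs per 512x512 image")  # about 9715.4 GMACs with these numbers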


2 - Hmm, when we ran pip install thop==0.1.1.post2209072238 and then your code, we obtained the log below. Please share the corresponding output from your side:

The log we obtained:

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.dropout.Dropout'>.
[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.dropout.Dropout'>.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 338.7G = 338749194240
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1240.1G = 1240079532032
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 1585.4G = 1585374608384
  [Total] Params: 1032.1M = 1032071623

The output of pip show thop (0.1.1.post2209072238):

Name: thop
Version: 0.1.1.post2209072238
Summary: A tool to count the FLOPs of PyTorch model.
Home-page: https://github.com/Lyken17/pytorch-OpCounter/

The output of pip show diffusers (0.15.0):

Name: diffusers
Version: 0.15.0
Summary: Diffusers
Home-page: https://github.com/huggingface/diffusers
StormArcher commented 6 months ago
1. I think the author may have forgotten to apply "/8" to img_size for the U-Net input; that would be why the U-Net computational complexity comes out as 339G, whereas with "img_size/8" it should be the same as mine (0.2G).

with img_size/8: 0.2G
with img_size/4: 0.9G
with img_size/2: 3.7G

2. I think my versions of thop and diffusers are the same as yours, but the MACs of my U-Net for "runwayml/stable-diffusion-v1-5" are as follows:

-> the output of pip show thop
Name: thop
Version: 0.1.1.post2209072238
Summary: A tool to count the FLOPs of PyTorch model.
Home-page: https://github.com/Lyken17/pytorch-OpCounter/
Author: Ligeng Zhu
Author-email: ligeng.zhu+github@gmail.com
License: MIT
Location: /opt/conda/lib/python3.8/site-packages
Requires: torch
Required-by:

-> the output of pip show diffusers
Name: diffusers
Version: 0.15.0.dev0
Summary: Diffusers
Home-page: https://github.com/huggingface/diffusers
Author: The HuggingFace team
Author-email: patrick@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.8/site-packages
Editable project location: /home/pansiyuan/.jupyter/diffusers
Requires: filelock, huggingface-hub, importlib-metadata, numpy, Pillow, regex, requests
Required-by:

== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [Text Enc] MACs: 6.5G = 6545882112
  [Text Enc] Params: 123.1M = 123060480
  [U-Net] MACs: 0.2G = 232980480
  [U-Net] Params: 859.5M = 859520964
  [Img Dec] MACs: 1.0G = 981467136
  [Img Dec] Params: 49.5M = 49490179
  [Total] MACs: 7.8G = 7760329728
  [Total] Params: 1032.1M = 1032071623

bokyeong1015 commented 6 months ago

I think the author may have forgotten to apply "/8" to img_size for the U-Net input; that would be why the U-Net computational complexity comes out as 339G, whereas with "img_size/8" it should be the same as mine (0.2G).

We don't quite follow this point. We think the division by 8 ("/8") is correctly applied in our code:

dummy_noisy_latents = torch.zeros(batch_size, 4, int(img_size/8), int(img_size/8)).to(device)

Furthermore, when we changed img_size as you suggested, we obtained the following results and were not able to reproduce your "with img_size/8: 0.2G":

# get_macs_params(model_id="CompVis/stable-diffusion-v1-4", img_size=512, txt_emb_size=768, device=device)

== CompVis/stable-diffusion-v1-4 | 512x512 img generation ==
  [U-Net] MACs: 338.7G = 338749194240
  [U-Net] Params: 859.5M = 859520964
# get_macs_params(model_id="CompVis/stable-diffusion-v1-4", img_size=64, txt_emb_size=768, device=device)

== CompVis/stable-diffusion-v1-4 | 64x64 img generation ==
  [U-Net] MACs: 6.8G = 6773345280
  [U-Net] Params: 859.5M = 859520964
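For reference, here is a minimal standalone sketch (assuming the same diffusers and thop versions as above; our script does essentially this inside the full pipeline) that profiles only the U-Net at an explicit 64x64 latent resolution, i.e., 512/8. It should reproduce the ~338.7G per-step figure reported above:

import torch
from diffusers import UNet2DConditionModel
from thop import profile

device = "cuda:0"
# Load only the U-Net of SD-v1.4 and profile a single forward pass at 64x64 latents (512/8).
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
).to(device)
dummy_latents = torch.zeros(1, 4, 64, 64).to(device)   # latent spatial size = img_size / 8
dummy_timesteps = torch.zeros(1).to(device)             # single timestep index
dummy_text_emb = torch.zeros(1, 77, 768).to(device)     # SD-v1.x text embedding dim = 768
macs, _ = profile(unet, inputs=(dummy_latents, dummy_timesteps, dummy_text_emb))
print(f"[U-Net] MACs per denoising step: {macs/1e9:.1f}G")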

Thanks for checking. Unfortunately, we are unsure whether we can provide further assistance at this moment, as we were not able to reproduce the issue you described.