PixArt-alpha / PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

https://pixart-alpha.github.io/PixArt-sigma-project/

GNU Affero General Public License v3.0

Inference is quite slow on A10 #92

Open souvikqb opened 1 month ago

souvikqb commented 1 month ago

Hi, the model is amazing to use, but inference is quite slow on an A10 GPU. I saw decent performance on an A100, though.

Is there any optimisation method I can apply to speed it up?

lawrence-cj commented 1 month ago

Is the slow inference due to the diffusion model or to other parts of the pipeline? BTW, please share your inference code so I can take a look.

souvikqb commented 1 month ago

I'm using this code directly:

import torch
from diffusers import Transformer2DModel, PixArtSigmaPipeline

# Use the first GPU if one is available; load weights in half precision.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
weight_dtype = torch.float16

# Load the 1024px PixArt-Sigma transformer weights.
transformer = Transformer2DModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    subfolder='transformer',
    torch_dtype=weight_dtype,
    use_safetensors=True,
)
# Build the full pipeline around that transformer.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
    transformer=transformer,
    torch_dtype=weight_dtype,
    use_safetensors=True,
)
pipe.to(device)

# Generate one image with the pipeline's default settings and save it.
prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
image.save("./cactus.png")

lawrence-cj commented 1 month ago

Cool. And what about the inference speed?

souvikqb commented 1 month ago

On an A10, this is the avg inference speed I'm getting for various sample/step combinations -
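For anyone wanting comparable averages across GPUs, here is a minimal timing sketch. It is not the script used in this thread: the `benchmark` helper and its parameters are hypothetical names. It assumes that warm-up runs should be excluded and that the GPU must be synchronised before reading the clock, since CUDA calls can return before the work finishes.

```python
import time
from statistics import mean


def benchmark(fn, warmup=1, runs=3, sync=None):
    """Return the mean wall-clock seconds per call of `fn` over `runs` timed calls.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked before and after each timed call.
    """
    sync = sync or (lambda: None)

    # Warm-up calls absorb one-off costs (kernel selection, caches) and are not timed.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        sync()  # make sure earlier GPU work is finished before starting the clock
        start = time.perf_counter()
        fn()
        sync()  # wait for this call's GPU work before stopping the clock
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Hypothetical usage with the pipeline from the snippet above:
#   import torch
#   avg = benchmark(lambda: pipe(prompt, num_inference_steps=20),
#                   sync=torch.cuda.synchronize)
#   print(f"{avg:.2f} s per image")
```

Reporting the step count and image resolution next to each average makes numbers from different machines easier to compare.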

lawrence-cj commented 1 month ago

That's strange, since I can generate an image within 8 seconds even on V100 GPUs. Is the V100 faster than the A10?

souvikqb commented 1 month ago

Can you share the code that you are using, along with the hyperparameter configuration?
Can we get it below 8s? I'm looking for something in the range of 3-4s, because I reached that with SDXL after optimising it with DeepCache.
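Two general knobs that might be worth trying, sketched under assumptions rather than as a tested recipe for this model: running fewer denoising steps via the pipeline's `num_inference_steps` argument, and wrapping the transformer with `torch.compile` (PyTorch 2.x). The helper names `generate` and `compile_transformer` are hypothetical, and neither the speed-up nor the effect on image quality has been measured in this thread.

```python
def compile_transformer(pipe, mode="reduce-overhead"):
    """Wrap the pipeline's transformer with torch.compile (requires PyTorch 2.x).

    The first call after compiling is slow because compilation happens then,
    so it should be excluded from any timing.
    """
    import torch  # imported lazily so `generate` below works without torch installed

    pipe.transformer = torch.compile(pipe.transformer, mode=mode)
    return pipe


def generate(pipe, prompt, num_inference_steps, guidance_scale=None):
    """Run the pipeline with an explicit step count and return the first image.

    Fewer steps means fewer transformer forward passes, usually at some
    cost in image quality.
    """
    kwargs = {"num_inference_steps": num_inference_steps}
    if guidance_scale is not None:
        kwargs["guidance_scale"] = guidance_scale
    return pipe(prompt, **kwargs).images[0]


# Hypothetical usage with the pipeline from the snippet earlier in the thread:
#   pipe = compile_transformer(pipe)
#   image = generate(pipe, "A small cactus with a happy face", num_inference_steps=14)
```

The step count of 14 in the usage comment is only an illustrative value; a suitable number depends on the scheduler and on how much quality loss is acceptable.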

lawrence-cj commented 1 month ago

The inference code is the same as yours. I suspect the environment is different, or something else is. I haven't tried DeepCache yet. If possible, I would really appreciate it if you could open a pull request.
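To compare environments, a small report like the one below could be run on both machines. This is only a sketch: the `environment_report` function and the set of fields it collects are assumptions about what might differ, not a confirmed diagnosis.

```python
import platform


def environment_report():
    """Collect version and GPU details that commonly differ between setups."""
    report = {"python": platform.python_version()}

    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built against
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)

    try:
        import diffusers
        report["diffusers"] = diffusers.__version__
    except ImportError:
        report["diffusers"] = None

    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Posting the output from both the A10 and the V100 machine would show whether the PyTorch, CUDA, or diffusers versions differ.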

souvikqb commented 1 month ago

Unfortunately, it seems DeepCache doesn't support PixArt-Sigma, only Stable Diffusion models. I'm still linking it here:

https://www.reddit.com/r/StableDiffusion/comments/18b40hh/deepcache_accelerating_diffusion_models_for_free/

https://github.com/horseee/DeepCache

I would definitely appreciate a more optimised version of PixArt-Sigma, because the image quality is really superior.

@lawrence-cj Did you get to try any optimisation techniques?

lawrence-cj commented 1 month ago

I haven't.

GavChap commented 3 weeks ago

This looks pretty interesting: https://github.com/horseee/learning-to-cache/tree/main