aredden / flux-fp8-api

Flux diffusion model implementation using quantized fp8 matmul, with the remaining layers using faster half-precision accumulate, which is ~2x faster on consumer devices.
Apache License 2.0
198 stars 18 forks

The speed of drawing is not satisfactory #26

Open lvjin521 opened 2 weeks ago

lvjin521 commented 2 weeks ago
[three images attached]

I have run into a new problem. I successfully built the project on a 4090 on the RunPod platform, but image generation did not become twice as fast. It still takes about 7000 milliseconds, the same as the original project. Please tell me the reason; I can't solve this problem. Thank you very much.

Viper373 commented 2 weeks ago

I have also encountered the same problem and hope to receive a reply as soon as possible!!!

aredden commented 2 weeks ago

I am a bit confused. Where are you getting 3.32 iterations per second? Total generation time doesn't mean as much as the it/s speed. You also need to take into account the image size and the number of steps you decide to generate with 🤔

lvjin521 commented 2 weeks ago

{ "prompt": "A detailed and adorable illustration of a small dog. The dog should be fluffy with big, expressive eyes, floppy ears, and a playful expression. It should be sitting on the ground with its tail wagging slightly, surrounded by a warm, cozy environment that enhances the cuteness of the scene. The colors should be soft and gentle, with warm lighting that makes the dog look even more endearing.", "width": 1024, "height": 1024, "num_steps": 24, "guidance": 3.5, "seed": 2 }

These are the generation parameters I used, and the final generation time is 7000 ms, not 300 ms.

aredden commented 2 weeks ago

The speeds you are getting look normal to me. The model does 3.32 forward passes per second, which is relatively close to max tflops for a 4090 if you're generating an image at 1024x1024. If you want more speed, you can shrink the image by setting height or width to less than 1024. Or you can use schnell, which lets you generate an image in 4 steps instead of 24, at slightly lower quality. Another option is to try a flux hyper lora, which lets you reduce the number of steps to ~8.

The speeds that H100s get are very different from the speeds you get with a 4090. An H100's max tflops for fp8 is absolutely gigantic, around 1500-2000 tflops, vs a 4090, which "only" (still a lot) gets ~330 tflops with fp8.
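
To make the arithmetic above concrete, here is a rough sketch of how step count and it/s translate into generation time. The helper name `estimated_seconds` is hypothetical and not part of flux-fp8-api. The estimate assumes the 3.32 it/s rate quoted in this thread holds for every variant at 1024x1024, which may not be the case in practice, and it ignores text encoding and VAE decoding.

```python
def estimated_seconds(num_steps: int, its_per_second: float) -> float:
    """Estimate denoising time only: number of steps divided by iterations per second."""
    if its_per_second <= 0:
        raise ValueError("its_per_second must be positive")
    return num_steps / its_per_second


# Rate reported in this thread for a 4090 at 1024x1024.
# Assumption: the same rate applies to each option below.
RATE_IT_PER_S = 3.32

options = [
    ("24 steps (parameters from this issue)", 24),
    ("~8 steps (hyper lora)", 8),
    ("4 steps (schnell)", 4),
]

for label, steps in options:
    print(f"{label}: {estimated_seconds(steps, RATE_IT_PER_S):.2f} s")
```

Under these assumptions, 24 steps comes out at roughly 7.2 s, which is consistent with the ~7000 ms reported above. The total time follows from the step count rather than from a slow it/s rate.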