aredden / flux-fp8-api

Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use a faster half-precision accumulate, which is ~2x faster on consumer devices.
Apache License 2.0

add benchmark numbers for rtx4000ada (non-sff) #14

Closed: flowpoint closed this issue 1 month ago

flowpoint commented 2 months ago

Benchmarks were run using the example from the README (without the API); see the sketch after the config below. No init image was used; using an init image seemed to speed up generation on the RTX 4000 Ada at 1024x1024 by 0.16 it/s.

The config-dev-offload-1-4080.json config was used with the following modified keys:

"ae_dtype": "bfloat16", "text_enc_dtype": "bfloat16", "flow_quantization_dtype": "qfloat8", "text_enc_quantization_dtype": "qint4", "ae_quantization_dtype": "qfloat8", "compile_extras": false, "compile_blocks": false, "offload_text_encoder": true, "offload_vae": false, "offload_flow": false

Setting "offload_flow": true strangely caused an out-of-memory error when generating a second image.
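For anyone trying to reproduce this, a minimal sketch of the failing pattern, again assuming the README-style entry point; the config filename here is a placeholder for the config above with "offload_flow" flipped to true:

```python
# Hypothetical sketch of the failing pattern; API names are assumptions
# as above, and "config-offload-flow.json" is a placeholder filename.
from flux_pipeline import FluxPipeline  # assumed import path

pipe = FluxPipeline.load_pipeline_from_config_path("config-offload-flow.json")

pipe.generate(prompt="first image", width=1024, height=1024)   # completes
pipe.generate(prompt="second image", width=1024, height=1024)  # reported OOM

# When reproducing, printing torch.cuda.memory_summary() between the two
# calls can help show what is still resident on the GPU after the first run.
```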

All in all, the 4090 seems about 2.8x faster than the RTX 4000 Ada (non-SFF), which is in line with the relative power consumption and other hardware specifications.

aredden commented 1 month ago

Awesome, thanks! Hmm, I will have to check out the OOM issue.