Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, which is ~2x faster on consumer devices.
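As a rough illustration of what that means (not the repo's actual code): a minimal sketch of a per-tensor-scaled fp8 matmul via `torch._scaled_mm` (a private, underscore-prefixed API), assuming PyTorch >= 2.4 on an Ada/Hopper GPU, plus the flag that enables reduced-precision accumulation for the non-quantized fp16 layers.

```python
import torch

# let fp16 matmuls accumulate in reduced precision (the faster
# "half precision accumulate" path for the non-quantized layers)
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

def quantize_fp8(t: torch.Tensor):
    """Per-tensor symmetric quantization to float8_e4m3fn."""
    scale = t.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
    return (t / scale).to(torch.float8_e4m3fn), scale

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

xq, sx = quantize_fp8(x)
wq, sw = quantize_fp8(w)

# _scaled_mm wants its second operand column-major; .t().contiguous().t()
# keeps the same values but flips the memory layout
y = torch._scaled_mm(
    xq, wq.t().contiguous().t(),
    scale_a=sx, scale_b=sw,
    out_dtype=torch.bfloat16,
)
```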
add benchmark numbers for rtx4000ada (non-sff) #14
benchmarks were run using the example from the readme (without the api). no init image was used; using an init image seemed to speed up generation on the rtx4000 at 1024x1024 by ~0.16 it/s (a timing sketch follows the config below).
the config-dev-offload-1-4080.json was used with the following modified keys:
"ae_dtype": "bfloat16", "text_enc_dtype": "bfloat16", "flow_quantization_dtype": "qfloat8", "text_enc_quantization_dtype": "qint4", "ae_quantization_dtype": "qfloat8", "compile_extras": false, "compile_blocks": false, "offload_text_encoder": true, "offload_vae": false, "offload_flow": false
offload_flow=true strangely caused an out-of-memory error when generating a second image.
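not verified against this repo's offload code, but a generic pattern that sometimes avoids OOM between generations: explicitly release cached allocator blocks after moving a module off the GPU.

```python
import gc
import torch

def offload(module: torch.nn.Module) -> None:
    """Move a module to CPU and release the GPU blocks it was using.

    Without empty_cache(), the caching allocator may keep the freed
    memory reserved, so a second generation can still hit an apparent
    out-of-memory error even though the weights are already on CPU.
    """
    module.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()
```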
all in all, the 4090 seems about 2.8x faster than the rtx4000ada (non-sff), which is in line with the difference in power consumption and other hardware specifications.