Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, which is ~2x faster on consumer devices.
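For context, a minimal sketch of the fp8-matmul idea (per-tensor dynamic scales, fp8 activations/weights, higher-precision output via `torch._scaled_mm`). This is an illustration under stated assumptions (PyTorch >= 2.4, an fp8-capable Ada/Hopper GPU), not this repo's exact code path:

```python
import torch

def fp8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Illustrative fp8 linear: quantize x (M, K) and weight (N, K) to
    float8_e4m3fn and multiply with torch._scaled_mm.
    Assumes PyTorch >= 2.4, fp8-capable GPU, and M/K/N multiples of 16."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Per-tensor dequantization scales so values fit the fp8 range.
    x_scale = x.abs().max().clamp(min=1e-12) / finfo.max
    w_scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / x_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    w_fp8 = (weight / w_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # _scaled_mm wants the second operand column-major, hence the .t() view.
    return torch._scaled_mm(
        x_fp8,
        w_fp8.t(),
        scale_a=x_scale.float(),
        scale_b=w_scale.float(),
        out_dtype=torch.bfloat16,
        use_fast_accum=True,  # trade accumulation precision for speed
    )
```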
Ah, this is probably the result of the fused qkv LoRA not being applied correctly. This is actually good, since I can use it to test whether my new implementation is correct. So thank you 😆.
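For reference, a minimal sketch of what "applying a LoRA to a fused qkv weight" involves: each per-projection delta `B @ A` has to land in its own row block of the fused `(3 * dim, dim)` matrix. The names, shapes, and key handling below are illustrative assumptions, not this repo's actual implementation:

```python
import torch

def apply_qkv_lora(
    qkv_weight: torch.Tensor,  # fused weight, shape (3 * dim, dim)
    lora_pairs: dict,          # {"q"/"k"/"v": (A, B)}, A: (rank, dim), B: (dim, rank)
    alpha: float,
    rank: int,
) -> torch.Tensor:
    """Merge per-projection LoRA deltas into a fused qkv weight.
    Each delta only touches its own row block; writing it to the wrong
    block (or skipping it) silently changes the model's output."""
    dim = qkv_weight.shape[1]
    scale = alpha / rank
    merged = qkv_weight.clone()
    offsets = {"q": 0, "k": dim, "v": 2 * dim}
    for name, (A, B) in lora_pairs.items():
        delta = (B.float() @ A.float()) * scale          # (dim, dim)
        start = offsets[name]
        merged[start:start + dim] += delta.to(qkv_weight.dtype)
    return merged
```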
The LoRA is:
https://civitai.com/models/819754/iced-out-diamonds-by-chronoknight-flux
The image generated with flux-fp8-api:
The image generated with ComfyUI and flux1-dev:
The prompt is: coca cola can