city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0

Questions on GGUF Q8 Model Performance: T5_FP16 vs T5_Q8 and Clip Usage #68

Open · FerreiraArmando opened this issue 3 weeks ago

FerreiraArmando commented 3 weeks ago

Hello everyone,

First off, a big thanks to city96 for the awesome work they've been contributing to the community. It's been incredibly helpful!

Here are my system specs:

- Processor: Intel i5-13400
- GPU: NVIDIA RTX 4060 Ti 8GB
- RAM: 64GB DDR4
- Operating System: Windows 11

I've been experimenting with the GGUF Q8 models, toggling between the T5_FP16 and T5_Q8 configurations. Here are my render times:

Q8 Dev:
- Resolution: 720x1280
- Steps: 20
- Sampler: Euler
- Scheduler: Beta
- Text encoders: T5_FP16, Clip l
- Render Time: 85 seconds

Q8 Schnell:
- Resolution: 720x1280
- Steps: 4
- Sampler: Euler
- Scheduler: Simple
- Text encoders: T5_FP16, Clip l
- Render Time: 16 seconds

A couple of questions that came up during my tests:

1 - Performance Differences: I noticed better speeds with T5_FP16—averaging 90 seconds for Dev and 18 seconds for Schnell. Switching to T5_Q8 bumped the times to around 100 seconds for Dev and 20 seconds for Schnell. Am I missing something in my workflow here?

2 - Using Clip l: Is Clip l always the recommended choice, or are there other options I should consider? Also, does the order of clips significantly impact the results?

Thanks for taking the time to read this! I know my questions might be basic, but I hope they can help others in the community too. I've attached the workflow file I used for reference.

Cheers!

[Screenshot: Captura de tela 2024-08-24 090531] Attached workflow: Flux - GGUF - Clean.json

Foul-Tarnished commented 3 weeks ago

GGUF will make things a bit slower; that's expected. It's a compressed format, so the weights need extra work to unpack (dequantize) when the model runs.
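
For a rough intuition of what that unpacking looks like, here's a minimal sketch of Q8_0-style block quantization in plain NumPy. The block size of 32 and the one-scale-per-block layout match Q8_0, but this is a simplified illustration, not the actual GGUF code:

```python
import numpy as np

def quantize_q8_0(w):
    # Q8_0-style layout: blocks of 32 weights share a single scale.
    blocks = w.reshape(-1, 32)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q, scales):
    # The extra runtime step: expand int8 values back to floats.
    return (q.astype(np.float32) * scales.astype(np.float32)).ravel()

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q8_0(w)
w_restored = dequantize_q8_0(q, s)
print("max abs error:", np.abs(w - w_restored).max())
```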

My quality comparison of the T5 variants (all on the Flux base model as Q6_K): [image: flux-comparison] Euler, beta scheduler, 40 steps.

Q6_K looks worse than FP8 to me (too different). Q8 is quite good, but the hairstyle is not the same (the bangs differ).

I also tested on the Q8 base model and the results were similar. It seems T5 is more impacted by GGUF quantization than the base model is. I would only use Q8 for T5, while Q6_K is great for the base model.

city96 commented 2 weeks ago

@FerreiraArmando

Performance Differences

This is expected; GGUF quantization will most likely always be slower than FP16 or FP8, since dequantizing takes extra compute. The current implementation is also not as optimized as it could be.
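
To see where that extra compute goes, here's a hypothetical micro-benchmark (not from this repo) comparing a plain float matmul against the dequantize-then-matmul path a quantized model effectively has to take:

```python
import time
import numpy as np

N = 2048
x = np.random.randn(N, N).astype(np.float32)
w = np.random.randn(N, N).astype(np.float32)

# Fake quantized weights: int8 values plus one float scale per 32-value block.
scales = np.abs(w).reshape(-1, 32).max(axis=1, keepdims=True) / 127.0
q = np.round(w.reshape(-1, 32) / scales).astype(np.int8)

t0 = time.perf_counter()
y_ref = x @ w                      # weights already in float: one matmul
t1 = time.perf_counter()

w_dq = (q.astype(np.float32) * scales).reshape(N, N)
y_q = x @ w_dq                     # quantized path: dequantize first, then matmul
t2 = time.perf_counter()

print(f"float matmul:     {t1 - t0:.3f}s")
print(f"dequant + matmul: {t2 - t1:.3f}s")
```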

Is Clip l always the recommended choice, or are there other options I should consider? Also, does the order of clips significantly impact the results?

Yes, clip-l (officially named openai/clip-vit-large-patch14) is still the recommended model, as that's what Flux was trained with. You can also try out CLIP-GmP-ViT-L-14, which has been a popular choice in the community lately.

@Foul-Tarnished

Q6_K looks worse than FP8 to me (too different)

I would argue that Q6_K is still closer. The painting in the background, the chair, and the clothes are correct, while they're different with FP8. Technically, using an imatrix would improve this and allow lower-bit quants to remain more faithful as well, though llama.cpp doesn't support this for T5 enc/dec models for now.
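
For readers unfamiliar with the term: an imatrix (importance matrix) is a set of per-weight importance values collected from activations on calibration data, and the quantizer then minimizes importance-weighted error rather than plain error. A toy sketch of that core idea, assuming round-to-nearest quantization with a searched block scale (a simplified stand-in, not llama.cpp's actual implementation):

```python
import numpy as np

def best_scale(w, importance, bits=6):
    # Search for the block scale that minimizes importance-weighted error.
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w).max() / qmax
    best_s, best_err = base, np.inf
    for f in np.linspace(0.7, 1.3, 61):   # candidates around the naive max-abs scale
        s = base * f
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = np.sum(importance * (w - q * s) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

w = np.random.randn(32).astype(np.float32)
importance = np.random.rand(32).astype(np.float32)  # stand-in for activation statistics

s_uniform, _ = best_scale(w, np.ones_like(w))   # no imatrix: every weight counts equally
s_weighted, _ = best_scale(w, importance)       # imatrix: precision goes where it matters
print(f"scale without imatrix: {s_uniform:.5f}, with imatrix: {s_weighted:.5f}")
```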

I would only use Q8 for T5, while Q6_K is great for the base model

I agree, although depending on how much VRAM/system RAM is present, that may not always be an option.