I am experimenting with your main.py for Flux with LoRAs, run from the Windows command line using your example command:
python main.py --prompt "A cute corgi lives in a house made out of sushi, anime" --lora_repo_id XLabs-AI/flux-lora-collection --lora_name anime_lora.safetensors --device cuda --offload --use_lora --model_type flux-dev-fp8 --width 1024 --height 1024
The step
Start a quantization process...
seems to be needed on every run? It adds around 2m30s to each image.
Is it possible to change this so that, after the first quantization, the model is saved and then loaded on future runs, making them much faster?
For example, this is how I do it with the "flux on potato" code so that the quantization only happens once. That code is for a different script, but you get the idea (a sketch of the pattern is below).
Without that tweak it is currently taking around 3m30s per image on a 24 GB 4090. Is there anything else that can be done to speed it up?
Thanks for any tips/ideas.
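For reference, a minimal sketch of that quantize-once pattern (not the original snippet), assuming main.py quantizes with optimum.quanto's quantize/freeze; the save/reload flow below is quanto's documented quantization_map/requantize workflow, and the cache file names are placeholders:

```python
import json

import torch
from optimum.quanto import freeze, qfloat8, quantization_map, quantize, requantize
from safetensors.torch import load_file, save_file

WEIGHTS = "flux-dev-fp8.safetensors"  # placeholder cache paths
QMAP = "flux-dev-fp8-qmap.json"

def quantize_and_save(model):
    """First run: quantize once, then persist both the quantized
    weights and the quantization map."""
    quantize(model, weights=qfloat8)
    freeze(model)
    save_file(model.state_dict(), WEIGHTS)
    with open(QMAP, "w") as f:
        json.dump(quantization_map(model), f)

def load_quantized(model):
    """Later runs: rebuild the quantized modules straight from disk,
    skipping the quantization pass. `model` is the freshly constructed
    (unquantized) Flux transformer."""
    with open(QMAP) as f:
        qmap = json.load(f)
    requantize(model, load_file(WEIGHTS), qmap, device=torch.device("cuda"))
```

On the reload path, requantize restores the quantized modules from the saved map, so the ~2m30s quantization pass only ever happens once.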
Hi! Thanks for the comment - indeed, the implementation is not optimized right now: the model quantization is called on every run. I will update the code with your recommendation.
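One possible shape for that update, as a rough sketch only: build_model below is a hypothetical stand-in for however main.py constructs the transformer, and the cache paths are arbitrary placeholders:

```python
import json
import os

import torch
from optimum.quanto import freeze, qfloat8, quantization_map, quantize, requantize
from safetensors.torch import load_file, save_file

WEIGHTS = "flux-dev-fp8.safetensors"  # placeholder cache paths
QMAP = "flux-dev-fp8-qmap.json"

def get_quantized_model(build_model):
    """build_model is a hypothetical callable that constructs the
    unquantized Flux transformer, standing in for the script's loader."""
    model = build_model()
    if os.path.exists(WEIGHTS) and os.path.exists(QMAP):
        # Cache hit: restore the quantized modules directly from disk.
        with open(QMAP) as f:
            qmap = json.load(f)
        requantize(model, load_file(WEIGHTS), qmap, device=torch.device("cuda"))
    else:
        # First run: quantize once, then cache the result for future runs.
        print("Start a quantization process...")
        quantize(model, weights=qfloat8)
        freeze(model)
        save_file(model.state_dict(), WEIGHTS)
        with open(QMAP, "w") as f:
            json.dump(quantization_map(model), f)
    return model
```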