aredden / flux-fp8-api

Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use a faster half-precision accumulate, which is ~2x faster on consumer devices.
Apache License 2.0

Hot Lora Replacement #18

Closed. Lantianyou closed this 3 weeks ago

Lantianyou commented 1 month ago

Currently, it seems the LoRA is loaded before the API server starts. Is there a way to load a LoRA per request and, once that request finishes, unload it again?

Lantianyou commented 1 month ago

Thanks for your work @aredden

aredden commented 1 month ago

I think this would be awesome, and I could work on it. The main issue is that I would need to figure out whether merging a LoRA and then unmerging it would affect the original weights. I will look into it, since that would be a nice option to have.

Lantianyou commented 1 month ago

Thank you for your reply. I will also try to implement it, although I am not an expert in this area.

Lantianyou commented 1 month ago

I did some Googling; PEFT claims it can merge and unmerge LoRAs, but the details are not explained:

https://discuss.huggingface.co/t/can-i-dynamically-add-or-remove-lora-weights-in-the-transformer-library-like-diffusers/87890

https://stackoverflow.com/questions/78518971/can-i-dynamically-add-or-remove-lora-weights-in-the-transformer-library-like-dif
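
For reference, the diffusers/PEFT stack does expose this kind of dynamic fuse/unfuse. Below is a minimal sketch (not this repo's API), assuming a diffusers build with Flux LoRA support; the model ID and LoRA file path are placeholders:

```python
# Minimal sketch of dynamic LoRA add/remove with diffusers (not flux-fp8-api).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Load LoRA weights, optionally fusing them into the base weights for speed.
pipe.load_lora_weights("path/to/my_lora.safetensors")
pipe.fuse_lora(lora_scale=1.0)

image = pipe("a photo of a cat", num_inference_steps=20).images[0]

# Undo the fusion and drop the LoRA so the next request starts from base weights.
pipe.unfuse_lora()
pipe.unload_lora_weights()
```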

Lantianyou commented 1 month ago

To unload the LoRA, I tried loading the same LoRA with scale = -1, but ran out of CUDA memory on a 24 GB 4090.
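
For context, loading the same LoRA with scale = -1 amounts to subtracting the delta that was previously added, so in exact arithmetic it restores the base weights; the OOM is likely from materializing the full-size per-layer delta on top of the already-resident model. A minimal sketch of doing it layer by layer and in place (function and argument names are illustrative, not the repo's API):

```python
# Sketch: apply a LoRA delta with a signed scale, one layer at a time and
# in place, so only one weight-shaped delta is allocated at any moment.
# `lora_a` / `lora_b` are the LoRA down/up matrices for one linear layer.
import torch

@torch.no_grad()
def apply_lora_delta(weight: torch.Tensor, lora_a: torch.Tensor,
                     lora_b: torch.Tensor, scale: float) -> None:
    # delta has the same shape as weight; build it in the weight's dtype on the
    # same device, then free it before moving on to the next layer.
    delta = (lora_b @ lora_a).to(device=weight.device, dtype=weight.dtype)
    weight.add_(delta, alpha=scale)   # scale=+1 merges, scale=-1 unmerges
    del delta

# Note: merging and then "unmerging" with the opposite sign cancels exactly in
# exact arithmetic, but only approximately in fp16/fp8 due to rounding.
```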

Lantianyou commented 1 month ago

I guess that to unmerge the LoRA cleanly you have to save the original weights somewhere first, but that would introduce some performance overhead.
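
One way to make the unmerge exact, as suggested here, is to back up the affected weights (for example on the CPU) before fusing and copy them back on unload, at the cost of extra host memory and a transfer per layer. A hedged sketch with hypothetical helper names; `lora_targets` is a simplified mapping from parameter name to a full-size weight delta:

```python
# Sketch: keep pristine CPU copies of the affected weights so an unload is an
# exact restore rather than an arithmetic un-merge. Names are hypothetical.
import torch

_original_weights: dict[str, torch.Tensor] = {}

@torch.no_grad()
def fuse_lora(model: torch.nn.Module,
              lora_targets: dict[str, torch.Tensor],
              scale: float = 1.0) -> None:
    params = dict(model.named_parameters())
    for name, delta in lora_targets.items():
        param = params[name]
        if name not in _original_weights:
            # one-time CPU backup of the untouched weight
            _original_weights[name] = param.detach().to("cpu", copy=True)
        param.add_(delta.to(param.device, param.dtype), alpha=scale)

@torch.no_grad()
def unfuse_lora(model: torch.nn.Module) -> None:
    params = dict(model.named_parameters())
    for name, original in _original_weights.items():
        params[name].copy_(original)   # copy_ moves CPU backup back to the GPU
    _original_weights.clear()
```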

aredden commented 1 month ago

Yeah, that's the problem. You wouldn't want to keep the LoRA weights in memory; you would want to fuse them into the model weights. But if you fuse them into the weights, many fuse/unfuse cycles could gradually degrade the original weights.
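
The degradation concern can be made concrete: in half (or fp8) precision each fuse/unfuse round trip can leave a small rounding residue, and swapping many different LoRAs in and out lets those residues build up. A tiny illustration with arbitrary shapes and cycle count:

```python
# Illustration of cumulative round-trip error from repeated fuse/unfuse in fp16.
import torch

torch.manual_seed(0)
w = torch.randn(2048, 2048, dtype=torch.float16)
reference = w.clone()

for _ in range(50):
    # a fresh LoRA each cycle, as when different LoRAs are swapped in and out
    a = torch.randn(8, 2048) * 0.01     # LoRA down projection
    b = torch.randn(2048, 8) * 0.01     # LoRA up projection
    delta = (b @ a).to(torch.float16)
    w += delta    # fuse
    w -= delta    # unfuse

# Typically nonzero: each round trip can leave a rounding residue behind.
print((w - reference).abs().max())
```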

Lantianyou commented 1 month ago

True and true

aredden commented 1 month ago

So I implemented it, but it's not ready for a push. It seems to work well though! It includes loading and unloading, and I added a web endpoint for it.

Lantianyou commented 1 month ago

Would you mind pushing the code to a different branch, so I can test it?

aredden commented 1 month ago

Alright, I pushed it to 'removable-lora': https://github.com/aredden/flux-fp8-api/tree/removable-lora. You can test it if you want, though it's currently not in the web API, so you would have to test it via a script. @Lantianyou
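
A rough shape for such a test script is sketched below. The pipeline loader and `generate()` call follow the repo's README from memory and may differ on that branch; the LoRA load/unload method names are placeholders, so check the branch for the actual entry points:

```python
# Hypothetical test-script shape for the removable-lora branch.
from flux_pipeline import FluxPipeline

pipe = FluxPipeline.load_pipeline_from_config_path("configs/config-dev.json")

# Baseline image from the unmodified weights.
base = pipe.generate(prompt="a photo of a forest", height=1024, width=1024)

# Placeholder calls: check the branch for the real load/unload API.
pipe.load_lora("path/to/my_lora.safetensors", scale=1.0)
with_lora = pipe.generate(prompt="a photo of a forest", height=1024, width=1024)

pipe.unload_lora("path/to/my_lora.safetensors")
restored = pipe.generate(prompt="a photo of a forest", height=1024, width=1024)
# `base` and `restored` should be (near-)identical if the unload is clean.
```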

Lantianyou commented 1 month ago

Thank you a lot, I will get back to you with the results.

81549361 commented 1 month ago

> Alright, I pushed it to 'removable-lora': https://github.com/aredden/flux-fp8-api/tree/removable-lora. You can test it if you want, though it's currently not in the web API, so you would have to test it via a script. @Lantianyou

I tested this branch and found that unloading the LoRA on a single 4090 causes an OOM.

81549361 commented 1 month ago

@aredden I can successfully unload the LoRA immediately after loading it, but if I unload it after running an inference, an OOM occurs.

aredden commented 1 month ago

Ah, I guess it might need some work on cleaning up the LoRAs after loading / unloading. I will work on this, thanks @81549361
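
For what it's worth, an OOM on unload after inference often means stale references are keeping large intermediates (materialized deltas, upcast copies, cached activations) alive. A generic cleanup pattern, not specific to this repo:

```python
# Generic post-unload cleanup: drop remaining references to the LoRA tensors,
# then release cached, unused GPU blocks back to the driver.
import gc
import torch

def cleanup_after_lora_unload(lora_state: dict) -> None:
    # `lora_state` stands in for whatever container still holds the LoRA tensors.
    lora_state.clear()
    gc.collect()                 # drop Python-side references
    torch.cuda.empty_cache()     # return cached, unused blocks to the allocator
```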

81549361 commented 1 month ago

> Ah, I guess it might need some work on cleaning up the LoRAs after loading / unloading. I will work on this, thanks @81549361

Thank you very much, your repo is awesome!

aredden commented 1 month ago

Alright, I merged it into the main branch.