Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Feat]: FLUX-1 Support #418

Open Tophness opened 2 months ago

Tophness commented 2 months ago

Describe your use-case.

FLUX-1 Support

What would you like to see as a solution?

FLUX-1 Support

Have you considered alternatives? List them here.

I could go fuck myself I guess

silverace71 commented 2 months ago

Amazing alternative lol.

FurkanGozukara commented 2 months ago

My biggest hope for low-VRAM FLUX-1 training is OneTrainer.

SDXL trains with the lowest VRAM on OneTrainer, so many people are waiting at the moment.

etha302 commented 2 months ago

I guess full fine-tuning will be harder to implement, but LoRA support alone would already be amazing.

djp3k05 commented 2 months ago

It would be nice to have Flux in OT... FLUX seems to be the best open-source model, well beyond what we have now.

Desm0nt commented 2 months ago

LoRA training for Windows users would be amazing, even if it isn't low-VRAM (fp8-quanto at 23.5 GB is also good enough). SimpleTuner doesn't work even on WSL, only in a rented Linux environment.

yggdrasil75 commented 2 months ago

I think OneTrainer might finally need to implement multi-GPU for full Flux support, because anything over a rank-16 LoRA will probably be untrainable with the 24 GB that consumer GPUs typically have. That is probably actually a good thing. Then again, with how good a model Flux is, we might be pushing rank-4 LoRAs for simple stuff, because you don't need to fine-tune much; it's probably already good at what you want.
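
A rough way to see what rank costs is to count the extra parameters a LoRA adds: each adapted linear layer gets two low-rank matrices with rank*(in+out) parameters, plus optimizer state for each of them. The sketch below is only a back-of-envelope estimate; the hidden size, the layer count and the bytes-per-parameter figures are placeholder assumptions, not Flux's exact architecture or OneTrainer's memory model.

    # Back-of-envelope LoRA size estimate. Hidden size 3072 and ~200 adapted linear
    # projections are illustrative assumptions, not the exact Flux layer list.
    def lora_params(rank, in_features=3072, out_features=3072, num_layers=200):
        # lora_down (in x rank) + lora_up (rank x out) per adapted layer
        return num_layers * rank * (in_features + out_features)

    for rank in (4, 16, 64):
        params = lora_params(rank)
        # assume bf16 weights (2 bytes) + fp32 AdamW moments (8 bytes) per trainable parameter
        mem_gb = params * (2 + 8) / 1024**3
        print(f"rank {rank:>3}: {params / 1e6:5.1f}M trainable params, ~{mem_gb:.2f} GB weights + optimizer state")

Even at rank 64 the adapter itself stays well under a gigabyte; most of a 24 GB card goes to the frozen base weights, activations and gradients, which is why the fp8/int4 quantization discussed in the following comments matters at least as much as the rank.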

ejektaflex commented 2 months ago

anything over a rank-16 LoRA will probably be untrainable with the 24 GB that consumer GPUs typically have.

On fp8, I can either train rank 2 at batch 1 or rank 1 at batch 2. On quanto-int4, I think I can manage rank 2 at batch 2 (though, AFAIK, Flux quantized to int4 is not available for inference in Comfy, only in Python code). Flux LoRA training capability on a 24 GB card is rather limited, and I've yet to have any successful results after doing an LR sweep.

Desm0nt commented 2 months ago

Flux LoRA training capability on a 24 GB card is rather limited

I ran rank 16, batch 1, gradient accumulation 3 on a single 4090 at 1024x1024 in fp8-quanto in an attempt to train a DoRA. So it can run rank 16, but the results are... highly questionable. It took 15 hours on 700 images (on which I had previously trained style LoRAs successfully for SD1.5, SDXL and PixArt Sigma), and with a high LR of 2e-5 it is both very undertrained and overcooked. With a lower LR it will take even longer to train; with a higher LR it will forget everything it knew before.

VeteranXT commented 1 month ago

It took 15 hours on 700 images.

I had 11 images and it took me 45-75 minutes. Geez!

stealurfaces commented 1 month ago

I like to fine-tune models on a handful of datasets (100-3000 images each). It would be really nice to have SDXL-like support for that in OT (because it works so well!).

VeteranXT commented 1 month ago

My question is: how did you get 1000 images, let alone 3000?

Tophness commented 1 month ago

Supposedly NF4 gets up to a 4x speedup and uses less VRAM for even higher precision, and it is now the recommended format. Seems like this might actually be doable now?

https://civitai.com/models/638572/nf4-flux1 https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/981

Nisekoixmy commented 1 month ago

Supposedly NF4 gets up to a 4x speedup and uses less VRAM for even higher precision, and it is now the recommended format. Seems like this might actually be doable now?

https://civitai.com/models/638572/nf4-flux1 lllyasviel/stable-diffusion-webui-forge#981

NF4 might be good for inference, but it may not be the optimal format for training. In general, training should be done under a more precise data type so that the model can learn well.
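
To make that concrete, here is a minimal sketch that round-trips a weight tensor through NF4 with bitsandbytes and measures the error. It assumes a CUDA GPU and the bitsandbytes package; the tensor shape and scale are arbitrary. Per-step weight updates during training can be of the same order as this error, which is the usual argument for keeping the trainable parameters in a higher-precision dtype even when the frozen base model is quantized.

    # Minimal sketch: measure the NF4 round-trip error (requires a CUDA GPU).
    import torch
    import bitsandbytes.functional as bnbF

    w = torch.randn(4096, 4096, device="cuda", dtype=torch.float32) * 0.02  # typical weight scale
    q, state = bnbF.quantize_4bit(w, blocksize=64, quant_type="nf4")        # block-wise NF4
    w_hat = bnbF.dequantize_4bit(q, quant_state=state, quant_type="nf4")

    err = (w - w_hat).abs().mean().item()
    print(f"mean abs NF4 round-trip error: {err:.2e}")
    # Tolerable for inference, but a single update of lr * grad can be smaller than this,
    # so it disappears into the quantization noise unless the trainable weights stay
    # in bf16/fp32 (as QLoRA-style setups do).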

dathide commented 1 month ago

There's a guide here by the SimpleTuner devs for creating FLUX LoRAs. Edit: they recommend 512px training over 1024px.

driqeks commented 1 month ago

Is Flux training coming to OneTrainer? It seems to work great in SimpleTuner.

Black911X commented 1 month ago

It would be great if we could train with 12 GB of VRAM.

Tophness commented 1 month ago

There's a guide here by the SimpleTuner devs for creating FLUX LoRAs. Edit: they recommend 512px training over 1024px.

It would be great if we could train with 12 GB of VRAM.

They have this running great on 12 GB and 16 GB with kohya's scripts. SimpleTuner recommends A100s, and people seem to be saying its implementation is not very efficient. https://github.com/kohya-ss/sd-scripts/issues/1445#issuecomment-2291817718

yggdrasil75 commented 1 month ago

My question is: how did you get 1000 images, let alone 3000?

A Flux LoRA dataset, like any LoRA dataset, is whatever you want. You can pull all the images from a subreddit you like, download an artist's entire ArtStation, rip every frame from a video, whatever. Then you use something like taggui to generate captions or tags for the images (useful if you didn't pull from a booru), and you have 3000+ images. Currently I have ~3400 images from roughly 100 different artists covering 40 different characters from 2 different games. When I downloaded them from a booru I got 11k images and deleted the majority because they were poorly drawn, not accurate to the character (gender swaps, race swaps, and even species swaps, like "catlike girl as an actual cat" type stuff), or simply not something I am interested in training on (comics, mostly) or interested in at all.

If you want a tool to help, look at gallery-dl. If you are pulling from a booru, I would recommend putting something in your conf to separate the description from the tags.
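
Whatever tool generates the captions, it is worth checking that every image actually ends up with a matching caption file before starting a run. A minimal sketch assuming the common image-plus-.txt-sidecar layout; the folder name and extension list are placeholders:

    # Minimal sketch: list images in a dataset folder that have no caption .txt next to them.
    from pathlib import Path

    dataset = Path("dataset")  # hypothetical folder containing images and captions
    image_exts = {".png", ".jpg", ".jpeg", ".webp"}

    missing = [p for p in sorted(dataset.iterdir())
               if p.suffix.lower() in image_exts and not p.with_suffix(".txt").exists()]

    print(f"{len(missing)} images without captions")
    for p in missing[:20]:
        print("  ", p.name)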

gilga2024 commented 1 month ago

Any news on this? I just read about the Flux branch and checked it, but found no other info. Is there any way one can help / provide input via testing?

mx commented 1 month ago

News will be provided when things are ready.

gilga2024 commented 1 month ago

It seems like the Flux feature is now available in master, at least for LoRA and DoRA. May I kindly ask if there is any documentation / update on how to use it? I have not found any information on this so far.

master131 commented 1 month ago

I trained a test LoRA, and it appears that the key names in the produced safetensors do not match what is used by most LoRAs that can be loaded into WebUI Forge.

For example, OneTrainer: lora_transformer_single_transformer_blocks_10_attn_to_k.lora_up.weight

And other models (e.g. via SimpleTuner): transformer.single_transformer_blocks.10.attn.to_k.lora_A.weight

MNeMoNiCuZ commented 1 month ago

Sounds like it would need some renaming:

https://huggingface.co/comfyanonymous/flux_RealismLora_converted_comfyui/blob/main/convert.py
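
For anyone who needs to convert on the loader side before the tools catch up, the robust direction is the one ComfyUI takes: start from the known diffusers module names and derive the kohya-style name from each, since a naive underscore-to-dot replacement would break names like to_k or single_transformer_blocks. A rough sketch of that idea; the lora_down/lora_up to lora_A/lora_B pairing is the usual kohya convention and is stated here as an assumption:

    # Sketch: map kohya/OneTrainer-style Flux LoRA keys to diffusers-style keys.
    # Built from the diffusers module names so that underscores inside names survive.
    def build_key_map(diffusers_module_names):
        key_map = {}
        for name in diffusers_module_names:  # e.g. "single_transformer_blocks.10.attn.to_k"
            kohya_base = "lora_transformer_" + name.replace(".", "_")
            key_map[f"{kohya_base}.lora_down.weight"] = f"transformer.{name}.lora_A.weight"
            key_map[f"{kohya_base}.lora_up.weight"] = f"transformer.{name}.lora_B.weight"
        return key_map

    km = build_key_map(["single_transformer_blocks.10.attn.to_k"])
    print(km["lora_transformer_single_transformer_blocks_10_attn_to_k.lora_up.weight"])
    # -> transformer.single_transformer_blocks.10.attn.to_k.lora_B.weight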

Nerogar commented 1 month ago

That one sounds like a SimpleTuner issue. The OneTrainer names are the same for all models; they follow the naming convention set by kohya back in the SD1.5 days. Changing that doesn't make any sense.

master131 commented 1 month ago

I just checked; it looks like the key renaming is implemented in the latest ComfyUI but has not been backported to Forge yet.

ComfyUI implementation: https://github.com/comfyanonymous/ComfyUI/blob/f067ad15d139d6e07e44801759f7ccdd9985c636/comfy/lora.py#L327

Forge implementation: https://github.com/lllyasviel/stable-diffusion-webui-forge/blob/668e87f920be30001bb87214d9001bf59f2da275/packages_3rdparty/comfyui_lora_collection/lora.py#L318

I manually patched this file in Forge as per below and everything is working now, happy days:

            if k.endswith(".weight"):
                to = diffusers_keys[k]
                key_map["transformer.{}".format(k[:-len(".weight")])] = to #simpletrainer and probably regular diffusers flux lora format
                key_map["lycoris_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #simpletrainer lycoris
                key_map["lora_transformer_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #onetrainer

gilga2024 commented 1 month ago

It seems like the Flux feature is now available in master, at least for LoRA and DoRA. May I kindly ask if there is any documentation / update on how to use it? I have not found any information on this so far.

I do not need an answer on this anymore; someone posted a guide on Reddit.

protector131090 commented 1 month ago

I think OneTrainer might finally need to implement multi-GPU for full Flux support, because anything over a rank-16 LoRA will probably be untrainable with the 24 GB that consumer GPUs typically have. That is probably actually a good thing. Then again, with how good a model Flux is, we might be pushing rank-4 LoRAs for simple stuff, because you don't need to fine-tune much; it's probably already good at what you want.

Up to rank 64 is possible at 1024x1024 with ai-toolkit.

Tophness commented 1 month ago

I think OneTrainer might finally need to implement multi-GPU for full Flux support, because anything over a rank-16 LoRA will probably be untrainable with the 24 GB that consumer GPUs typically have. That is probably actually a good thing. Then again, with how good a model Flux is, we might be pushing rank-4 LoRAs for simple stuff, because you don't need to fine-tune much; it's probably already good at what you want.

Up to rank 64 is possible at 1024x1024 with ai-toolkit.

Rank 128 was possible on my RTX 4080 16 GB, although I've heard 16-32 is enough for Flux and better for details like skin, and from my testing so far that does seem to be the case; it's hard to tell, though. I suspect FurkanGozukara will have more conclusive results. I've never been able to fit it entirely in my VRAM with kohya, so I stopped trying. My best and fastest results so far are adamw8bit / rank 16 / train_t5xxl / split_qkv / loraplus_unet_lr_ratio=4, which is designed for 24 GB only. 8 GB is spilling into my shared VRAM, but it has already learned in 2 days what took 3 weeks to reach on the recommended Adafactor settings for 16 GB cards, so I think that by tomorrow it will converge on what previously took a month.
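
For reference, the settings above could be expressed as a kohya sd-scripts launch wrapped in Python, roughly as below. The script name, flag spellings and file paths are assumptions about the sd-scripts Flux branch and may not match a given checkout (dataset, text-encoder and VAE arguments are omitted), so treat it as a sketch rather than a recipe.

    # Sketch only: launch a kohya Flux LoRA run with the settings mentioned above.
    # Flag names and paths are assumptions; check the script's --help before relying on them.
    import subprocess

    cmd = [
        "accelerate", "launch", "flux_train_network.py",             # assumed entry point
        "--pretrained_model_name_or_path", "flux1-dev.safetensors",  # hypothetical path
        "--output_dir", "output",
        "--network_module", "networks.lora_flux",
        "--network_dim", "16",                                       # rank 16
        "--optimizer_type", "adamw8bit",
        "--network_args", "train_t5xxl=True", "split_qkv=True", "loraplus_unet_lr_ratio=4",
    ]
    subprocess.run(cmd, check=True)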

FurkanGozukara commented 1 month ago

@Tophness yes, OneTrainer is on my radar; I hope to prepare configs this week. We have 16 GB configs - have you tried them? They should fit. In particular, I added very recent CPU offloading versions, which reduce VRAM by about 200-300 MB. However, with 16 GB we do sacrifice a little bit of quality compared to 24 GB cards. I hope OneTrainer handles VRAM as well as it does for SDXL; I haven't had the chance to check yet. Currently testing the impact of T5 training.

Tophness commented 1 month ago

@Tophness yes, OneTrainer is on my radar; I hope to prepare configs this week. We have 16 GB configs - have you tried them? They should fit. In particular, I added very recent CPU offloading versions, which reduce VRAM by about 200-300 MB. However, with 16 GB we do sacrifice a little bit of quality compared to 24 GB cards. I hope OneTrainer handles VRAM as well as it does for SDXL; I haven't had the chance to check yet. Currently testing the impact of T5 training.

I haven't tried your Patreon configs; I just meant that from past YT videos and such, your tests have been pretty comprehensive. I think I might be able to just fit under 16 GB now that CPU offloading has been added. It's supposed to slow things down by 15%, but surely the overhead of being ~1 GB into shared VRAM is more than 15% anyway, so it should actually speed up? I'll have to test after this adamw8bit/T5 run. Quality is more important for the one I'm training at the moment, since it's a very broad concept that touches everything.

MNeMoNiCuZ commented 2 weeks ago

Any status on Flux support? Will there be some kind of announcement or update to the main page when Flux support is "stable"?

Cheers