cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

resuming from checkpoint #72

Open quarterturn opened 1 month ago

quarterturn commented 1 month ago

Is there a way to resume training from a checkpoint? Say, for example, you can't run it non-stop because of demand-based electricity pricing.

maxbizz commented 1 month ago

Did you find an answer to this?

quarterturn commented 1 month ago

Looks like a recent commit added the training script as a text file, so although I haven't tried it, you can probably run the script manually and supply the option to resume from a checkpoint there.

Tablaski commented 1 month ago

I think I've found out how, but I'm facing some issues myself, so I was about to open a new issue when I found yours.

What I found out is that you can't resume if you launch the training script from fluxgym.

1) Enabling saving states

What you can do is press Ctrl+C right after you've clicked on start training. Just wait for the script to save all the latent files; you can press Ctrl+C when it reaches "running training / 学習開始" or after that.

If you Ctrl+C before that point, these files will be corrupted and you will have to go through the UI again to set everything up and do it over again.

Then go into fluxgym\outputs\[yourLoraName] and open the train.bat file with Notepad++. This file contains the command line used to launch the training script.

You will have to remove all the "^" characters and make it a one-liner (Ctrl+J in Notepad++). You absolutely have to add the parameter --save_state to the command; otherwise you cannot resume afterwards. I've tried already, so believe me.

With --save_state, every time your training is saved to a .safetensors file in fluxgym\outputs\[YourLoraName], it will also create a folder [YourLoraName]-[epochNumber]-state.

Maybe save your new command line somewhere if you want to keep the original fluxgym one intact.

Now open PowerShell, go to the fluxgym directory, activate the environment with env\scripts\activate, and paste the updated command.
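For example, here is a minimal sketch of that step (cmd.exe, with a hypothetical LoRA called MyGreatLora and fluxgym installed at C:\fluxgym; keep every option from your own train.bat and only append --save_state at the end):

cd C:\fluxgym
env\Scripts\activate
REM paste your flattened train.bat command here with --save_state appended, i.e. something like:
REM accelerate launch --mixed_precision bf16 sd-scripts/flux_train_network.py [all the original train.bat options] --save_state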

2) Resuming

Find your command line again and add this parameter:

--resume "c:\fluxgym\outputs\[YourLoraName]\[folder of the state you want to resume from]"

For example, if your training's last save was at epoch 9, it would be:

--resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009-state"

Now this could be enough; it will work, I've done it. But that's where my own issue begins :-( How do I stop the script from counting back from step 0? The ending step of each epoch is stored in MyGreatLora-000009-state\train_state.json, but I can't get it right with the script despite reading the code in fluxgym\train_network.py.

I added the following parameters to the command line, along with --resume, to resume a LoRA that stopped at epoch 22 and step 12650:

--initial_epoch 23 --initial_step 12650 --skip_until_initial_step
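Put together, the delta I append to the flattened train.bat command looks like this (a sketch reusing my hypothetical MyGreatLora paths from above; everything else stays exactly as fluxgym generated it):

--save_state --resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000022-state" --initial_epoch 23 --initial_step 12650 --skip_until_initial_step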

I had to read the code and ask ChatGPT many questions to try to understand how this works, but it always comes back to resuming from step 0 and saving the further train_state.json files with steps starting back from 0 at the point where the resume starts.

The LoRA is still improving though, so it's still worth doing, but I'd be relieved to have the correct number of steps so I'm sure it is not messing with the optimizer or something else.

I've tried editing train_state.json to set the correct step, but long story short I had to force it with --initial_step and --skip_until_initial_step.

But the result is this strange backwards counting of steps, from the number I give it down to 0, subtracting steps at each epoch.

I've seen a discussion on the GitHub for accelerate (the launcher used by the training script) where they say something about checkpointing.py saving the step, etc. For now I really don't get it.

Getting this far has taken me a lot of time already, believe me. I hope this shortcut helps you, and that somebody can also help me :(

Any insights ?

quarterturn commented 1 month ago

Resuming from a checkpoint works if you use the Kohya training script directly, with the --resume option pointing to the desired checkpoint.

Tablaski commented 1 month ago

@quarterturn can you give example command lines please? It's not as easy as it seems.

For starters, the --resume argument expects a folder containing the training state. It doesn't work with a checkpoint like you say. I've just tried it again to be 100% sure: --resume "[path][lora].safetensors" definitely does not work.
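To illustrate (the checkpoint filename is hypothetical, following kohya's usual naming):

REM does NOT work: pointing --resume at a .safetensors checkpoint
REM   --resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009.safetensors"
REM works: pointing --resume at the state folder created thanks to --save_state
--resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009-state"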

Have you actually done it (I mean successfully resumed training of a FLUX LoRA)?

If not, please read what I wrote earlier before posting a one-liner like this.

If yes, what were the results? Are you sure it resumed the training taking all the previous steps into account? Because it generates samples and training states with steps resetting from zero, so I'm not too sure it's working properly at the moment.

I've spent hours today reading the code of this script, train_network.py, to try to solve this.

Tablaski commented 1 month ago

Additional proof of what I'm saying, from the official readme.md, translated from Japanese. The bad news is that these instructions are insufficient:

In train_network.py and sdxl_train_network.py, it is now possible to restore the loading order of the dataset when resuming training. Thanks to KohakuBlueleaf for PRs #1353 and #1359.

This solves the issue of the dataset loading order changing when resuming training.

If you specify the --skip_until_initial_step option, dataset loading will be skipped until the specified step. If not specified, the behavior remains unchanged (the dataset will be loaded from the beginning).

If you specify the --resume option, the step count saved in the state will be used.

If you specify the --initial_step or --initial_epoch options, dataset loading will be skipped until the specified step or epoch.

Please use these options in combination with --skip_until_initial_step. You can also use these options without the --resume option (e.g., when resuming training using --network_weights).
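In other words, based on that readme excerpt, the two combinations look roughly like this (a sketch reusing the hypothetical MyGreatLora paths from earlier):

REM with --resume: the step count comes from the saved state, you only skip the dataset forward
--resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009-state" --skip_until_initial_step

REM without --resume: load the weights and tell the script where to start explicitly
--network_weights "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009.safetensors" --initial_epoch 10 --skip_until_initial_step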

Tablaski commented 1 month ago

Still researching the --resume feature in accelerate; I'm now convinced it can't be trusted.

Refer to https://github.com/kohya-ss/sd-scripts/discussions/772 - I got resuming step = 0 when using --resume, after it accesses the saved training state

I do hope I'm wrong, but I don't see how

Vigilence commented 1 month ago

Definitely need a resume option that works.

Tablaski commented 1 month ago

Actually, cheer up, because since I last wrote that I've spent a lot of time in the code and adding loggers.

Resume actually works fine: you just have to use --save_state from the start, then use --resume "path\yourlora-0000xx-state", and it'll work.

This is backed up by watching the loss curve in tensorboard; it does not start from scratch. I've been using it for maybe a week now and have nothing special to report.

The progress bar is very misleading, starting from zero again, and there is also a bug in the train_state.json saving: the current_step value is incorrect, so subsequent states will be saved with steps starting again from zero at every resume.

But don't worry like I did... it doesn't change the internal trainer step, which is reset to zero at the end of every epoch. The only case where there might be a problem is if you're using a parameter to save at a given step; then I'm not sure the trainer's internal step is properly saved in the binary state file.

I've done a local fix for the progress bar and the train_state file that I need to clean up and suggest to kohya, but I have lots of other things to hack first.

AllyourBaseBelongToUs commented 3 weeks ago

> Resume actually works fine: you just have to use --save_state from the start, then use --resume "path\yourlora-0000xx-state", and it'll work.
>
> The progress bar is very misleading, starting from zero again, and there is also a bug in the train_state.json saving: the current_step value is incorrect, so subsequent states will be saved with steps starting again from zero at every resume.

Awesome work Tablaski. If I didn't misunderstand: we should open the advanced options in Fluxgym (via Pinokio) and pick --save_state from the get-go, otherwise --resume only works from the last saved epoch but not from the intermediate steps within each epoch?

How much time passes between epochs for you?

Update from my side: when using --initial_step 24 ^ in the train.bat (just starting it from the CMD line; then there is no need to remove all the "^" and flatten it into one line, just run train.bat), it actually shows that it's starting from step X, i.e. whatever you specified. At least that's what it writes in the CMD; I haven't finished a training from an advanced state yet to see if it works properly.

Where can we learn about using tensorboard, though?

Tablaski commented 3 weeks ago

Yes, you absolutely need --save_state during training if you want to resume later; it's impossible otherwise.

Then you will need --save_every_n_epochs n, where n is the number of epochs between saves. I use 1, so a state is saved every epoch.

You can also use --save_state_on_train_end if you just want to save at the end (last epoch), but then you have to ensure nothing crashes your training, and that setting sucks because you won't be able to resume from an earlier epoch if your model starts overfitting.

If you are paranoid you can use --save_every_n_steps n, where n is the number of steps between saves, so you could save every half epoch for instance. I don't use that; one save per epoch is enough for me.
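As command-line fragments, the options above look like this (a sketch; the step count is just an example value):

REM save a resumable state every epoch:
--save_state --save_every_n_epochs 1
REM or save a resumable state every 500 steps:
--save_state --save_every_n_steps 500
REM or only save a state once, at the very end of training:
--save_state_on_train_end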

As for the time between epochs, it really depends on how many repeats you are using and whether you are using regularization pictures (which doubles the steps).

Tensorboard is really easy to set up, a 5-minute job: you basically do "pip install tensorboard", activate your env (env\scripts\activate in the fluxgym folder), then run "tensorboard --logdir [log path]".

You have to use --log_with tensorboard and --logging_dir [log path] on the training command to write the logs that tensorboard can then open.
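Concretely, the whole tensorboard setup is something like this (a sketch; the log path is just an example):

cd C:\fluxgym
env\Scripts\activate
pip install tensorboard
REM add these two options to the training command so logs get written:
REM   --log_with tensorboard --logging_dir "C:\fluxgym\outputs\MyGreatLora\logs"
REM then point tensorboard at the same folder and open http://localhost:6006
tensorboard --logdir "C:\fluxgym\outputs\MyGreatLora\logs"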

I've found tensorboard REALLY useful for monitoring what your training is doing, anticipating problems, and learning about training.

Btw, the best course is asking ChatGPT about training; you have no idea how much I've learned about training by asking it about every option and concept. I knew more after a few weeks than a friend who has been training for a year or more. Do compare what ChatGPT says with forums and articles from time to time; for a few concepts people's experience was more accurate, but overall it gives good advice.

AllyourBaseBelongToUs commented 3 weeks ago

On slow systems saving every epoch takes quite some time.

Though idk what I broke right now: I increased the number of workers to 40 and it's generating at lightning speed compared to before :O even the image generation is going at 2 iterations per second (even weirder, it didn't work 3 times with exactly the same settings, then all of a sudden it goes at berserk speed).

I'm using the 8GB option now, with 40 workers, in fluxgym via Pinokio.

On the other hand, I barely see improvements in the face, i.e. it doesn't look closer to me, only some of the clothes...

Do we have to crop and focus on the face? Or did you get better results by changing the prompt?

Tablaski commented 3 weeks ago

What setup do you have? What's the batch size? What's the resolution? I get roughly 3 seconds per step using batch size 6.

I did not gain anything significant in terms of speed by playing with dataloaders, workers, etc. Batch size, however, yes.

When training faces I crop everything but the face.

AllyourBaseBelongToUs commented 3 weeks ago

EDIT: lol, because I was too lazy to close the CMD window I actually got everything (saved it to a notepad now):

saving checkpoint: D:\Users\Eddy\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e\kh-2e.safetensors
INFO model saved. train_network.py:1298
steps: 960it [1:03:42, 3.98s/it, avr_loss=0.267]

That's how long it took from start to finish, with 16GB picked in the FluxGym GUI; will scour through the...

Laptop RTX 4080 (12 GB VRAM), 64 GB RAM, i9-12900.

Resolution was 512x512, batch size 1. What does batch size use in terms of resources?

(The logs seem to indicate threads per process doesn't do anything and is turned back to 1; changing that didn't affect anything either.)

In any case, do you want me to parse the whole txt file, or should I look for something specific?

accelerate launch ^
  --mixed_precision bf16 ^
  --num_cpu_threads_per_process 4 ^ 
  sd-scripts/flux_train_network.py ^
  --pretrained_model_name_or_path "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\models\unet\bdsqlsz\flux1-dev2pro-single\flux1-dev2pro.safetensors" ^
  --clip_l "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\models\clip\clip_l.safetensors" ^
  --t5xxl "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\models\clip\t5xxl_fp16.safetensors" ^
  --ae "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\models\vae\ae.sft" ^
  --cache_latents_to_disk ^
  --save_model_as safetensors ^
  --sdpa --persistent_data_loader_workers ^
  --max_data_loader_n_workers 40 ^
  --seed 47 ^
  --gradient_checkpointing ^
  --mixed_precision bf16 ^
  --save_precision bf16 ^
  --network_module networks.lora_flux ^
  --network_dim 4 ^
  --optimizer_type adafactor ^
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" ^
  --lr_scheduler constant_with_warmup ^
  --max_grad_norm 0.0 ^
  --sample_prompts="D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e\sample_prompts.txt" ^
  --sample_every_n_steps="24" ^
  --learning_rate 8e-4 ^
  --cache_text_encoder_outputs ^
  --cache_text_encoder_outputs_to_disk ^
  --fp8_base ^
  --highvram ^
  --initial_step 30 ^
  --skip_until_initial_step ^
  --max_train_epochs 16 ^
  --save_every_n_epochs 1 ^
  --dataset_config "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e\dataset.toml" ^
  --output_dir "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e" ^
  --output_name kh-2e ^
  --timestep_sampling shift ^
  --discrete_flow_shift 3.1582 ^
  --model_prediction_type raw ^
  --save_every_n_steps 40 ^
  --save_state ^
  --guidance_scale 1 ^
  --resume "D:\Users\EddyE\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e\kh-2e-step00000030-state" ^
  --sample_at_first ^
  --log_config ^
  --save_state_on_train_end
REM  --initial_step 1
REM  --log_config
REM  --sample_at_first
AllyourBaseBelongToUs commented 3 weeks ago

> What setup do you have? What's the batch size? What's the resolution? I get roughly 3 seconds per step using batch size 6.
>
> I did not gain anything significant in terms of speed by playing with dataloaders, workers, etc. Batch size, however, yes.
>
> When training faces I crop everything but the face.

There are several of them though:

train_batch_size, train_encoder_batch_size, regular batch_size, and also vae_batch_size.

Which ones did you modify for the performance increase?

Tablaski commented 3 weeks ago

This is getting away from the main subject, so I will try to keep it to the point. The settings that made the most difference for me (RTX 4090, 16 GB VRAM) are:

--fp8_base: if I remove that, terrible performance (needs more VRAM).

--split_mode: if I remove that, it's a disaster (needs more VRAM), although perhaps it would work with batch size 1.

Resolution in dataset.toml: if I go to 1024, it's much slower.

Batch size in dataset.toml: you have to monitor your VRAM usage and see what's the optimal value you can reach. 6 is my sweet spot; 8 would be a disaster. It's basically how many pictures your GPU will process in parallel. With size 1 it actually takes longer than with size 6. You have to adjust your learning rate accordingly: the rule of thumb is to multiply it by the batch size and then lower it a bit to be careful (e.g. 8e-4 × 6 = 4.8e-3; I personally use 3e-3). If you don't increase it, it's the equivalent of being able to read six pages of a book at the same time but retaining only as much as if you read one page at a time.

=> This is the most important setting for improving speed; going from 1 to 6 cut my total training time by about 2.5x.

Other settings like mixed precision, loaders and CPU workers: I have played with them, but no big difference that I would remember. Just use a sufficient number of loaders and CPU workers, but not too many, otherwise it actually slows you down because the script has to synchronize many threads.

Also, I cannot emphasize this enough:

--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler constant_with_warmup

THIS SETTING IS ABSOLUTE COMPLETE SHIT, to the point that I'd like to ask @cocktailpeanut to remove it and replace it with this, which is amazing:

--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler cosine --lr_warmup_steps 0.2 --lr_decay_steps 0.8

Why?

constant_with_warmup: starts the learning rate extremely low, increases it with a linear function, then keeps it constant. That makes learning extremely slow at the start, takes many epochs to show results, then overfits the model in the later epochs.

Cosine with 20% warmup and 80% decay: increases the learning rate linearly for the first 20% of your steps, keeps it near the maximum for some time, then decreases it slowly, then faster during the second half of the steps, eventually landing very low as you reach the final epoch. This lets you train carefully at the beginning, before the trainer has seen your dataset fully, then learn properly, then slow down before risking overfitting, and then refine details without risking much in the later epochs.

THIS IS EVEN MORE IMPORTANT THAN SPEED

We're talking about a setting that will save you many resumes or complete re-trainings where you've just wasted your training time.

This new setting is very solid and has made me very confident, whereas with constant_with_warmup I would anxiously watch the average loss curve and resume multiple times to manually adjust the learning rate because things would go very wrong at some point.

I am NEVER using the default setting again.
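For reference, dropped into the train.bat format, the lines I change look like this (a sketch of the relevant fragment only, not fluxgym's generated file):

--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" ^
--lr_scheduler cosine ^
--lr_warmup_steps 0.2 ^
--lr_decay_steps 0.8 ^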


Tablaski commented 3 weeks ago

That's it, I just opened a new issue about cosine with warmup; we need to get rid of constant_with_warmup ASAP.