Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0
1.81k stars 153 forks source link

added Skip Save interval before automatic saving starts #482

Closed dxqbYD closed 1 month ago

dxqbYD commented 1 month ago

Automatic saving is used to have multiple checkpoints after training, so you can choose the best one. However, you know in advance that the first x000 steps will not be the best one. This PR adds a function to skip saving in those first x000 steps.

This example saves every 500 steps, but not within the first 3000:

grafik

Especially useful for full finetunes with large file sizes. The first few checkpoints waste both disk space and the time it takes to save.

Arcitec commented 1 month ago

By the way, the GUI has become very confusing now.

How about this?

Something like that? Or perhaps there's an even better way to label this?

They could be visually relabeled without changing the internal save_after setting's name for backwards compatibility reasons. Or perhaps that can be changed too. Nero will be a much better judge of that.

Nerogar commented 1 month ago

Do you think there is a use case for using different units for save_after_unit and skip_save_unit? If not, what do you think about adding only a new value so we have save_after, save_after_skip, save_after_unit. That could be displayed in a single component with these three inputs and maybe a better label. leaving the save_after_skip input empty would disable skipping and always save.

They could be visually relabeled without changing the internal save_after setting's name for backwards compatibility reasons

changing the name of the setting would be possible by adding a migration to TrainConfig

I think it would also make sense to add the same functionality for backups.

Arcitec commented 1 month ago

Yeah, having only a single unit selector dropdown seems to make sense. I can't think of any scenario where we'd want to use "Save Every 500 Steps, But Skip Early Steps: 1 hour". That's a very confusing way of thinking about it. It makes more sense to use the same unit for both things.

Adding a setting-migration to use the clearer label save_every is a good idea too.

And yeah, having this functionality for backups would be a nice bonus.

dxqbYD commented 1 month ago

Do you think there is a use case for using different units for save_after_unit and skip_save_unit?

No, but after seeing that the number entry and unit is one combined UI widget I went for this to keep the PR simple. If you all agree that this PR will be merged, I'll look into a better UI to have only one unit.

I think it would also make sense to add the same functionality for backups.

I think rolling backups are already that functionality best suited for backups. You want to keep all saves except the early ones (skip_save), but you want the early backups until they are outdated (rolling backups)

Arcitec commented 1 month ago

If you all agree that this PR will be merged

Yeah it's a great idea! Let's continue it. :)

As for displaying it in a single component/label row, it might be hard to make that clear for the users. But maybe something like this:

Save Every / Skip Early:   [  500 ]    [    0 ]    [  STEP      ]

But that may not fit the label width of the other rows, and may still not be clear enough.

So maybe two rows:

Save Every:   [  500 ]    [  STEP      ]
Skip Early:   [    0 ]
djp3k05 commented 1 month ago

this is a very welcomed addition. Before this I was thinking about an option to specify multiple exact numbers of steps/epochs, separated by a comma, to be saved. But also this is a good enough option. Hope it will be implemented soon. Thanks.

dxqbYD commented 1 month ago

grafik

Internal parameter names adjusted and config migration added

For this I needed to add a parameter to util.ui.components.entry and noticed a lot of code duplication there. Carefully checked: can be removed without changing anything

Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

O-J1 commented 1 month ago

... Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

Just as an aside I can think of instances where the user will want the first few epochs then after that none for awhile, this is particularly useful for finetunes like my group did. Needed to watch and test the model carefully for the first 10 epochs but after that it was just slow n steady up until we reached 40 epochs.

Nerogar commented 1 month ago

Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

No, it's not intentional. And since this is only a type hint that's not actually really used apart from some linting, I've changed it.

There's also a small issue with the implementation of single_action_elapsed(). When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.

Arcitec commented 1 month ago

Excellent work @dxqbYD! :)

When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.

If it's just a timestamp, it's doable by storing the elapsed time in the backup and then creating a fake start_time based on that when resuming. Generating a date object that's "now() minus X seconds ago" to restore the backed up "elapsed time".

Should we make a ticket to track the bug, or let it be? It's not a big bug.