Closed dxqbYD closed 1 month ago
By the way, the GUI has become very confusing now.
How about this?
Something like that? Or perhaps there's an even better way to label this?
They could be visually relabeled without changing the internal save_after
setting's name for backwards compatibility reasons. Or perhaps that can be changed too. Nero will be a much better judge of that.
Do you think there is a use case for using different units for save_after_unit
and skip_save_unit
? If not, what do you think about adding only a new value so we have save_after
, save_after_skip
, save_after_unit
. That could be displayed in a single component with these three inputs and maybe a better label. leaving the save_after_skip
input empty would disable skipping and always save.
They could be visually relabeled without changing the internal
save_after
setting's name for backwards compatibility reasons
changing the name of the setting would be possible by adding a migration to TrainConfig
I think it would also make sense to add the same functionality for backups.
Yeah, having only a single unit selector dropdown seems to make sense. I can't think of any scenario where we'd want to use "Save Every 500 Steps, But Skip Early Steps: 1 hour". That's a very confusing way of thinking about it. It makes more sense to use the same unit for both things.
Adding a setting-migration to use the clearer label save_every
is a good idea too.
And yeah, having this functionality for backups would be a nice bonus.
Do you think there is a use case for using different units for
save_after_unit
andskip_save_unit
?
No, but after seeing that the number entry and unit is one combined UI widget I went for this to keep the PR simple. If you all agree that this PR will be merged, I'll look into a better UI to have only one unit.
I think it would also make sense to add the same functionality for backups.
I think rolling backups are already that functionality best suited for backups. You want to keep all saves except the early ones (skip_save), but you want the early backups until they are outdated (rolling backups)
If you all agree that this PR will be merged
Yeah it's a great idea! Let's continue it. :)
As for displaying it in a single component/label row, it might be hard to make that clear for the users. But maybe something like this:
Save Every / Skip Early: [ 500 ] [ 0 ] [ STEP ]
But that may not fit the label width of the other rows, and may still not be clear enough.
So maybe two rows:
Save Every: [ 500 ] [ STEP ]
Skip Early: [ 0 ]
this is a very welcomed addition. Before this I was thinking about an option to specify multiple exact numbers of steps/epochs, separated by a comma, to be saved. But also this is a good enough option. Hope it will be implemented soon. Thanks.
Internal parameter names adjusted and config migration added
For this I needed to add a parameter to util.ui.components.entry and noticed a lot of code duplication there. Carefully checked: can be removed without changing anything
Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first
... Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first
Just as an aside I can think of instances where the user will want the first few epochs then after that none for awhile, this is particularly useful for finetunes like my group did. Needed to watch and test the model carefully for the first 10 epochs but after that it was just slow n steady up until we reached 40 epochs.
Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first
No, it's not intentional. And since this is only a type hint that's not actually really used apart from some linting, I've changed it.
There's also a small issue with the implementation of single_action_elapsed(). When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.
Excellent work @dxqbYD! :)
When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.
If it's just a timestamp, it's doable by storing the elapsed time in the backup and then creating a fake start_time based on that when resuming. Generating a date object that's "now() minus X seconds ago" to restore the backed up "elapsed time".
Should we make a ticket to track the bug, or let it be? It's not a big bug.
Automatic saving is used to have multiple checkpoints after training, so you can choose the best one. However, you know in advance that the first x000 steps will not be the best one. This PR adds a function to skip saving in those first x000 steps.
This example saves every 500 steps, but not within the first 3000:
Especially useful for full finetunes with large file sizes. The first few checkpoints waste both disk space and the time it takes to save.