Closed: dbuscombe-usgs closed this 1 year ago
Despite the name, I didn't implement the checkpoints, but I left the code in there in case we want to add that later
I think both of these changes to train_model.py -- run_eagerly=True and the new callback -- should be hidden config options, and not part of the default configuration.
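For illustration, here is a minimal sketch of how such a hidden option could be wired in. This is not actual segmentation_gym code: the `RUN_EAGERLY` key name and the `config.json` layout are assumptions made for the example.

```python
# Minimal sketch (hypothetical key name): gate run_eagerly behind an optional
# config key so existing config files keep working unchanged.
import json

import tensorflow as tf

try:
    with open("config.json") as f:
        config = json.load(f)
except FileNotFoundError:
    config = {}

# default to False so behaviour is unchanged when the key is absent
run_eagerly = config.get("RUN_EAGERLY", False)

# tiny stand-in model, just so the sketch is self-contained
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", run_eagerly=run_eagerly)
```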
why? there is no downside - training is just as quick
My thinking is this: we know the adjustments are needed for your setup and that they don't degrade performance for you. What I'm not clear on is whether they are needed for everyone else, and whether they degrade performance for others (i.e., with different hardware).
but I 100% leave it to you to decide...
I hear you ... however, I have tested on 3 machines (2 Windows, 1 Ubuntu) and it worked well ... plus I benchmarked the test dataset model, and it trained just as quickly. I did not, however, try without mixed precision. And it is possible that different versions of dependencies (e.g. conda versus pip installs) may behave differently.
I know you're not set up with conda to test the latest conda env, so perhaps we could ask @CameronBodine to try out this branch and give feedback? He has access to different hardware, as well as Windows and Linux
I guess I'm kind of reluctant to add more config flags, but it's not a big deal to add another one. In this instance, though, I think the change is minor and won't break anyone's setup. Going forward, if we keep adding config flags, we may want to come up with a better way to parse them out, or perhaps have separate config files for training and deployment.
I'm also more than happy to let this branch sit here for a while so we can test it. Right now, perhaps this issue only affects situations where the number of training samples is very large?
Ok I suppose I should make a decision here. I'll compromise by making it a config option. I see your point about it only currently being useful for me. In time it may get incorporated.
However, in the long term I think having 46+ possible config items, some of which are mandatory and some of which are optional, is a clunky way to organize this. We either:
What do you say @2320sharon and @venuswku ?
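As a possible direction for the config discussion above, here is a hypothetical sketch of one way to handle a mix of mandatory and optional items: merge user-supplied values over documented defaults and fail fast on missing mandatory keys. The key names are illustrative, not real segmentation_gym settings.

```python
# Hypothetical config loader: defaults for optional items, validation for
# mandatory ones. Key names are made up for the example.
import json

DEFAULTS = {"RUN_EAGERLY": False, "CLEAR_MEMORY": False}  # optional, with defaults
MANDATORY = ("MODEL", "BATCH_SIZE")                       # must appear in the user config


def load_config(path):
    with open(path) as f:
        user = json.load(f)
    missing = [k for k in MANDATORY if k not in user]
    if missing:
        raise KeyError(f"missing mandatory config keys: {missing}")
    return {**DEFAULTS, **user}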
See https://github.com/Doodleverse/segmentation_gym/issues/117
This PR adds a memory-clearing loop at the end of each training epoch, as well as passing run_eagerly=True to each model.compile call. This appears to eliminate the memory leak when training times are long on large datasets.
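To illustrate the idea, here is a minimal sketch of one common way to implement it (the exact PR code may differ): a Keras callback that forces garbage collection after every epoch, combined with run_eagerly=True at compile time. The tiny model and random data are stand-ins so the sketch runs end to end.

```python
# Sketch only, not the PR's exact implementation.
import gc

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K


class ClearMemory(tf.keras.callbacks.Callback):
    """Release Python-side references at the end of each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        # often added in leak workarounds; note it also resets global Keras state
        K.clear_session()


model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=16, callbacks=[ClearMemory()], verbose=0)
```

With run_eagerly=True, each batch runs in eager mode rather than as a compiled graph, which can make per-step memory easier to reclaim at the cost of some potential speed on other setups, which is why it was discussed as an opt-in config item above.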