Closed: dbuscombe-usgs closed this 1 year ago
Despite the name, I didn't implement the checkpoints, but I left the code in there in case we want to add that later
I think both of these changes to train_model.py -- run_eagerly=True and the new callback -- should be hidden config options, and not part of the default configuration.
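For illustration, here is a minimal sketch of how such a hidden option could be wired in. This is not actual segmentation_gym code: the `RUN_EAGERLY` key name and the `config.json` layout are assumptions made for the example.

```python
# Minimal sketch (hypothetical key name): gate run_eagerly behind an optional
# config key so existing config files keep working unchanged.
import json

import tensorflow as tf

try:
    with open("config.json") as f:
        config = json.load(f)
except FileNotFoundError:
    config = {}

# default to False so behaviour is unchanged when the key is absent
run_eagerly = config.get("RUN_EAGERLY", False)

# tiny stand-in model, just so the sketch is self-contained
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", run_eagerly=run_eagerly)
```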
why? there is no downside - training is just as quick
My thinking is this: we know the adjustments are needed for your setup and that they don't degrade performance for you. What I'm not clear on is whether they are needed for everyone else, and whether they degrade performance for others (i.e., with different hardware).
but I 100% leave it to you to decide...
I hear you ... however, I have tested on 3 machines (2 Windows, 1 Ubuntu) and it worked well ... plus I benchmarked the test dataset model, and it trained just as quickly. I did not, however, try without mixed precision. And it is possible that different versions of dependencies (e.g. conda versus pip installs) may behave differently.
I know you're not set up with conda to test the latest conda env, so perhaps we could ask @CameronBodine to try out this branch and give feedback? He has access to different hardware, as well as Windows and Linux
I guess I'm kind of reluctant to add more config flags, but it's not a big deal to add another one. In this instance, though, I think the change is minor and won't break anyone's setup. Going forward, if we keep adding config flags, we may want to come up with a better way to parse them out, or perhaps have separate config files for training and deployment.
I'm also more than happy to let this branch sit here for a while so we can test it. Right now, perhaps this issue only affects situations where the number of training samples is very large?
Ok I suppose I should make a decision here. I'll compromise by making it a config option. I see your point about it only currently being useful for me. In time it may get incorporated.
However, in the long term I think having 46+ possible config items, some of which are mandatory and some of which are optional, is a clunky way to organize this. We either:
What do you say @2320sharon and @venuswku ?
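As a possible direction for the config discussion above, here is a hypothetical sketch of one way to handle a mix of mandatory and optional items: merge user-supplied values over documented defaults and fail fast on missing mandatory keys. The key names are illustrative, not real segmentation_gym settings.

```python
# Hypothetical config loader: defaults for optional items, validation for
# mandatory ones. Key names are made up for the example.
import json

DEFAULTS = {"RUN_EAGERLY": False, "CLEAR_MEMORY": False}  # optional, with defaults
MANDATORY = ("MODEL", "BATCH_SIZE")                       # must appear in the user config


def load_config(path):
    with open(path) as f:
        user = json.load(f)
    missing = [k for k in MANDATORY if k not in user]
    if missing:
        raise KeyError(f"missing mandatory config keys: {missing}")
    return {**DEFAULTS, **user}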
See https://github.com/Doodleverse/segmentation_gym/issues/117
This PR adds a memory-clearing loop at the end of each training epoch, as well as passing run_eagerly=True to each model.compile call. This appears to eliminate the memory leak when training times are long on large datasets.
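To illustrate the idea, here is a minimal sketch of one common way to implement it (the exact PR code may differ): a Keras callback that forces garbage collection after every epoch, combined with run_eagerly=True at compile time. The tiny model and random data are stand-ins so the sketch runs end to end.

```python
# Sketch only, not the PR's exact implementation.
import gc

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K


class ClearMemory(tf.keras.callbacks.Callback):
    """Release Python-side references at the end of each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        # often added in leak workarounds; note it also resets global Keras state
        K.clear_session()


model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=16, callbacks=[ClearMemory()], verbose=0)
```

With run_eagerly=True, each batch runs in eager mode rather than as a compiled graph, which can make per-step memory easier to reclaim at the cost of some potential speed on other setups, which is why it was discussed as an opt-in config item above.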