This will require changes to the config file and the `make_dataset` script.
The wider issue is that users need to be told that a test (hold-out, independent) dataset should be created for a true test of model skill. Validation metrics are only useful up to a point; they do not reveal how well a model performs out of distribution (different time/space/season/weather/etc than is represented in the training/validation dataset)
Eventually we could have a separate 'evaluation' dataset, where users specifically evaluate model performance against a test set of images and associated labels. Metrics would be generated that reveal skill on that test data. The test data could/should be added to over time as the ML project matures
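For illustration, here is a minimal sketch of what such an evaluation step could look like. None of these names come from Gym: `evaluate_on_test_set`, `test_ds`, and the choice of mean IoU as the metric are all assumptions.

```python
# Hypothetical sketch of an 'evaluation' step: score a trained model on a
# held-out test set with a standard segmentation metric (mean IoU).
# These names are illustrative only, not existing Gym code.
import tensorflow as tf

def evaluate_on_test_set(model, test_ds, num_classes):
    """Compute mean IoU over batches of (image, integer-label) pairs."""
    miou = tf.keras.metrics.MeanIoU(num_classes=num_classes)
    for images, labels in test_ds:
        # model outputs per-class scores; argmax gives the predicted class map
        preds = tf.argmax(model(images, training=False), axis=-1)
        miou.update_state(labels, preds)
    return miou.result().numpy()
```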
In the meantime, the README and wiki should be updated to state that users should create an independent test dataset for a true test of model skill
@ebgoldstein and I discussed this again earlier this week
`make_dataset` would need to be updated to:
- split the list of input files into train and validation sets
- write the npz files for each split into its own folder, with augmented imagery going into the train folder only

In `train_model`, we would:
- read the filenames for each split directly from the respective folders; VALIDATION_SPLIT would be removed, as would the filename shuffle (see the sketch below)
- remove MODE; there would be no need for it
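As a concrete illustration of that first point, here is a minimal sketch (not the actual Gym code; the folder names and `data_path` are assumptions) of how `train_model` could discover the files:

```python
# Sketch: read filenames for each split directly from per-split folders,
# instead of shuffling one list and slicing it with VALIDATION_SPLIT.
# The folder names 'train_data' and 'val_data' are assumptions.
import os
from glob import glob

data_path = "/my/npz4gym"  # hypothetical dataset root

train_files = sorted(glob(os.path.join(data_path, "train_data", "*.npz")))
val_files = sorted(glob(os.path.join(data_path, "val_data", "*.npz")))

# Split membership is now fixed by folder contents, so augmented train
# files can never leak into the validation set.
print(f"{len(train_files)} train files, {len(val_files)} val files")
```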
Any other thoughts, @ebgoldstein?
As presented above, it would be a moderate amount of work -- probably only 1-2 days
I could work on this next, @ebgoldstein. My proposed changes would remove some config parameters... perhaps the augmentation params could go in a different config? Happy to discuss
to me, the workflow would be (see the sketch after this list):

make_dataset:
- takes a list of files
- splits it into train/val lists (this code would move from `train_model` to `make_dataset`)
- in the npz4gym folder, creates a folder for train data and a folder for val data
- non-augmented train images are put into the train data folder
- non-augmented validation images are put into the val data folder
- train images are augmented according to the existing aug configs, and put into the train data folder
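Something like this, as a rough sketch; `augment_and_save`, the folder names, and the `val_frac` default are hypothetical stand-ins, not Gym's actual code:

```python
# Rough sketch of the proposed make_dataset workflow: split the file list
# once here (moved from train_model), write non-augmented npz files to each
# split's folder, and write augmented copies to the train folder only.
import os
import random
import shutil

def make_split_folders(files, out_dir, val_frac=0.2, seed=42):
    files = sorted(files)
    random.Random(seed).shuffle(files)  # shuffle happens here, not in train_model
    n_val = int(len(files) * val_frac)
    val_files, train_files = files[:n_val], files[n_val:]

    train_dir = os.path.join(out_dir, "train_data")
    val_dir = os.path.join(out_dir, "val_data")
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(val_dir, exist_ok=True)

    # non-augmented imagery goes into each split's folder
    for f in train_files:
        shutil.copy(f, train_dir)
    for f in val_files:
        shutil.copy(f, val_dir)

    # augmented imagery goes into the train folder only, never val;
    # 'augment_and_save' is hypothetical; Gym's existing aug configs apply here
    # for f in train_files:
    #     augment_and_save(f, train_dir)

    return train_files, val_files
```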
then in train_model:
- train_ds is made from the train folder
- val_ds is made from the val folder
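For example, a minimal tf.data sketch; the npz key names (`arr_0`, `arr_1`) and folder paths are assumptions, not Gym's actual loader:

```python
# Sketch: build separate tf.data pipelines from the two folders.
# Key names 'arr_0'/'arr_1' and the folder layout are assumptions.
import os
from glob import glob

import numpy as np
import tensorflow as tf

def _read_npz(path):
    # runs eagerly inside tf.py_function, so .numpy() is available
    with np.load(path.numpy().decode()) as d:
        return d["arr_0"].astype("float32"), d["arr_1"].astype("uint8")

def make_ds(folder, batch_size=8, shuffle=False):
    files = sorted(glob(os.path.join(folder, "*.npz")))
    ds = tf.data.Dataset.from_tensor_slices(files)
    if shuffle:
        ds = ds.shuffle(len(files))
    ds = ds.map(
        lambda p: tf.py_function(_read_npz, [p], [tf.float32, tf.uint8]),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_ds = make_ds("npz4gym/train_data", shuffle=True)  # aug + non-aug train
val_ds = make_ds("npz4gym/val_data")  # non-augmented validation only
```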
I think this would introduce no additional configs, remove the data leak between train and val, and ensure we always validate on non-augmented imagery.
oh, I see now you outlined something similar above :facepalm: ... sorry... reading it, I think we are on the same page..
I don't care too much about having a test set in gym.. I tend to do that later, but I am fine if it's incorporated too..
Yeah, I agree including a test set may be problematic for users with small datasets. Plus, the test images would likely all be drawn from the same distribution of imagery as the train and val sets, so they wouldn't be a good test of out-of-distribution application
I can make a new branch and implement this idea this week
I started a new branch and have started work on implementing this idea. More soon ...
Done. See https://github.com/Doodleverse/segmentation_gym#new-in-may-2023
commit: https://github.com/Doodleverse/segmentation_gym/commit/d11a3f63531cd9baf1575a9732dd8210781ae316
changes to:
- https://github.com/Doodleverse/segmentation_gym/blob/main/make_dataset.py
- https://github.com/Doodleverse/doodleverse_utils/blob/main/doodleverse_utils/make_mndwi_dataset.py
- https://github.com/Doodleverse/doodleverse_utils/blob/main/doodleverse_utils/make_ndwi_dataset.py
see also https://pypi.org/project/doodleverse-utils/0.0.30/
Tested on several datasets.
Create a way to add more training / validation options, including