Feedback on potentially significant changes

SimonTopp commented 2 years ago

I'm planning to make some potentially significant changes to the repository next week, and I wanted to create a space to chat about them before I actually do it and submit the PR. The two main things I'm planning are:

Update the training routine to validate the model after each epoch and save the weights for the final model as the epoch with the best validation. This is functionally similar to implementing an early stopping metric and should help avoid overfitting if there is any.
Update the pre-training to pre-train on the entire dataset instead of just the train partition.

Two specific questions for you all, @aappling-usgs, @jsadler2, @janetrbarclay , @jzwart. 1) Do you see any issues with these changes, and 2) Should these changes be implemented as arguments (i.e. you could still run it the old way) or should they just be the defaults/hard coded in. Let me know what you think! I'll probably get to this mid-week next week, so no rush.

janetrbarclay commented 2 years ago

Having spent some time trying to recreate prior results, it seems like there's some benefit to being able to relatively easily run it the old way. May not need to be an argument in the config file, but maybe something in the Snakefile?

Would this involve splitting the training rule into two separate rules since the inputs are different or would you handle that within the train function? I've lost track of the number of times with the gw loss function that the model has successfully completed the pretraining but failed on the fine-tuning and needed to start over. If it's an easy lift, would it be worth separating these so only the second needed to be redone?

Janet Barclay U.S. Geological Survey New England Water Science Center Connecticut Office 101 Pitkin St. East Hartford, CT 06108

Phone (office) 860 291-6763 Fax 860 291-6799 Email @.**@*.**@*.***> https://www.usgs.gov/staff-profiles/janet-barclay

From: Simon Topp @.> Sent: Friday, October 22, 2021 11:34 AM To: USGS-R/river-dl @.> Cc: Barclay, Janet R @.>; Mention @.> Subject: [EXTERNAL] [USGS-R/river-dl] Feedback on potentially significant changes (Issue #135)

This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.

I'm planning to make some potentially significant changes to the repository next week, and I wanted to create a space to chat about them before I actually do it and submit the PR. The two main things I'm planning are:

Update the training routine to validate the model after each epoch and save the weights for the final model as the epoch with the best validation. This is functionally similar to implementing an early stopping metric and should help avoid overfitting if there is any.
Update the pre-training to pre-train on the entire dataset instead of just the train partition.

Two specific questions for you all, @aappling-usgshttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Faappling-usgs&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820101726%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=yrcDZC%2FQLv8HUGGYvFLXtiPQIXHxi3ExjsxgHi8OjvI%3D&reserved=0, @jsadler2https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjsadler2&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820111682%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=8yh1ZDXDUZ2v%2BNoi1J0LfW15wY340KLOzGN4nNAXRLA%3D&reserved=0, @janetrbarclayhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjanetrbarclay&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820111682%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=1ZHosW65W2cvZHPF0989FXCAR8nzzpRzhYTwrclOxgY%3D&reserved=0 , @jzwarthttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fjzwart&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820121640%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=6JzD4GV6kbrv85ep9Jwh5Wssjv%2Bex4iQZGLhfYNsNzw%3D&reserved=0. 1) Do you see any issues with these changes, and 2) Should these changes be implemented as arguments (i.e. you could still run it the old way) or should they just be the defaults/hard coded in. Let me know what you think! I'll probably get to this mid-week next week, so no rush.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FUSGS-R%2Friver-dl%2Fissues%2F135&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820121640%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=T8HTwws96688CpM%2F7SN1WjW%2BwPDunmn5y8tFw1bP99c%3D&reserved=0, or unsubscribehttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA5H7UFWPSAQ4BOJNXHCVPTUIF77FANCNFSM5GQ2ONGA&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820131595%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UMuEw8TGY636uffGZqUrYV610unbmcA7S9pSttDhkRY%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820131595%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BTcf9APeMVpg6EYzKc8XaMad4KY%2BSiAZU4FjyBEQRbI%3D&reserved=0 or Androidhttps://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cjbarclay%40usgs.gov%7Cb4434da2622a46e7856e08d99571762f%7C0693b5ba4b184d7b9341f32f400a5494%7C0%7C0%7C637705136820141559%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=dW6R%2F6G%2FvLI6JLSqHK3V8mU4Czx4pOjsp5AhjWBopCo%3D&reserved=0.

jzwart commented 2 years ago

those seems like good additions to me and I'd recommend for having those as arguments with the defaults as you have proposed. I think making use of releases / tags for running the old versions would be best for recreating old model results - e.g. these results were run with model v0.1.4 w/ XX config files (or something like that).

@jdiaz4302 might have some input too

SimonTopp commented 2 years ago

That all sounds good to me. It makes sense that we'll want to re-create old results down the line. With that in mind, does it make sense to archive the current version of the repository with a version tag? And from here on out whenever we make significant changes to the pipeline we can update the version number and it'll get logged somewhere in the output? Maybe in Janet's asRun-config? @janetrbarclay, I wasn't planning to break training into two separate parts. I think we could, but what feels most right to me is fixing the underlying bug that breaks fine tuning rather than creating work-arounds to accommodate it.

jsadler2 commented 2 years ago

I like these ideas.

versioning I like the idea of having specific versions/tags when big(ish) changes like this are pulled in. One worry I have is that that may become just one more thing to keep track of. So I'm not totally convinced that it's worth doing "one more thing" with PRs, but at the same time I for sure see the value of being able to reference a certain point where the code was.
pretrain data I agree that it makes sense to have the default of pretraining on all the data available
early stop Before you start writing custom code for 1, @SimonTopp, you might want to look into this that is already written by TF: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping. If the idea is just to do early stopping, that might fit the bill and would be cheaper to implement and maintain
separate training steps Regarding @janetrbarclay's point about breaking into different training routines, I had the thought recently of a way to make this more flexible which would address her problem and the issue of pretraining on a different dataset. Here it is: could we just have very flexible/generic training function that takes as the input basically the model, the data, epochs, the output file name? That way you could just pass the model as an argument and then make different snakemake rules that call that generic train function for different parts of the training routine.

jdiaz4302 commented 2 years ago

I like the idea of versions if they're used to mark significant changes as you move forward, but I see Jeff's point that more versions could be more hassle especially if versions are used as alternate workflows rather than past workflows (more of an issue with how we use what we have rather than an issue with having it).

Also agree with new arguments with default values especially for the early stopping (I'm less clear on why someone would want to disable pretraining on the whole pretraining data set in favor of a portion of it - maybe just reproducing previous work?).

I think having the pretraining separate from the training is in line with the pipeline way of thinking since it would be breaking two expensive tasks apart, but that could be a separate issue and I'm not familiar with this project enough to know where that implementation lies between trivial and a hassle.

SimonTopp commented 2 years ago

Sounds good. I just rewrote the training routine this morning to allow for early stopping and to function more as a modular training function like @jsadler2 recommended. I'll plan to make the updates arguments so it's easy to reproduce our old approach. I got a little sidelined this week, but I'll try to push a PR early next week with everything.

SimonTopp commented 2 years ago

Addressed in #141.

USGS-R / river-dl

Feedback on potentially significant changes #135