Fixes to offset and dropout

USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems

Creative Commons Zero v1.0 Universal

21 stars 15 forks source link

Fixes to offset and dropout #154

Closed SimonTopp closed 2 years ago

SimonTopp commented 2 years ago

This PR updates the RGCN model to work with dropout and recurrent dropout and fixes offset and keep_last_frac to do two things:

The previous version of split_into_batches created gaps in train/tst/val partitions for certain offset values (see "wonky time series" in #152), the new version fixes that.
You can now pass either a fraction or constant number as offset and keep_last_frac. Values below one are taken as a fraction to keep, values > 1 are a constant number to offset by or keep at the end of the prediction sequence.

janetrbarclay commented 2 years ago

these lines aren't from this PR, but this comment is now incorrect, yes? (the assumption of 1 continuous block and non-overlapping trn / tst dates) https://github.com/USGS-R/river-dl/blob/2f1792aec5cbb0175daa715d29604fa02a060587/river_dl/preproc_utils.py#L89-L91

janetrbarclay commented 2 years ago

I might have missed something, but it's unclear to me how this fixes the issue of the weird gaps in the time series. From what I've seen it isn't only an issue with varying the offset, but from having discontinuous date ranges for the partition data. Seems like maybe we need to force a new batch when there's a break in the dates for the partition? (which then raises the question of what to do with the incomplete batches - drop the data or make a batch with a smaller offset?) - I haven't dug far into this PR, so I very well may have missed something.

SimonTopp commented 2 years ago

I might have missed something, but it's unclear to me how this fixes the issue of the weird gaps in the time series.

@janetrbarclay The gaps you're referring to and the gaps this fixes are different. The way split_into_batches was written would drop out a portion of the data annually so you'd end up with regular gaps in your all your partitions. This doesn't fix the discontinuous training sequences you're referencing. I looked into doing that, but with the current data-prep structure it's actually a non-trivial lift. We should do it, but in the interest of getting this PR through I'd say lets set it aside for another PR

SimonTopp commented 2 years ago

these lines aren't from this PR, but this comment is now incorrect, yes?

@janetrbarclay, I went ahead and updated it for clarity to read

This assumes your training data is in one continuous block. Be aware, if your train/test/val partitions are discontinuous (composed of multiple periods), you'll end up with sequences starting in one period and ending in another.