USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal
21 stars 15 forks source link

Fixes to offset and dropout #154

Closed SimonTopp closed 2 years ago

SimonTopp commented 2 years ago

This PR updates the RGCN model to work with dropout and recurrent dropout and fixes offset and keep_last_frac to do two things:

janetrbarclay commented 2 years ago

these lines aren't from this PR, but this comment is now incorrect, yes? (the assumption of 1 continuous block and non-overlapping trn / tst dates) https://github.com/USGS-R/river-dl/blob/2f1792aec5cbb0175daa715d29604fa02a060587/river_dl/preproc_utils.py#L89-L91

janetrbarclay commented 2 years ago

I might have missed something, but it's unclear to me how this fixes the issue of the weird gaps in the time series. From what I've seen it isn't only an issue with varying the offset, but from having discontinuous date ranges for the partition data. Seems like maybe we need to force a new batch when there's a break in the dates for the partition? (which then raises the question of what to do with the incomplete batches - drop the data or make a batch with a smaller offset?) - I haven't dug far into this PR, so I very well may have missed something.

SimonTopp commented 2 years ago

I might have missed something, but it's unclear to me how this fixes the issue of the weird gaps in the time series.

@janetrbarclay The gaps you're referring to and the gaps this fixes are different. The way split_into_batches was written would drop out a portion of the data annually so you'd end up with regular gaps in your all your partitions. This doesn't fix the discontinuous training sequences you're referencing. I looked into doing that, but with the current data-prep structure it's actually a non-trivial lift. We should do it, but in the interest of getting this PR through I'd say lets set it aside for another PR

SimonTopp commented 2 years ago

these lines aren't from this PR, but this comment is now incorrect, yes?

@janetrbarclay, I went ahead and updated it for clarity to read

This assumes your training data is in one continuous block. Be aware, if your train/test/val partitions are discontinuous (composed of multiple periods), you'll end up with sequences starting in one period and ending in another.