USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

Multiple train/test splits result in discontinuous batches #127

Open SimonTopp opened 3 years ago

SimonTopp commented 3 years ago

https://github.com/USGS-R/river-dl/blob/a7629eb2aa1bd4c33a7c00fc6d650d7c6bd9ee3f/river_dl/preproc_utils.py#L112-L117

Here, if we have discontinuous training and testing groups (i.e., two sets of date ranges for each) and the batch length is set to anything other than 365, then I think this results in one batch that starts in the first date range and ends in the second. I think we should first group by water year, then split into batches, and just pad and/or drop the last one. What do you all think?
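A minimal sketch of that idea, assuming daily inputs in a pandas DataFrame indexed by date; the helper names (`water_year`, `split_into_sequences`) and the `seq_len` parameter are illustrative, not river-dl's API:

```python
import numpy as np
import pandas as pd


def water_year(dates: pd.DatetimeIndex) -> np.ndarray:
    # Water year N runs from Oct 1 of year N-1 through Sep 30 of year N.
    return dates.year.values + (dates.month.values >= 10).astype(int)


def split_into_sequences(df: pd.DataFrame, seq_len: int = 365) -> np.ndarray:
    """Group by water year, then chunk each group into fixed-length
    sequences, padding the last chunk of each group with NaN so that
    no sequence can span a gap between date ranges."""
    batches = []
    for _, group in df.groupby(water_year(df.index)):
        values = group.to_numpy(dtype=float)  # assumes numeric feature columns
        n_seq = int(np.ceil(len(values) / seq_len))
        padded = np.full((n_seq * seq_len, values.shape[1]), np.nan)
        padded[: len(values)] = values
        batches.append(padded.reshape(n_seq, seq_len, values.shape[1]))
    return np.concatenate(batches, axis=0)
```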

jsadler2 commented 2 years ago

Interesting. It's been a while since I wrote this (or thought about this ... or used this 😄). Have you confirmed that this is what happens?

jdiaz4302 commented 2 years ago

This may be of interest as confirmation that multiple train/test splits result in discontinuous ~batches~ sequences.

[image: discontinuous_sequence_dates]
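For reference, one quick way to flag such sequences, assuming a hypothetical `dates` array of shape (n_sequences, seq_len) holding each sequence's datetime64 values (not part of river-dl):

```python
import numpy as np


def find_discontinuous_sequences(dates: np.ndarray) -> np.ndarray:
    """Return indices of sequences whose consecutive dates jump by more than one day."""
    one_day = np.timedelta64(1, "D")
    gaps = np.diff(dates, axis=1)                 # per-step time deltas
    return np.where((gaps > one_day).any(axis=1))[0]
```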

janetrbarclay commented 2 years ago

Further confirmation if you look at the temps in a single sample (these are observed temps; the number in the title is the seg_id): [image]

jdiaz4302 commented 2 years ago

Using the existing reduce_training_data_continuous function from river_dl/preproc_utils.py can help get continuous batches with NaN values. For example, here is the 365-day sequence for the pretraining and finetuning Ys when I applied it to only the finetuning Y (the gap in the finetuning Y is where the NaNs have been placed, i.e., summer):

[image: continuous_batch]
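For context, a rough sketch of the general idea of masking only the finetuning y outside the chosen date ranges while leaving x continuous; this is not the actual reduce_training_data_continuous implementation, and `mask_y_outside_ranges` is a hypothetical helper:

```python
import numpy as np
import pandas as pd


def mask_y_outside_ranges(y: pd.Series, date_ranges) -> pd.Series:
    """Return a copy of y (a Series indexed by date) with NaN everywhere
    outside the given (start, end) date ranges."""
    keep = np.zeros(len(y), dtype=bool)
    for start, end in date_ranges:
        keep |= (y.index >= start) & (y.index <= end)
    return y.where(keep)  # out-of-range observations become NaN
```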

If you apply reduce_training_data_continuous to the x variables, you end up with NaNs in the predictions and, subsequently, in the loss function. Taking this approach in https://github.com/USGS-R/river-dl/pull/142 by applying reduce_training_data_continuous to only the Y array (and not the pretraining Y or X arrays) led to much worse RMSE (by a factor of 2). I assume this is because the model is exposed to out-of-bounds x values that have no corresponding Y values (those are set to NaN) but are still part of the same 365-day sequence as the other values, which may lead to some misleading learning.
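For context, a minimal sketch of an NaN-masked RMSE (not the repo's exact loss function), illustrating why NaNs in the observations are simply dropped from the loss while NaNs in x, which propagate into the predictions, are not tolerated:

```python
import numpy as np


def rmse_masked(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE computed only where the observation is not NaN."""
    mask = ~np.isnan(y_true)              # drop missing observations
    diff = y_pred[mask] - y_true[mask]    # NaN predictions here would still poison the mean
    return float(np.sqrt(np.mean(diff ** 2)))
```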

jds485 commented 1 year ago

I think this issue has been addressed by #218.