USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

Pad train/val/test data #218

Closed jds485 closed 1 year ago

jds485 commented 1 year ago

This PR adds function arguments that allow a user to pad the training, validation, and/or testing datasets so that data are not trimmed. For the inland salinity dataset, this resulted in 5,000 to 20,000 more observations being used.
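To make the idea concrete, here is a minimal sketch (hypothetical helper, not river-dl's actual API) of padding the targets with NaN so the trailing, shorter sequence is kept instead of trimmed:

```python
# Minimal sketch of the padding idea (hypothetical helper, not river-dl's API).
import numpy as np

def pad_to_multiple(y, seq_len):
    """Pad a (n_times, n_segs) array with NaN so n_times % seq_len == 0."""
    n_pad = (-y.shape[0]) % seq_len                  # 0 if already a multiple
    pad_width = [(0, n_pad)] + [(0, 0)] * (y.ndim - 1)
    return np.pad(y, pad_width, constant_values=np.nan)

y = np.arange(10.0).reshape(10, 1)                   # 10 time steps, 1 segment
assert pad_to_multiple(y, seq_len=4).shape[0] == 12  # 2 NaN rows appended
```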

I tested this for cases where the train/val/test partition is a continuous time period. I think if a partition is a discontinuous time period (e.g., training is 2000-01-01 to 2005-01-01 and 2010-01-01 to 2015-01-01), that would require a different coding approach. For example, pad each of the continuous periods within the discontinuous blocks so that batches are defined for each of the continuous periods. That might also address #127. Curious about your thoughts on how best to approach padding for discontinuous time periods.

Closes: #217

jds485 commented 1 year ago

How should we pad discontinuous partitions?

  1. We could make discontinuous partitions continuous when creating batches by filling in the gap with observed x data and np.nan y data. This assumes that we would always have the observed x data to fill in the gaps, which I think is true. @janetrbarclay notes that if we use this approach, we could end up with batches with no non-NaN obs, so a follow-up step could be to remove the batches with all-NaN obs.

  2. Alternatively, we could treat each section of the discontinuous partitions separately for the purposes of padding. Then batches would always start with observed data (a sketch of the splitting step follows this list).

    • The same question arises about how hidden and cell states are handled.
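
A hedged sketch of the splitting step behind option 2 (hypothetical helper names, assuming a regular timestep): detect the gaps in the time index, then treat each continuous run as its own unit for padding and batching.

```python
# Sketch of option 2's splitting step (hypothetical names, not the PR's code).
import numpy as np

def split_at_gaps(times, timestep):
    """Return index arrays, one per continuous run of `times`."""
    gap_starts = np.where(np.diff(times) != timestep)[0] + 1
    return np.split(np.arange(len(times)), gap_starts)

times = np.array([0, 1, 2, 3, 10, 11, 12])   # gap between 3 and 10
chunks = split_at_gaps(times, timestep=1)
# chunks -> [array([0, 1, 2, 3]), array([4, 5, 6])]
```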
jdiaz4302 commented 1 year ago

how are cell and hidden states handled from batch to batch in this implementation? It seems like they are passed from the previous batch to the current batch, but I want to double check.

I believe init_states was included because Simon was adding PyTorch compatibility for GWN, and I shared the PyTorch RGCN version that I coded up for DRB forecasting, where providing h/c is needed for data assimilation. I don't think anyone uses it in this repo; pretty sure it's the default init_states=None being used. Others should definitely feel free to correct me if this is wrong.
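
For illustration, a generic PyTorch snippet (not river-dl's RGCN) showing the behavior described: omitting initial states zero-initializes h/c, which matches the default init_states=None, while passing (h, c) explicitly is what a data-assimilation workflow would use.

```python
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 10, 4)   # (batch, seq_len, n_features)
out, (h, c) = lstm(x)       # h/c default to zeros (init_states=None analog)
out2, _ = lstm(x, (h, c))   # states supplied explicitly, e.g. for assimilation
```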

jdiaz4302 commented 1 year ago

I'm super rusty on this repo as a project workflow (i.e., beyond its general modeling methods), so do I understand the original problem correctly:

  • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

If I understand it well enough, I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together; instead you'll just have two NaN-heavier sequences, which sounds more appropriate and realistic for some hypothetical operational usage.

jds485 commented 1 year ago

do I understand the original problem correctly

  • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

Yes, that's correct.

I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together

I think option 1 would also achieve this, after dropping any sequences that have only NaN data. For a partition with [data1, gap, data2], we would fill in the gap with observed x data and set y to NaN in that gap. The difference would be that the NaNs in data2 could appear at the start of the first and end of the last sequence for option 1, but only at the end of the last sequence for option 2. So I think there could be fewer overall NaNs using option 2. If cell and hidden states are not carried over from batch to batch, that's also a good case for option 2.
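
A toy comparison under assumed shapes (seq_len = 3; the gap is NaN in y only, since x would still be observed there):

```python
import numpy as np

seq_len = 3
data1 = np.array([1.0, 2.0])         # observed block 1
gap = np.full(4, np.nan)             # 4-step gap in y
data2 = np.array([3.0, 4.0, 5.0])    # observed block 2

# pad a series with NaN up to a multiple of seq_len
pad = lambda a: np.pad(a, (0, (-len(a)) % seq_len), constant_values=np.nan)

opt1 = pad(np.concatenate([data1, gap, data2]))
# sequences: [1 2 .] [. . .] [3 4 5]  -> one all-NaN sequence to drop

opt2 = [pad(d) for d in (data1, data2)]
# sequences: [1 2 .] | [3 4 5]        -> NaNs only at the end of each block
```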

jdiaz4302 commented 1 year ago

Ah, okay. It sounds fairly inconsequential then from my outsider point of view.

I think NaN only at the end is slightly preferable. It probably has no impact on the current workflow, but I could see it being beneficial to autoregressive models, which might get NaN drivers for the lagged target during the "gap", and it would be an easier assumption to make when handling the data.

jds485 commented 1 year ago

Thanks for your perspectives, @jdiaz4302! I can implement option 2 for this PR.

janetrbarclay commented 1 year ago

An issue sounds good to me 🙂


@jds485 commented on this pull request.


In river_dl/preproc_utils.py (https://github.com/USGS-R/river-dl/pull/218#discussion_r1186249846):

```python
gap_ind = np.where(time_diff == t)[0]
# there can be multiple gaps of the same length
for i in gap_ind:
    # determine if the gap is longer than the sequence length
    date_before_gap = dataset[time_idx_name][i].values
    next_date = dataset[time_idx_name][i+1].values
    # gap length in the same unit as the timestep
    gap_length = int((next_date - date_before_gap)/timestep_1)
    if gap_length < seq_len:
        # I originally had this as a sys.exit, but did not want
        # to force this condition.
        print("The gap between this partition's continuous time periods is less than the sequence length")
    # get the start date indices. These are used to split the
    # dataset before creating batches
    continuous_start_inds.append(i+1)
```

would we ever want to fill short gaps and not restart the batch?

Maybe. The check for gap_length < seq_len could be used to identify which indices to fill instead of restarting the batch after the gap. seq_len could be changed to a function argument that specifies the maximum gap length that should be filled. I think that approach would also require an edit to the array to fill in the missing dates.
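
As a rough sketch of that alternative (assuming an xarray dataset with a daily timestep; the helper name is hypothetical), the missing dates could be filled by reindexing to a complete date range, which inserts NaN rows rather than restarting the batch:

```python
import pandas as pd

def fill_small_gaps(dataset, time_idx_name="date"):
    """Reindex an xarray Dataset to a complete daily range (hypothetical helper)."""
    full_range = pd.date_range(dataset[time_idx_name].values.min(),
                               dataset[time_idx_name].values.max(),
                               freq="D")
    # missing dates are inserted with NaN for all data variables
    return dataset.reindex({time_idx_name: full_range})
```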

I initially thought the example you gave could be solved by specifying the start dates of training and testing as Aug 1 and July 1, but that will not work after leap years, so filling in small gaps would be preferable.

I don't currently need that functionality, so I'd vote to convert to an issue.


janetrbarclay commented 1 year ago

I ran the updates for padding the discontinuous partitions; they ran fine, and the padding looks good. When I looked at the predicted temps, however, I saw that there were predictions for the added days. Looking at the updated predict function, I see there's an option to specify the last days, but I hadn't noticed that (and therefore hadn't updated my snakemake file to account for it), and as written it doesn't remove padded days in the middle of the partition (as with discontinuous times).

What would you think of addressing both of these issues (forgetting to specify the padding dates in multiple places, and latest_time not working with discontinuous partitions) by adding a bit to the preprocessing that tracks whether the data are padded or not? That is, additional arrays in prepped.npz, alongside "times_trn" or "ids_trn", holding a boolean that flags whether each entry is padded. Since prepped.npz (or whatever name is used for the prepped data file) is passed to predict_from_io_data, the padding flag would always be available, and its absence could be treated as no padding.
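
Something like the following could work (a sketch only; the "padded_trn" key name is hypothetical):

```python
import numpy as np

# save a boolean padding mask alongside the existing prepped arrays
np.savez("prepped.npz",
         ids_trn=np.array([10, 10, 10, 10]),
         padded_trn=np.array([False, False, True, True]))

prepped = np.load("prepped.npz")
if "padded_trn" in prepped.files:
    keep = ~prepped["padded_trn"]                        # drop padded entries
else:
    keep = np.ones(len(prepped["ids_trn"]), dtype=bool)  # absence = no padding
```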

jds485 commented 1 year ago

What would you think of addressing both of these issues (forgetting to specify the padding dates in multiple places, and latest_time not working with discontinuous partitions) by adding a bit to the preprocessing that tracks whether the data are padded or not?

That's a great suggestion! I can look into adding that.

jds485 commented 1 year ago

@janetrbarclay This is ready to be tested. I tried it with several discontinuous partitions and it worked for me. Below is a plot of one of the prediction timeseries; the horizontal line is where there are no data. I did this with latest_time = None (the default). I decided to leave that as a function argument in case someone wants to use it.

[figure: predicted timeseries; the horizontal line marks the period with no data]

And this is the corresponding indicator of padded data (1 = padded). The x-axis is the index in the array, not the date, so the full length of the data gap is not shown in this figure.

[figure: padded-data indicator (1 = padded) versus array index]