No pretraining exp - Githubissues

jdiaz4302 commented 2 years ago

⚠️ Definitely not merge worthy as-is; documenting experiment ⚠️

Regarding #38

What happens here:

Compare 5 model runs with process-based (PB) pretraining and 5 runs of using PB outputs as inputs
Training hyperparameters:
- Training partition is non-summer months between (1984, 2010) and (2015, 2020); this was an artifact of the existing config.yml
- Likewise, validation partition is all months between (1979, 1984), (2010, 2015), and (2020, 2021); the last interval appears to only have 2 days of data.
100 fine tuning epochs and either 0 or 200 pretraining epochs (manually changed in config.yml between runs if needed)
- If 0 pretraining epochs, concatenate the PB outputs to the x variable arrays prior to training and rewrite that file (prepped.npz -> prepped2.npz)
- After all of this, it occurred to me that the PB pretraining runs get 300 total training updates, while the PB inputs get 100, so I also generated one run with 300 fine tuning epochs and 0 pretraining epochs; I would've done more but Tallgrass had long queues.

Once a run is completed, I copy the output directory to a separate location and rerun (possibly after changing the pretraining epochs); e.g., cp -r output_DRB_offsetTest/ no_pretrain_1_300/.

After all the runs were done, I made some plots of the learning/training curves, validation set RMSE by month (adjusted to bin "months" by the 21st date of each month which better aligns with defining summer by equinox dates), time series, and I'm starting to look at the outputs vs input plots (will upload soon). I will add the notebooks to generate those plots shortly (need to clean them up).

Right now this looks very favorable for using PB pretraining (specifically validation set RMSE by month), but maybe too favorable? It would be nice to get some more eyes to spot any mistakes or oversights. One thing that is definitely strange is the all-over-the-place behavior of PB input models during validation set summers (see last two plots).

Plots:

TrainingCurves_PB_inputs_vs_pretraining

RMSE_by_month_PB_inputs_vs_pretraining (1)

PB_experiment_TimeSeries1566_300IDd

PB_experiment_TimeSeries1573_300IDd

jdiaz4302 commented 2 years ago

Also, the config.yml that I used was apparently inbetween comments. I did not make all the changes that the differences would suggest; I only changed the training partition start/end dates and the number of epochs.

aappling-usgs commented 2 years ago

@jsadler2 any chance you have time to review, or willingness to punt quickly to @SimonTopp if not? Jeremy pointed out the big concern - it's weird that the model predictions are all right on top of one another from 2009-09 to 2010-06 and then suddenly all over the map in the summer (and even a bit in the winter) once the validation period starts. Could be a bunch of things, including but not limited to:

this is actually how it works, and pretraining just knocks fine tuning out of the park, and the validation data is enough out-of-sample that variability in both the pretraining and the PB-as-input predictions is wider during val
something funny about how training and val partitions are getting defined and munged? see https://github.com/USGS-R/river-dl/issues/38#issuecomment-959673155
something else we haven't thought of yet

jsadler2 commented 2 years ago

I can look at this this afternoon

janetrbarclay commented 2 years ago

How has it been done previously to train on the winter data and test on the summer? I wonder if shortening the training data like this is causing weird jumps in the data (such that the model thinks that Sept 23 comes immediately after June 19). I have a 2 PM (eastern) meeting but can look a little more afterwards.

SimonTopp commented 2 years ago

I think Janet might be onto something. Here we're cutting up all our observations by start and end date. https://github.com/USGS-R/river-dl/blob/b716770cbbe84c32933f6505a07b118092c14dca/river_dl/preproc_utils.py#L46-L75

Then here we're taking the resulting sequences and slicing them by 365 which is assuming that continuous years are being passed to it. I think I walked through all this when I made issue #127.

https://github.com/USGS-R/river-dl/blob/b716770cbbe84c32933f6505a07b118092c14dca/river_dl/preproc_utils.py#L120-L139

I think what we want to be doing here is masking out the summer months rather than excluding them in the start/end dates. Maybe using the exclude file (might need some work after the big update a couple months ago) or by using reduce_training_data_continuous on line 348.

jdiaz4302 commented 2 years ago

Good catch!

Interestingly, this training set processing applies to both groups in the experiment, so I wonder what that implies.

That is, both groups are given a discontinuous sequence of inputs values, but only the PB inputs are seemingly affected while PB pretraining seem to handle it. It could be that PB pretraining de-emphasizes long-term information that is more likely to be from a discontinuous interval and/or further evidence that PB inputs is overfitting to the training data rather than learning valuable relationships (i.e., pretraining may facilitate being "right for the right reasons").

SimonTopp commented 2 years ago

I was just thinking about that @jdiaz4302 . I would expect the discontinuous sequences to decrease accuracy across the board, but we still see pretty decent results from the pre-training which is surprising. Am I right that the pre-training here has the some breaks as the training dataset? If so, it's bonkers that it can still learn annual signals.

jdiaz4302 commented 2 years ago

Am I right that the pre-training here has the some breaks as the training dataset?

Yep!

I'm assuming with a discontinuous 365-day sequence, you could often still reliably use (e.g.) the last 2 weeks of data and learn certain variable relationships with less focus on long-term temporal dynamics.

SimonTopp commented 2 years ago

you could often still reliably use (e.g.) the last 2 weeks of data and learn certain variable relationships with less focus on long-term temporal dynamics

I've found similar things with the GraphWaveNet model I've been developing, but that would imply that there's relatively little worthwhile information beyond ~1-2 months in a sequence. Might be interesting to run some tests with different sequence lengths and see at what point (how short) the model sees a drop in performance from loss of temporal info. Also, should have said this off the bat, very cool work and great visualizations man!

aappling-usgs commented 2 years ago

Am I right that the pre-training here has the some breaks as the training dataset?

Let's fix this if we can! We've hypothesized that a lot of the pretraining benefit is in getting to see predictions for conditions under which the model doesn't get see any observations (in this case, for summertimes). So maybe we can get the pretraining results even better, justifiably, by adding those back in.

both groups are given a discontinuous sequence of inputs values, but only the PB inputs are seemingly affected while PB pretraining seem to handle it. It could be that PB pretraining de-emphasizes long-term information that is more likely to be from a discontinuous interval

Any ideas on what the mechanism would be for this? I would think the PB inputs approach would have a better shot at learning this since it could learn to rely on the PB input more heavily (which does integrate memory across that missing period) whereas the pretraining approach has no such pseudo-memory to rely on.

I've seen a handful of (informal) HPO exercises looking at sequence length for such problems, and people generally settle on ~176 or 365 days. But I bet it varies by region, and I wonder if memory just isn't that important over the summer in these reaches b/c snow is long gone by June and drought is rarely severe. I wouldn't mind seeing this experiment done again for the DRB but also don't see it as a very high priority.

and/or further evidence that PB inputs is overfitting to the training data rather than learning valuable relationships (i.e., pretraining may facilitate being "right for the right reasons").

This explanation seems more plausible to me.

jdiaz4302 commented 2 years ago

@jsadler2

PB-input or PB-pretraining is triggered by the number of pt_epochs in the config.yml file. I manually edit this when I want to get runs for a different group (0 = PB-input; not 0 = pretraining). Here I concatenate the pretraining Ys (i.e., PB outputs) to the x_{partition}_obs if there is no pretraining. The data is wrote to a different .npz file that is used at later points (prepped2.npz).

SimonTopp commented 2 years ago

Let's fix this if we can! We've hypothesized that a lot of the pretraining benefit is in getting to see predictions for conditions under which the model doesn't get see any observations (in this case, for summertimes). So maybe we can get the pretraining results even better, justifiably, by adding those back in.

This is updated in my most recent PR (granted in kind of a bulky way), but you could pull it from there if you wanted. It basically just creates an x_pre_train and y_pre_train that include everything (all partitions and process outputs).

janetrbarclay commented 2 years ago

Since Jeff and Simon are deep into this review already, I'll follow the conversation and comment if I think of something, but mostly let them dig into the code.

jsadler2 commented 2 years ago

Mmk. I'm pretty sure I know what is going on here as far as why the predictions for PB-input is so wonky in the validation phase and not in the training. The Training Y values are normalized and the Validation ones are not. https://github.com/USGS-R/river-dl/blob/b716770cbbe84c32933f6505a07b118092c14dca/river_dl/preproc_utils.py#L584

aappling-usgs commented 2 years ago

Yikes, good catch! Is that only the case in this PR, or has it been that way in recent code as well?

jsadler2 commented 2 years ago

It's always been like that. There's been no need to normalize Y_tst and Y_val before. Do you think that we should normalize all Y? I think often no Y partition is normalized. It's just when doing the multi-variable predictions it is needed.

jsadler2 commented 2 years ago

@jdiaz4302 - I think if you scaled and centered y_val_pre and y_tst_pre in these lines, we'd see much more reasonable results.

jsadler2 commented 2 years ago

BTW - the only reason I thought of this as quickly as I did is because I did basically the same thing for an experiment for my multi-task paper 😄

I kept thinking "how are the training predictions okay and the val/test predictions terrible?!" and then I realized that the model was being trained on variables that were on a total different scale than what what I was giving them in the val/test conditions.

jdiaz4302 commented 2 years ago

😮 Haha, thanks @jsadler2. I agree there's generally no reason to scale the validation and testing set observations since you're usually going to just use those for evaluation at the scale-of-interest.

PB_outputs_scale

aappling-usgs commented 2 years ago

That's a great find, Jeff. Experience and team communication paying off big time here.

Combining the scaling fix (💯 Jeff!), discontinuity fix (💯 Janet and Simon!), and pretraining fix (💯 Simon), I feel a lot more optimistic that a next run of these models could give us a correct result. Sweet!

SimonTopp commented 2 years ago

It's always been like that. There's been no need to normalize Y_tst and Y_val before. Do you think that we should normalize all Y? I think often no Y partition is normalized. It's just when doing the multi-variable predictions it is needed.

If you do pull from the preproc_utils changes in the other PR, be aware that I changed it to scale y_pre_trn, y_trn, and y_val because I updated the training routine to validate at each epoch

aappling-usgs commented 2 years ago

Sounds like we need somebody to review & merge Simon's PR soon!

janetrbarclay commented 2 years ago

I can take a look at Simon's PR tomorrow.

jdiaz4302 commented 2 years ago

Results with scaling fix:

RMSE_by_month_PB_inputs_vs_pretraining_FIXEDSCALE PB_experiment_TimeSeries1566_300IDd_SCALEDFIXED

PB_experiment_TimeSeries1573_300IDd_SCALEFIXED

Regarding the discontinuity fix, I tried using a modified version reduce_training_data_continuous (to handle multiple intervals) in place of separate_trn_tst, but then there's nan values in the model's input data (i.e., x_{partition}) which mess up training. Can easily set those to a fixed value, but that's not good. May replace nan with some sort of average for that day of the year (maybe by river segment). To clarify, none of those discontinuity fixing ideas are implemented with these results.

jzwart commented 2 years ago

much different than before. To clarify, the heatmap is from all segments modeled, correct? not just the two segments with time series plots

jdiaz4302 commented 2 years ago

Yes, the heatmap is from all segments. Definitely different, but that should be expected given the previous results were generated with validation set variables that had the wrong scale

SimonTopp commented 2 years ago

Super interesting Jeremy. So basically, even though the PB input runs saw nothing that resembles summer, they're able to generalize to summer conditions better than models that were at least pre-trained with summer months included? Also, did we confirm our discontinuous training sequences?

jdiaz4302 commented 2 years ago

There's no pretraining of summer months included here yet - didn't want to duplicate efforts.

I posted a graph at https://github.com/USGS-R/river-dl/issues/127 showing that we do have discontinuous batches

jordansread commented 2 years ago

Interesting. Note in our lake modeling paper, we also looked at the impact of skipping exposure to summer conditions in pre-training:

jdiaz4302 commented 2 years ago

I messed up some of the version control associated with this, so to clarify, since the last update, I didn’t make any changes to:

Docker_README.md
Snakefile
Snakefile_gw
config.yml
config_gw.yml
river_dl/loss_functions.py

What I did make meaningful changes to were:

river_dl/preproc_utils.py
- To get continuous batches (with nan), I took Simon’s suggestion of using reduce_training_data_continuous. I used this function in place of separate_trn_tst. This means that all the partitions are the same size (full data range) but with varying numbers of nan (associated with other partitions).
- This definitely isn’t optimal; fairly certain it (expectedly) made training take longer.
- I only applied this to the Y observations so that the other vectors (y_pre_{partition} and x_{partition}) would still have valid/real value . This means there are x_trn observations from summer, but no associated y observations from summer to learn from - if I sat those x to nan there's a separate problem that the nan propagate to predictions and loss, then do we set to 0, interpolate, etc... Likewise, y_pre_trn keeps all its values, meaning that when I did pretraining, this included pretraining of summer data as well as validation and test set dates.
- I made some changes to filter_reduce_dates that allowed it to use a list of discontinuous dates and perform as I expected; I expected it to ultimately keep the dates it identified, but it was setting those to nan – hence the np.logical_not
river_dl/train.py
- Mostly the same as before, just scaling y_pre_val/tst

Figures

Figure showing the continuous batch of y_obs_trn with nan and y_pre_trn with data:

continuous_batch

Figure showing latest performance heatmap. I found it strange that performance took a strong hit from using the continuous batches with nan opposed to discontinuous batches with real values, but yeah... it could be misleading to provide the summertime X values and not provide a learning target for them; literally tells the optimization task, "You can do whatever here/in the summer as long as you get your act together by fall":

RMSE_by_month_PB_inputs_vs_pretraining (2)

Time series for reservoir impacted and not impacted stream:

PB_experiment_TimeSeries1566_NoDiscontAllPretrain

PB_experiment_TimeSeries1573_NoDiscontAllPretrain

I'll include these plots of input versus output as well (since I made them), but I didn't find them incredibly insightful (colors are the same; kinda interesting that it seems to taper the effect of higher PB values):

PB_inputs_response_plots (2)

I'm likely going to be helping more on the reservoir task starting next week, and like I said, this was not designed to merge with the existing codebase - more an exploratory tangent. Feel free to close out or maybe I will sometime next week when engagement is practically dead.

Also, thanks @jsadler2 for the better approach! I just didn't have time to learn it and get the results, but I will definitely be reviewing it before trying to take on a deeper snakemake-affliated task

jzwart commented 2 years ago

Interesting. I find it a bit surprising that both methods are overpredicting temperature by quite a bit during the summer periods even though they didn't see any forcing data in that range. Do you know if it's overpredicting at all segments?

jdiaz4302 commented 2 years ago

@jzwart here's a plot of all observations (x-axis) versus predictions (y-axis); these look approximately the same across models and runs. It does seem like that's the general trend. Made some low-effort quadrants via dashed lines to try to discern summer (upper left quadrant - above 25 Celsius). Solid line is 1:1

preds_vs_obs_PB_experiment

Seeing data adjacent to summer (when temperatures are changing faster) may suggest that summer will peak higher than it does (i.e., a sharper rather than rounder parabola)?

SimonTopp commented 2 years ago

I used this function in place of separate_trn_tst.

This seems like a relevant conversation to be had and maybe a good task to assign to someone for a new PR. We should probably make sure our pipeline is creating continuous sequences and has the flexibility mask out certain observations within those sequences for experiments like this. I know @jsadler2 mentioned he had some ideas for an upcoming PR, maybe we should put this on the to-do list?

Also, at least in these reaches it looks like our high temp bias is in the training predictions as well. I feel like that might be an indication that it could be something wrong with our data prep rather than an issue generalizing to the unseen summer temps. What do you think @jdiaz4302?

jdiaz4302 commented 2 years ago

The red box annotations are a good point. It's possible, but I don't necessarily suspect that something is wrong with the data prep.

In my experience, it's not uncommon for there to be under/overestimating at the low ends and over/underestimating at the high ends (note opposite order with respect to "/") because then performance at the central/median/mean values are still optimized. These plots do seem overly skewed to not performing at the high ends, but a density view of the plot seems to show that the low end is far more weighted (same plot as above, but plt.hexbin)

preds_vs_obs_PB_experiment_DensityEst

It's possible that additional variables/missing context could help reel in those low and high ends to the 1:1 line though. RMSE definitely optimizes with respect to the central values, but I've never had luck fixing this problem by using a different generic loss function.

jsadler2 commented 2 years ago

Something that I find really interesting, and @jdiaz4302 brought this up when he first posted this, is the shapes of the inputs vs the outputs for the temp-related inputs... especially the seg_tave_gw. They all have this unusual pattern where it's a little like the "quiet coyote" shape :) - It goes up kind of linearly at the bottom of the range of inputs, but then at the top it kind of splits where some of the points keep going up and some level off and sometimes go down.

I'm scratching my head. Why would the model learn that? Shouldn't increasing air temps (for example) always lead to higher water temps? There is the factor of the reservoirs, but, if I understand these sites correctly, only some of them are influenced by the reservoir. And why would sometimes they go up and sometimes they go level and sometimes they go down?

The gw one is especially interesting because there is also this vertical line when the input is zero. And to me that just seems really weird and like there is some kind of mistake in the model. But again, I'm scratching my head .... no ideas so far as what it might be.

jdiaz4302 commented 2 years ago

This seems like a relevant conversation to be had and maybe a good task to assign to someone for a new PR. We should probably make sure our pipeline is creating continuous sequences and has the flexibility mask out certain observations within those sequences for experiments like this. I know @jsadler2 mentioned he had some ideas for an upcoming PR, maybe we should put this on the to-do list?

Yeah, I think the implicit assumption for a standard LSTM/RNN architecture is that values are evenly spaced/sampled in time. There are variants (e.g., Time-LSTM, Time Aware LSTM) that explicitly require the time between values as an input and easily allow a discontinuous segment, but those are probably better suited for truly uneven time series rather than an actually even time series with big chunks missing - also, its effort into new models, so a masking approach seems like the most applicable in these cases.

@jsadler2 I think those plots are really cool for the same reasons 😄 . It could only be possible because of some interacting effects, ~and with the vertical lines, I'm assuming some interacting effect that's coupling air temperature with a binary variable.~

Less confidence in that ~last speculation~ because the more I think about that vertical bar the more my head kinda hurts too - "At (seemingly) exactly average air temperature, let's occasionally predict uncharacteristically low values"

jdiaz4302 commented 2 years ago

While finding a storage place for this work and testing the storage place, I found that the output versus input plots were specific to segment 1566 (reservoir-impacted); this is the same segment as the reservoir impacted time series throughout this PR (not labelled, but obvious by the spiky summer behavior in those time series plots).

Here are the corresponding output versus input plots for 1573 (the not-reservoir-impacted time series segment; I used 1566 and 1573 because they had tons of data). I think it's really interesting that these plots are a lot more straightforward - less of those "quiet coyote" shapes, as Jeff pointed out. Also, the relationship between prediction and PB output (last row) is a lot more monotonic but still noisy/spread (I believe we expect the PB model to be more reliable away from reservoirs), could be motivation for further refining the PB model for reservoirs.

PB_inputs_response_plots_1573

Here is the same plot for all segments. Overplotting doesn't really resolve even with decimal-point (e.g., 0.01) alpha and marker size. Generally the overall out vs in plots seem to more closely resemble the corresponding 1573 plot, probably because most segments aren't so directly impacted by reservoirs as 1566. There's definitely a lot more spread added when considering the whole data set though.

PB_inputs_response_plots_ALL

My plan is to close (and not merge) this PR by the end of the work day just to clean up and it will still be present in the "closed" tab for reference.

I've stored all the output directories generated by this experiment in the newly created pump project space that Alison announced under river-dl-PB_experiment. The first round of results are simply {no_}pretrain_{n}; the second round (only affecting PB inputs) is stored as no_pretrain_{n}_PretrainScaled; and this final round is stored as {no_}pretrain_{n}_NoDiscontAllPretrain

USGS-R / river-dl

No pretraining exp #142

Figures