Interpreting functional performance of estuary salinity model

USGS-R / drb-estuary-salinity-ml

Creative Commons Zero v1.0 Universal

0 stars 4 forks source link

Interpreting functional performance of estuary salinity model #121

Open galengorski opened 1 year ago

galengorski commented 1 year ago

I have been trying to produce some functional performance metrics for the Estuary Salinity ML model and comparing them to the COAWST (hydrodynamic model). I am trying to interpret them, but I would love to get thoughts and discussion. We have COAWST model runs for calendar years 2016, 2018, and 2019. We have run the ML model from 2001-2020, with 2001-2015 as training and 2016-2020 as testing periods. The questions that we are trying to address with functional performance are:

How well do the ML and COAWST models reproduce relationships between input variables and output variables?
Are there critical time scales that either model does/doesn't reproduces well?
Can we use IT metrics to help identify processes that the models are/aren't representing well?

The following are a couple of plots with explanation and questions for discussion. I would love to get your opinions or thoughts when you have a second @jds485 @salme146 @jdiaz4302 as I know you all have different expertise on this.

galengorski commented 1 year ago

trenton_associations_161819

This is a plot of the lagged associations of Trenton discharge with the location of the salt front. Top is pearson r, middle is mutual information, and bottom is transfer entropy. A couple of observations off the bat:

The ML model shows higher magnitude correlation and greater mutual information across all time lags, suggesting that it is relying too heavily on discharge to make predictions of the salt front. Although the magnitudes are off, the structure of the ML and COAWST curves are similar, which is encouraging since COAWST is representing actual physical flow.
Both COAWST and ML model don't reproduce the TE from discharge to salt front location from a time lag of 2-21 days. This suggests that the information that both models are passing between discharge and salt front location relies heavily on the past states of the salt front location. When those are conditioned for (in transfer entropy), then the models are not passing enough information from discharge to salt front location.
There is a conspicuous peak in the observed TE at Day 20. I'm worried that this is spurious for two reasons: 1) A peak in TE at day 20 shows up in the transfer entropy for all predictor variables (temperature, tidal, windspeed, and Schuylkill flow) which doesn't make a lot of physical sense, unless it is representing a single event and 2) When I do the same TE calculation for the full observed dataset 2001-2020, the 20 day time lag doesn't show a peak.

galengorski commented 1 year ago

Going off of the last point, this is the same plot but for windspeed:

windspeed_associations_161819

It shows the same peak at day 20 and it also shows minimal correlation and mutual information, but higher transfer entropy. I would initially expect low values for all three metrics as wind shouldn't have a huge effect on the movement of the salt front

jdiaz4302 commented 1 year ago

How well do the ML and COAWST models reproduce relationships between input variables and output variables?

If you would want to look at the change in predictions as you slide a variable across it's range of values, I briefly explored and found Individual Conditional Expectation (ICE) plots to be pretty easy and useful. They are described here and demo'd on one of our models here. Unlike your IT metrics, this provides no baseline or analog for the observations, but they can be compared to known or expected relationships. This method isn't really made for time series models, but it can be adapted by changing the variable's whole time series or the most relevant/recent values in the time series. Example of what you could generate (where black are different instances and yellow is the average) -

ice_example

I could help with that if it's wanted.

The ML model shows higher magnitude correlation and greater mutual information across all time lags, suggesting that it is relying too heavily on discharge to make predictions of the salt front

This suggests that the information that both models are passing between discharge and salt front location relies heavily on the past states of the salt front location. When those are conditioned for (in transfer entropy), then the models are not passing enough information from discharge to salt front location.

When it comes to offering advice for the ML models, these two points are very hard to reconcile. So, the model relies too heavily on discharge but it is also not using it enough when considering past state information. Is there more nuance that I'm missing here?

One question I have for intuition/clarification is how to interpret the transfer entropy plots increasing with lag - my seeming misconception of those plots is that they're saying (e.g.,) "discharge on day t-0 provides no information when considering the time series of salt front locations, while discharge on day t-20 provides a lot of information when considering the time series of salt front locations"

jdiaz4302 commented 1 year ago

relies heavily on the past states

If there is enough interest, I'm fairly certain we could change up the model code to specifically penalize reliance on the past states

galengorski commented 1 year ago

Thanks for the responses @jdiaz4302

If you would want to look at the change in predictions as you slide a variable across it's range of values, I briefly explored and found Individual Conditional Expectation (ICE)...

This is interesting, so the yellow line here would be comparable to the traditional partial dependence plots? Is that right?

When it comes to offering advice for the ML models, these two points are very hard to reconcile. So, the model relies too heavily on discharge but it is also not using it enough when considering past state information. Is there more nuance that I'm missing here?

This is hard for me to interpret. The difference between MI and TE is that MI represents the information that we learn about the salt front from discharge and the past states of the salt front, while TE conditions out the past salt front states. So maybe the ML model is assuming a relationship between discharge and salt front that is too consistent through time. I am having a hard time explaining that further though...

One question I have for intuition/clarification is how to interpret the transfer entropy plots increasing with lag - my seeming misconception of those plots is that they're saying (e.g.,) "discharge on day t-0 provides no information when considering the time series of salt front locations, while discharge on day t-20 provides a lot of information when considering the time series of salt front locations"

The TE interpretation should really start on day 1. The way I am thinking about it is, "given that we know where salt front was x days ago, what additional information is the discharge from x days ago giving us about today's salt front location." In that light, the general increase in TE from ~1-10 days makes sense, it's saying that the discharge from the last 1-10 days is giving additional information about where the salt front will be than the past salt front states.

jds485 commented 1 year ago

I'm fairly certain we could change up the model code to specifically penalize reliance on the past states

In this case, we would want the model to rely more on the past states, right? Galen: have you tried to include TE in the loss function? Could be an alternative to changing the model code.

So maybe the ML model is assuming a relationship between discharge and salt front that is too consistent through time

If that is the case, do you think it could contribute to the under/over prediction in the tails?

jdiaz4302 commented 1 year ago

This is interesting, so the yellow line here would be comparable to the traditional partial dependence plots? Is that right?

Yes, the PDP is the average of the ICE lines

In this case, we would want the model to rely more on the past states, right?

When reading "the information that both models are passing between discharge and salt front location relies heavily on the past states of the salt front location ... the models are not passing enough information from discharge to salt front location", my take was to rely less on past states and more on current information. But either way, regularization/penalization could be moved as needed.

This idea may be problematic though because the past LSTM states that we can rely more or less on (the h and c vectors) aren't really comparable to values in the TE equation - the previous observations (y at t-n). Because those states are further transformed before becoming the predicted y at t-n (by the dense layer) and those states also include the only information about the variable of interest (e.g., discharge) at earlier points, so it would be penalizing that as well. 🤕

Galen: have you tried to include TE in the loss function? Could be an alternative to changing the model code.

TE loss function would likely be the best direct way to improve TE (by definition of the training/optimization problem). You'll need to find or implement a custom TE loss function that is differentiable for torch or tf; does that sound practical? I don't know the TE equations and implementations well (and I stopped searching for torch implementations when I didn't see any obvious results and saw someone asking if you can take the derivative of a histogram 😅)

If you had a chunk of code you could point me to for those TE calculations, I could weigh in on how easy-gnarly that looks

galengorski commented 1 year ago

Galen: have you tried to include TE in the loss function? Could be an alternative to changing the model code.

TE loss function would likely be the best direct way to improve TE (by definition of the training/optimization problem). You'll need to find or implement a custom TE loss function that is differentiable for torch or tf; does that sound practical? I don't know the TE equations and implementations well (and I stopped searching for torch implementations when I didn't see any obvious results and saw someone asking if you can take the derivative of a histogram 😅)

This is something I have thought about, but I'm a little hesitant to invest too much in it until I can develop a good intuition/interpretation of what transfer entropy is telling us about the system. Adding it to the loss function would probably improve functional performance though. I haven't really dug into what it would take to implement in pytorch though. To calculate TE, you need to estimate pdfs of the two variables and calculate their joint probabilities. Here is a link to the code for calculating transfer entropy (and other IT metrics).