eco4cast / unconf-2023

Brainstorming repo to propose and discuss unconference project ideas!

Using existing forecasts as prior knowledge for studies at non-NEON sites #17


brendanwallison commented 1 year ago

One thing that excites me about continental-scale networks like NEON is the potential to inform very local studies. Particularly if there is a large community of practice generating a body of forecasts at NEON sites, it seems that a good way to add value to all of those efforts would be to develop examples showing how to leverage an existing forecasting model as an informed Bayesian prior for a local field study. Doing so would implicitly leverage NEON's big data in order to employ more complex ecological models than would normally be possible for a smaller-scale, data-limited study.

I'm personally interested in using some version of the ground beetle forecasts for this purpose, but really any of the forecasting challenges could make for a good case study.

mintzj commented 1 year ago

What about using local field data to update predictions from models using Bayesian updating? This would be a kind of post-hoc spatial update to existing maps, rather than retraining the model. Is anyone familiar with this kind of approach?
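
For a single map cell, the simplest version of this might look like a conjugate Normal update: treat the existing national prediction as the prior and the local field measurements as new data. A rough sketch (all numbers made up, and assuming both the map's uncertainty and the field measurement error can be summarized as standard deviations):

```python
import numpy as np

# Existing national map prediction for this cell, with its reported uncertainty
prior_mean, prior_sd = 42.0, 10.0          # hypothetical values

# Local field measurements of the same quantity, with an assumed measurement error
field_obs = np.array([30.5, 28.0, 33.2])   # hypothetical plot-level observations
obs_sd = 5.0

# Conjugate Normal-Normal update: precision-weighted average of prior and data
prior_prec = 1.0 / prior_sd**2
data_prec = len(field_obs) / obs_sd**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * field_obs.mean()) / post_prec
post_sd = (1.0 / post_prec) ** 0.5

print(f"updated cell estimate: {post_mean:.1f} +/- {post_sd:.1f}")
```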

We know all models aren't perfect, but what if the model you want to use is reliably imperfect in your specific region of interest (local bias)? Could you not correct the bias by using local estimates from field data? This is an especially big problem for maps trained at a national scale. They may be unbiased at the training (national) scale, but locally they can be wrong. Field samples could give land managers the power to correct for local bias using the field data they have.

I've been thinking about this with respect to NLCD, in particular the RCMAP rangeland fractional components (shrubs and grasses), but there are others like RAP, MODIS VCF fractional tree cover, or other national maps that could be targeted for local correction using NEON field data.

Testudinidude commented 1 year ago

While I am not necessarily familiar with Bayesian updating explicitly, that does seem like the right idea, Jeff. However, I would generally think that the larger issue with using national predictive frameworks to guide local studies lies in the uncertainties around operative spatial scales. I would imagine that some national predictive models might actually do a pretty decent job of predicting local abundance/occupancy (suggesting scalability of model outputs across spatial scales), but many won't. I would think that defining the spatial and temporal boundary conditions in which models perform best would be a worthy (and, perhaps, necessary) corollary to this.

brendanwallison commented 1 year ago

> What about using local field data to update predictions from models using Bayesian updating? This would be a kind of post-hoc spatial update to existing maps, rather than retraining the model. Is anyone familiar with this kind of approach?

This is very similar to what I had in mind. True Bayesian updating would be the most rigorous solution, but it has the drawback of depending on the existing forecast being Bayesian itself. Even worse (someone correct me if I'm wrong about this, as I'm not an expert): even the forecast model being Bayesian wouldn't be enough. Setting aside nice clean models with analytical posteriors, people generally approximate the posteriors of more complicated models through MCMC, and I'm not sure there's a clean way to do Bayesian updating on a stack of MCMC samples. In practice it seems like a problem better solved upstream. If it became standard practice for forecasters to build models with downstream Bayesian updating in mind, that would be nice. I'm not very familiar with what people do now, but it seems this would have other benefits, such as being able to continuously update forecasting models in an online fashion. Perhaps people are doing this already?
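
One approximate workaround for the "stack of MCMC samples" problem, just to have something concrete, is to summarize the upstream draws with a parametric density and carry that forward as the prior for the new model, e.g. by moment matching. A sketch with fake draws:

```python
import numpy as np
from scipy import stats

# Stand-in for posterior draws of one parameter from the upstream forecast model
mcmc_draws = np.random.default_rng(1).gamma(shape=4.0, scale=0.5, size=4000)

# Moment-matched Normal approximation, usable as a prior downstream...
approx_prior = stats.norm(loc=mcmc_draws.mean(), scale=mcmc_draws.std())

# ...or, for a strictly positive parameter, a lognormal fit may respect its support better
shape, loc, scale = stats.lognorm.fit(mcmc_draws, floc=0)
approx_prior_pos = stats.lognorm(shape, loc=loc, scale=scale)
```

Of course this throws away correlations between parameters and any non-Gaussian structure, which is part of why it feels like a problem better solved upstream.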

I also think @Testudinidude has a valid point: the national predictive framework, even if formulated perfectly for Bayesian updating, probably won't accurately capture the uncertainty of a local site or timescale. To pick an arbitrary example, let's say you care deeply about ground beetle abundances in an urban pocket park surrounded by development. The much larger amount of data from the NEON sites, with uncertainty estimated at those sites, would then overwhelm your hyper-local data in a way that would probably not be helpful.

What I was picturing with this suggestion is essentially Bayesian updating but with a subjective step. If we assume that we'll mostly have point estimates of the model parameters (or distributions that we do not fully trust), we can use our local domain knowledge and/or consult with site experts to formulate our uncertainty around these parameters. What probability distribution do they come from? How do we parameterize the prior? This is essentially what you do anyway each time you build a Bayesian model, including the step of consulting with experts. The only difference is that one of the experts we consult is the national predictive model. This also allows us to add additional predictors or change the model structure, since we are at heart building a new model. Another way of seeing this is to say that generic solutions are hard, so in the interest of getting something built we'll leave all that messiness to the experts in this last step.
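
As a toy illustration of that subjective step: take the national model's point estimates as prior centers, then let local experts decide how much to widen each one (everything below is hypothetical):

```python
from scipy import stats

# Hypothetical point estimates pulled from the national forecast model
national_estimates = {"intercept": 2.1, "temp_effect": 0.30}

# Priors for the new local model: centered on the national values, with spreads
# chosen by consulting site experts (e.g. less trust in the temperature effect
# transferring to an urban pocket park, so it gets a relatively wider prior)
priors = {
    "intercept":   stats.norm(loc=national_estimates["intercept"],   scale=1.0),
    "temp_effect": stats.norm(loc=national_estimates["temp_effect"], scale=0.5),
}

# Sanity check: what range does each prior actually imply?
for name, d in priors.items():
    lo, hi = d.ppf([0.025, 0.975])
    print(f"{name}: 95% prior interval ({lo:.2f}, {hi:.2f})")
```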

> We know all models aren't perfect, but what if the model you want to use is reliably imperfect in your specific region of interest (local bias)? Could you not correct the bias by using local estimates from field data? This is an especially big problem for maps trained at a national scale. They may be unbiased at the training (national) scale, but locally they can be wrong. Field samples could give land managers the power to correct for local bias using the field data they have.

Yes, absolutely. I've been suggesting that we update the parameters and essentially build a new model, but another route we could go is to leave the model parameters alone and simply perform a calibration step. This could be done in any number of ways, Bayesian or not. For example, you could build a time series of residual error and fit a Gaussian process to that error, as in https://onlinelibrary.wiley.com/doi/full/10.1111/ele.13728
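
A minimal sketch of that residual-calibration route, with simulated data and scikit-learn standing in for whatever the linked paper actually used: fit a Gaussian process to the forecast's historical errors, then subtract the predicted error from new forecasts.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.arange(52, dtype=float)[:, None]                         # weekly time index
residuals = np.sin(t[:, 0] / 8.0) + 0.1 * rng.normal(size=52)   # forecast minus observation (simulated)

# Smooth trend in the error plus observation noise
gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0) + WhiteKernel(), normalize_y=True)
gp.fit(t, residuals)

# Predicted error (and its uncertainty) at the next forecast times
t_next = np.array([[52.0], [53.0]])
bias_hat, bias_sd = gp.predict(t_next, return_std=True)
# corrected_forecast = raw_forecast - bias_hat, with bias_sd folded into the forecast uncertainty
```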

I think part of the iterative method of ecological forecasting is that people can also start to identify trends in what can drive the error, and then improve the models.

> I've been thinking about this with respect to NLCD, in particular the RCMAP rangeland fractional components (shrubs and grasses), but there are others like RAP, MODIS VCF fractional tree cover, or other national maps that could be targeted for local correction using NEON field data.

That is an interesting twist. As opposed to using NEON models to inform local models, NEON is the local data correcting the national model. That also seems worth pursuing.

> While I am not necessarily familiar with Bayesian updating explicitly, that does seem like the right idea, Jeff. However, I would generally think that the larger issue with using national predictive frameworks to guide local studies lies in the uncertainties around operative spatial scales. I would imagine that some national predictive models might actually do a pretty decent job of predicting local abundance/occupancy (suggesting scalability of model outputs across spatial scales), but many won't. I would think that defining the spatial and temporal boundary conditions in which models perform best would be a worthy (and, perhaps, necessary) corollary to this.

Agreed. I think discussions stemming from your suggested https://github.com/eco4cast/unconf-2023/issues/23 would be an invaluable complement.

I think the defining feature of this project is that we buckle down and build something, even if imperfect. It sounds like this is a relevant job that links many projects. I'm not an expert by any means, so I'm wondering what you all think would be the best way of doing that. I've hopefully done a decent job of communicating my initial mental picture, which is essentially to 1) identify an existing ground beetle forecasting model, 2) identify some ground beetle dataset at a non-NEON site, and 3) use the NEON forecast model to inform the priors of the model we build at the new site. However, I'm open to wildly different suggestions. The only defining characteristic of this project is that we have code and a new model by the end of it.
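
To make step 3 a bit more tangible, here is roughly the shape of the thing I'm imagining we'd build, using PyMC with entirely made-up numbers and a placeholder covariate; the real model structure would come out of steps 1 and 2:

```python
import numpy as np
import pymc as pm   # pip install pymc

rng = np.random.default_rng(42)
temperature = rng.normal(15.0, 3.0, size=20)   # placeholder local covariate
beetle_counts = rng.poisson(8, size=20)        # placeholder local trap counts

with pm.Model() as local_model:
    # Priors centered on (hypothetical) estimates from the NEON forecast model,
    # widened to reflect doubt about how well they transfer to the new site
    intercept = pm.Normal("intercept", mu=2.1, sigma=1.0)
    temp_effect = pm.Normal("temp_effect", mu=0.3, sigma=0.5)

    log_rate = intercept + temp_effect * (temperature - temperature.mean())
    pm.Poisson("counts", mu=pm.math.exp(log_rate), observed=beetle_counts)

    idata = pm.sample(1000, tune=1000)
```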

mintzj commented 1 year ago

I am on board with the objective to build something even if imperfect. A couple thoughts to add to the list:

One interpretation of your suggestion is to use an overall model that is flexible by site to fit a beetle distribution when we have a large amount of data. Then, when an individual wants to run a local study, they build off the distribution described by the larger model but tune it to a particular site. Maybe your larger model has a random effect for site; when you want to use it locally, you tune the random-effect distribution using local data to get the best local distribution, maybe by setting the mean of the distribution.
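
For the "set the mean of the random-effect distribution" part, the arithmetic is essentially the same conjugate update as the earlier single-cell sketch, applied to the site effect (all values hypothetical):

```python
import numpy as np

# From the large model: site effects ~ Normal(mu_site, tau_site)
mu_site, tau_site = 0.0, 0.8

# From a small local sample: rough estimate of the new site's effect and its standard error
local_effect_hat, local_se = 1.4, 0.6

# Precision-weighted combination: shrink the local estimate toward the overall distribution
w_prior, w_local = 1.0 / tau_site**2, 1.0 / local_se**2
tuned_mean = (w_prior * mu_site + w_local * local_effect_hat) / (w_prior + w_local)
tuned_sd = np.sqrt(1.0 / (w_prior + w_local))

print(f"tuned site effect: {tuned_mean:.2f} +/- {tuned_sd:.2f}")
```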

This has some similarities to the map-bias correction idea. Model-based maps often don't have uniformly random noise. When we cross-validate, we often treat the error as if it happens uniformly, because our points are very far apart, but up close there are patterns of bias. We know that for a statistical estimator, MSE = bias^2 + variance. Once the map is finalized, like any other estimator we can remove that bias, once we know what it is. Fortunately, bias is often correlated spatially, so using a sample of points (such as NEON field data) we can correct for it with a spatial model. This method could be applied to any spatial product, using a field sample from a local site to update its performance post hoc.
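
A sketch of that post-hoc spatial correction, with simulated plots and a Gaussian process standing in for whatever spatial model we'd actually pick:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(7)
plot_xy = rng.uniform(0, 10, size=(40, 2))    # field plot coordinates (simulated)
map_at_plots = rng.uniform(20, 60, size=40)   # map's predicted cover at those plots
truth_at_plots = map_at_plots - (2 + 0.5 * plot_xy[:, 0]) + rng.normal(0, 1, size=40)  # field truth with spatially varying bias

# Local bias where we have field data
bias = map_at_plots - truth_at_plots

# Spatially correlated bias surface estimated from the plot sample
gp = GaussianProcessRegressor(kernel=Matern(length_scale=3.0, nu=1.5) + WhiteKernel(), normalize_y=True)
gp.fit(plot_xy, bias)

# Predict the bias at any map cell and subtract it
grid_xy = np.array([[1.0, 2.0], [8.0, 9.0]])
bias_hat = gp.predict(grid_xy)
# corrected_map = map_values_at(grid_xy) - bias_hat   # map_values_at() is a placeholder
```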

Another thing I wanted to mention, though I am unsure of the connection yet, is that NEON reminds me of the control in a Before-After Control-Impact (BACI) study. I am only vaguely familiar with BACI designs, but I think they are useful for time series. By establishing what the overall trend and correlation structure are, can we then use that to better assess what is observed at experimental sites?