DOI-USGS / streamMetabolizer

streamMetabolizer uses inverse modeling to estimate aquatic metabolism (photosynthesis and respiration) from time series data on dissolved oxygen, water temperature, depth, and light.
http://usgs-r.github.io/streamMetabolizer/
Other
36 stars 22 forks source link

Signal strength, Error assumptions, and MCMC convergence #390

Open Tzu-Yao opened 4 years ago

Tzu-Yao commented 4 years ago

Hi,

I'm using State-Space Bayesian Partial Pooling approach for my model and I have a few questions:

  1. Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?

  2. When dealing with data with low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even less. I have tried completely disregarding process error (setting "err_proc_iid = FALSE" in mm_name) or modified "err_proc_iid_sigma_scale" in specs, but it turned out that process errors were either 0 or still large. I'm wondering if there's any way that I can modify the standard deviation of process error (err_proc_iid_sigma?) in my partial pooling model.

  3. To see if the chains in MCMC converged, I was looking for Rhat of standard deviation of K600 in the model outputs but failed to find it in get_fit (I did see Rhat for K600 mean and K600 predlog) or other functions. Does the model provide such information?

Thank you so much, Tzu-Yao

robohall commented 4 years ago

Comments below:

On May 28, 2020, at 2:28 PM, Tzu-Yao notifications@github.com<mailto:notifications@github.com> wrote:

Hi,

I'm using State-Space Bayesian Partial Pooling approach for my model and I have a few questions:

  1. Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?

It does not . R2 is difficult to calculate with multilevel models. Yes you could simply estimate the residual variance vs the model variance.

https://statmodeling.stat.columbia.edu/2019/04/23/r-squared-for-multilevel-models-2/

1.

  1. When dealing with data with low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even less. I have tried completely disregarding process error (setting "err_proc_iid = FALSE" in mm_name) or modified "err_proc_iid_sigma_scale" in specs, but it turned out that process errors were either 0 or still large. I'm wondering if there's any way that I can modify the standard deviation of process error (err_proc_iid_sigma?) in my partial pooling model.

Low diel change in O2 is deadly for simultaneously estimating gas exchange. That is because there needs to be lots of diel swing to the O2 to help define the rate of gas exchange. I do not know what you mean by modify the process error? Impose a different prior on it?

1.

  1. To see if the chains in MCMC converged, I was looking for Rhat of standard deviation of K600 in the model outputs but failed to find it in get_fit (I did see Rhat for K600 mean and K600 predlog) or other functions. Does the model provide such information?

The output will have as part a stanfit object and the rhats should be in there. You are correct to check for those for all of the parameters. http://usgs-r.github.io/streamMetabolizer/articles/models_bayes.html

and see the stan website for what is in a stanfit object. https://mc-stan.org

The math and reasoning behind Bayesian models are difficult (for me at least). This is the book that I use to both learn for myself and teach from. It is especially good with hierarchical models

https://www.amazon.com/Bayesian-Models-Statistical-Primer-Ecologists/dp/0691159289

1.

Thank you so much, Tzu-Yao

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/USGS-R/streamMetabolizer/issues/390, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC4CU5S3TO6ZDFW6DZZ3ASLRT3CODANCNFSM4NNM3TAA.

Tzu-Yao commented 4 years ago

Hi Bob,

  1. Does the model provide information of coefficient of determination, R2det? If it does, how can I access it?
  1. When dealing with data with low metabolism signal (low diel change in DO), I found that the process error terms made the DO fluctuation even less. I have tried completely disregarding process error (setting "err_proc_iid = FALSE" in mm_name) or modified "err_proc_iid_sigma_scale" in specs, but it turned out that process errors were either 0 or still large. I'm wondering if there's any way that I can modify the standard deviation of process error (err_proc_iid_sigma?) in my partial pooling model.

My model specifications: model_name b_Kb_oipi_tr_plrckm.stan
engine stan
split_dates FALSE
keep_mcmcs TRUE
keep_mcmc_data TRUE
day_start 5
day_end 29
day_tests full_day, even_timesteps, complete_data, pos_discharge
required_timestep NA
K600_lnQ_nodes_centers -10, -9.8, -9.6, -9.4, -9.2, -9, -8.8, -8.6, -8.4, -8.2, -8, -7.8, -7.6, -7.4, -7.2, -7, -6.8,... GPP_daily_mu 3.1
GPP_daily_lower -Inf
GPP_daily_sigma 6
ER_daily_mu -7.1
ER_daily_upper Inf
ER_daily_sigma 7.1
K600_lnQ_nodediffs_sdlog 0.1
K600_lnQ_nodes_meanlog 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788, 2.484906649788... K600_lnQ_nodes_sdlog 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32, 1.32... K600_daily_sigma_sigma 0.24
err_obs_iid_sigma_scale 0.03
err_proc_iid_sigma_scale 1e-05
params_in GPP_daily_mu, GPP_daily_lower, GPP_daily_sigma, ER_daily_mu, ER_daily_upper, ER_daily_sigma, K... params_out GPP, ER, GPP_daily, ER_daily, K600_daily, K600_daily_predlog, lnK600_lnQ_nodes, K600_daily_sig... n_chains 4
n_cores 4
burnin_steps 500
saved_steps 500
thin_steps 1
verbose FALSE

get_fit(mm) %>% lapply(names)

`$daily [1] "date" "GPP_mean" "GPP_se_mean" "GPP_sd"
[5] "GPP_2.5pct" "GPP_25pct" "GPP_50pct" "GPP_75pct"
[9] "GPP_97.5pct" "GPP_n_eff" "GPP_Rhat" "ER_mean"
[13] "ER_se_mean" "ER_sd" "ER_2.5pct" "ER_25pct"
[17] "ER_50pct" "ER_75pct" "ER_97.5pct" "ER_n_eff"
[21] "ER_Rhat" "GPP_daily_mean" "GPP_daily_se_mean" "GPP_daily_sd"
[25] "GPP_daily_2.5pct" "GPP_daily_25pct" "GPP_daily_50pct" "GPP_daily_75pct"
[29] "GPP_daily_97.5pct" "GPP_daily_n_eff" "GPP_daily_Rhat" "ER_daily_mean"
[33] "ER_daily_se_mean" "ER_daily_sd" "ER_daily_2.5pct" "ER_daily_25pct"
[37] "ER_daily_50pct" "ER_daily_75pct" "ER_daily_97.5pct" "ER_daily_n_eff"
[41] "ER_daily_Rhat" "K600_daily_mean" "K600_daily_se_mean" "K600_daily_sd"
[45] "K600_daily_2.5pct" "K600_daily_25pct" "K600_daily_50pct" "K600_daily_75pct"
[49] "K600_daily_97.5pct" "K600_daily_n_eff" "K600_daily_Rhat" "K600_daily_predlog_mean"
[53] "K600_daily_predlog_se_mean" "K600_daily_predlog_sd" "K600_daily_predlog_2.5pct" "K600_daily_predlog_25pct"
[57] "K600_daily_predlog_50pct" "K600_daily_predlog_75pct" "K600_daily_predlog_97.5pct" "K600_daily_predlog_n_eff"
[61] "K600_daily_predlog_Rhat" "valid_day" "warnings" "errors"

$inst [1] "date" "solar.time" "err_obs_iid_mean" "err_obs_iid_se_mean" "err_obs_iid_sd"
[6] "err_obs_iid_2.5pct" "err_obs_iid_25pct" "err_obs_iid_50pct" "err_obs_iid_75pct" "err_obs_iid_97.5pct" [11] "err_obs_iid_n_eff" "err_obs_iid_Rhat" "err_proc_iid_mean" "err_proc_iid_se_mean" "err_proc_iid_sd"
[16] "err_proc_iid_2.5pct" "err_proc_iid_25pct" "err_proc_iid_50pct" "err_proc_iid_75pct" "err_proc_iid_97.5pct" [21] "err_proc_iid_n_eff" "err_proc_iid_Rhat"

$overall [1] "date_index" "time_index" "index" "err_obs_iid_sigma_mean"
[5] "err_obs_iid_sigma_se_mean" "err_obs_iid_sigma_sd" "err_obs_iid_sigma_2.5pct" "err_obs_iid_sigma_25pct"
[9] "err_obs_iid_sigma_50pct" "err_obs_iid_sigma_75pct" "err_obs_iid_sigma_97.5pct" "err_obs_iid_sigma_n_eff"
[13] "err_obs_iid_sigma_Rhat" "err_proc_iid_sigma_mean" "err_proc_iid_sigma_se_mean" "err_proc_iid_sigma_sd"
[17] "err_proc_iid_sigma_2.5pct" "err_proc_iid_sigma_25pct" "err_proc_iid_sigma_50pct" "err_proc_iid_sigma_75pct"
[21] "err_proc_iid_sigma_97.5pct" "err_proc_iid_sigma_n_eff" "err_proc_iid_sigma_Rhat" "lpmean"
[25] "lp
se_mean" "lpsd" "lp2.5pct" "lp25pct"
[29] "lp
50pct" "lp75pct" "lp97.5pct" "lp_neff"
[33] "lp
Rhat"

$KQ_overall [1] "date_index" "time_index" "index" "K600_daily_sigma_mean"
[5] "K600_daily_sigma_se_mean" "K600_daily_sigma_sd" "K600_daily_sigma_2.5pct" "K600_daily_sigma_25pct"
[9] "K600_daily_sigma_50pct" "K600_daily_sigma_75pct" "K600_daily_sigma_97.5pct" "K600_daily_sigma_n_eff"
[13] "K600_daily_sigma_Rhat"

$KQ_binned [1] "date_index" "time_index" "index" "lnK600_lnQ_nodes_mean"
[5] "lnK600_lnQ_nodes_se_mean" "lnK600_lnQ_nodes_sd" "lnK600_lnQ_nodes_2.5pct" "lnK600_lnQ_nodes_25pct"
[9] "lnK600_lnQ_nodes_50pct" "lnK600_lnQ_nodes_75pct" "lnK600_lnQ_nodes_97.5pct" "lnK600_lnQ_nodes_n_eff"
[13] "lnK600_lnQ_nodes_Rhat"

$warnings NULL

$errors NULL `

Again, thanks so much, Tzu-Yao

aappling-usgs commented 4 years ago

Hi Tzu-Yao, take out the lapply(names) from your get_fit() command.

aappling-usgs commented 4 years ago

I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit()

aappling-usgs commented 4 years ago

err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g. GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.

Tzu-Yao commented 4 years ago

I think DO_R2 may indeed be what you're looking for - check out the formula at https://github.com/USGS-R/streamMetabolizer/blob/master/inst/models/b_np_oipi_tr_psrckm.stan#L142 and note that you probably need to add it to your params_out specification to be able to find it with get_fit()

Hi Alison,

That's exactly what I was looking for. Thanks! I tried to add "DO_R2" to the params_out in specs but only got warnings/errors and it couldn't run:

my specs after I added "DO_R2:"

bayes_specs$params_out [1] "GPP" "ER" "DO_R2" "GPP_daily" "ER_daily"
[6] "K600_daily" "K600_daily_predlog" "lnK600_lnQ_nodes" "K600_daily_sigma" "err_obs_iid_sigma" [11] "err_obs_iid" "err_proc_iid_sigma" "err_proc_iid"

The warnings/errors: $warnings [1] "some chains had errors; consider specifying chains = 1 to debug" [2] "Stan model 'b_Kb_oipi_tr_plrckm' does not contain samples." In Viewer it showed: "no parameter DO_R2; sampling not done"

Thank you, Tzu-Yao

aappling-usgs commented 4 years ago

Try updating the package with

remotes::install_github('USGS-R/streamMetabolizer')

(because DO_R2 is a relatively new parameter)

Tzu-Yao commented 4 years ago

Thanks Alison,

I updated the streaMetabolizer with remotes::install_github('appling/unitted') remotes::install_github("USGS-R/streamMetabolizer") and can now access DO_R2 info.

However, I realized that DO_R2 provided in the model is the R2 of DO observed relative to the modeled, instead of DO observed relative to the deterministic, R2det, as described in Appling, Hall, Yackulic& Arroita, 2018. The former takes into account both observation errors and process errors while the latter does not include process error correction and thus enable me to see how strong the signal is before corrected with process errors.

I wonder if the model also provides R2det if that makes sense?

Thank you so much, Tzu-Yao

aappling-usgs commented 4 years ago

Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.

Tzu-Yao commented 4 years ago

Thank you Alison and Bob. Your answers/comments are really helpful!

Tzu-Yao commented 4 years ago

err_proc_iid_sigma_scale is the only/primary lever for affecting the magnitude of the fitted process errors. I'd like @robohall 's thoughts on this too, but I think we see the models appear to ignore this parameter when the other parameters that can be adjusted (e.g. GPP_daily or Pmax/alpha, ER_daily, K600_daily, and err_obs_iid_sigma) are too tightly constrained, or when there are unmodeled processes or process dynamics occurring in the ecosystem (e.g., drifting clouds or spatial heterogeneity along the reach), such that there's just no better (higher likelihood) way for the model to reproduce the shape of the observed DO curve than to call it process error.

Hi Alison, I think I'm still a little confused about the process error manipulation. Like you said, it seems that the only parameter I can adjust in the model to change process error is err_proc_iid_sigma_scale. However I found that in Appling, Hall, Yackulic& Arroita, 2018 the experiment in 2.2.2 actually controlled the process error to be a constant and I wonder if I could do the same thing (i.e. let sigma of process error be a constant)? Thanks.

aappling-usgs commented 4 years ago

Hi Tzu-Yao, are you looking at this text? image If so, I think the needed clarification may be that we fixed process error (err_proc_iid_sigma) in the simulations that generated DO "data". We still fitted err_proc_iid_sigma when running the Bayesian models.

Tzu-Yao commented 4 years ago

Yeah, that was what I was looking at. Thank you for the clarification!

Tzu-Yao commented 4 years ago

Ah, thanks for the reminder - amazing how much I can forget after writing a whole paper on something! The predict_DO() function should return a data.frame with a column called DO.pure that is the predictions without process error correction. You can then compute R2 for DO.pure relative to DO.obs.

Hi @aappling-usgs, sorry it's been a while. I guess I'm still confused about this...

I used the predict_DO function and it only outputs DO.obs and DO.mod. I'm not sure but it seems that DO.pure is only provided in "sim model" when simulating data?

A part of the predict_DO output: date solar.time DO.obs DO.sat depth temp.water light discharge DO.mod 100 2017-07-01 2017-07-01 10:38:24 3.00 7.370137 0.8824491 31.79 763.6409 0.06887039 2.443049 101 2017-07-01 2017-07-01 11:08:24 3.34 7.337867 0.8844491 32.04 788.6277 0.07064655 2.532592 102 2017-07-01 2017-07-01 11:38:24 3.55 7.298487 0.8824491 32.35 802.4943 0.06766137 2.619626 103 2017-07-01 2017-07-01 12:08:24 3.60 7.292322 0.8824491 32.39 805.0035 0.06393055 2.702070 104 2017-07-01 2017-07-01 12:38:24 3.47 7.309420 0.8824491 32.24 796.1124 0.06732575 2.778523 105 2017-07-01 2017-07-01 13:08:24 3.67 7.286094 0.8824491 32.42 775.9731 0.06404350 2.847369

Also, while I was analyzing the errors in my model (on low signal and high noise data), I found that if I disregard process error (i.e. observation-error model, Fig. 2), the DO.mod actually fit the DO.obs better. In this case, can I say that observation-error model works better than the state-space model because the former is better at recognizing the DO signal? I think one way to see which is better is to look at if the process error term improves the autocorrelation of residuals (and this brings me to how I can access DO with and without process error correction)?

DO_plots_node0 2_nodediff0 1_1000burnin Fig. 1. A state-space model

DO_plots_node0 2_nodediff0 1_burnin1000_errproc0 Fig. 2. An observation-error model

Thank you, Tzu-Yao