DOI-USGS / streamMetabolizer

streamMetabolizer uses inverse modeling to estimate aquatic metabolism (photosynthesis and respiration) from time series data on dissolved oxygen, water temperature, depth, and light.
http://usgs-r.github.io/streamMetabolizer/
Other
35 stars 22 forks source link

Setting priors for K #373

Open samlbickley opened 5 years ago

samlbickley commented 5 years ago

I have previously measured reaeration rates that I would like to use as priors, but my problem is I don't know how to set that in the code. I have a k rate of 66/day. In issue #367, Bob Hall said to use the following specs:

K600_daily_meanlog_meanlog = 0 K600_daily_meanlog_sdlog =1

I don't have a background in Bayesian models, so I don't quite understand what these specs mean. In issue #367 , I believe the person had very low k600, so I interpret the specs given by Bob would limit the modeled k600 between 0 and 1. Is that what's going on?

Furthermore, what is the best way to convert k to k600? I assume that my measured reaeration rate isn't the standardized gas exchange rate and must be converted somehow.

tldr: I have measured reaeration rates that I would like to use as priors for K600 but don't know how to specify that properly in the code.

Thanks!

library(streamMetabolizer)

setwd("C:/Users/slb0035/Google Drive/School/Working/METABOLISM/streamMetabolizer/INPUT/")

data<-read.csv("sb4_spring2018.csv", stringsAsFactors = FALSE)
data$solar.time<- as.POSIXct(data$solar.time, tz='America/New_York')
lubridate::tz(data$solar.time)
data$solar.time <- streamMetabolizer::calc_solar_time(data$solar.time, longitude=-85) #FBMI longitude

calc_light(data$solar.time, 32, -85)
library(unitted)
data$light <- calc_light(u(data$solar.time), u(32, 'degN'), u(-85, 'degE'), u(3004, 'umol m^-2 s^-1'), attach.units=TRUE)

bayes_name <- mm_name(type='bayes', pool_K600="normal", err_obs_iid=TRUE, err_proc_iid=TRUE)
bayes_specs <- specs(bayes_name, K600_daily_meanlog_meanlog=0, K600_daily_meanlog_sdlog=66) 

mm <- metab(bayes_specs, data=data)
output<-as.data.frame(get_params(mm))
write.csv(output, 'sb4_spring2018-output_0-6_kmany.csv') #whatever filename is

pdf('plot_DO_preds_sb4_spring2018output_0-2_kmany.pdf')
plot_DO_preds(mm)
dev.off()

pdf('plot_metab_preds_sb4_spring2018output_0-2_kmany.pdf')
plot_metab_preds(mm)
dev.off()

mcmc <- get_mcmc(mm)
pdf('mcmc_sb4_spring2018output_0-2_kmany.pdf')
rstan::traceplot(mcmc, pars='K600_daily', nrow=3)
dev.off()

Session info ---------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32             
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Chicago             
 date     2018-10-01                  

Packages -------------------------------------------------------------------------------------------------
 package           * version   date       source        
 assertthat          0.2.0     2017-04-11 CRAN (R 3.5.1)
 base              * 3.5.1     2018-07-02 local         
 bindr               0.1.1     2018-03-13 CRAN (R 3.5.1)
 bindrcpp          * 0.2.2     2018-03-29 CRAN (R 3.5.1)
 bitops              1.0-6     2013-08-17 CRAN (R 3.5.0)
 codetools           0.2-15    2016-10-05 CRAN (R 3.5.1)
 colorspace          1.3-2     2016-12-14 CRAN (R 3.5.1)
 compiler            3.5.1     2018-07-02 local         
 crayon              1.3.4     2017-09-16 CRAN (R 3.5.1)
 curl                3.2       2018-03-28 CRAN (R 3.5.1)
 datasets          * 3.5.1     2018-07-02 local         
 deSolve             1.21      2018-05-09 CRAN (R 3.5.0)
 devtools            1.13.6    2018-06-27 CRAN (R 3.5.1)
 digest              0.6.17    2018-09-12 CRAN (R 3.5.1)
 dplyr               0.7.6     2018-06-29 CRAN (R 3.5.1)
 ggplot2           * 3.0.0     2018-07-03 CRAN (R 3.5.1)
 glue                1.3.0     2018-07-17 CRAN (R 3.5.1)
 graphics          * 3.5.1     2018-07-02 local         
 grDevices         * 3.5.1     2018-07-02 local         
 grid                3.5.1     2018-07-02 local         
 gridExtra           2.3       2017-09-09 CRAN (R 3.5.1)
 gtable              0.2.0     2016-02-26 CRAN (R 3.5.1)
 httr                1.3.1     2017-08-20 CRAN (R 3.5.1)
 inline              0.3.15    2018-05-18 CRAN (R 3.5.1)
 jsonlite            1.5       2017-06-01 CRAN (R 3.5.1)
 labeling            0.3       2014-08-23 CRAN (R 3.5.0)
 LakeMetabolizer     1.5.0     2016-06-23 CRAN (R 3.5.1)
 lazyeval            0.2.1     2017-10-29 CRAN (R 3.5.1)
 lubridate           1.7.4     2018-04-11 CRAN (R 3.5.1)
 magrittr            1.5       2014-11-22 CRAN (R 3.5.1)
 memoise             1.1.0     2017-04-21 CRAN (R 3.5.1)
 methods           * 3.5.1     2018-07-02 local         
 munsell             0.5.0     2018-06-12 CRAN (R 3.5.1)
 parallel            3.5.1     2018-07-02 local         
 pillar              1.3.0     2018-07-14 CRAN (R 3.5.1)
 pkgconfig           2.0.2     2018-08-16 CRAN (R 3.5.1)
 plyr                1.8.4     2016-06-08 CRAN (R 3.5.1)
 purrr               0.2.5     2018-05-29 CRAN (R 3.5.1)
 R6                  2.2.2     2017-06-17 CRAN (R 3.5.1)
 Rcpp                0.12.18   2018-07-23 CRAN (R 3.5.1)
 RCurl               1.95-4.11 2018-07-15 CRAN (R 3.5.1)
 reshape2            1.4.3     2017-12-11 CRAN (R 3.5.1)
 rLakeAnalyzer       1.11.4    2018-03-14 CRAN (R 3.5.1)
 rlang               0.2.2     2018-08-16 CRAN (R 3.5.1)
 rstan             * 2.17.3    2018-01-20 CRAN (R 3.5.1)
 rstudioapi          0.7       2017-09-07 CRAN (R 3.5.1)
 scales              1.0.0     2018-08-09 CRAN (R 3.5.1)
 StanHeaders       * 2.17.2    2018-01-20 CRAN (R 3.5.1)
 stats             * 3.5.1     2018-07-02 local         
 stats4              3.5.1     2018-07-02 local         
 streamMetabolizer * 0.10.9    2018-05-09 local         
 stringi             1.1.7     2018-03-12 CRAN (R 3.5.0)
 stringr             1.3.1     2018-05-10 CRAN (R 3.5.1)
 tibble              1.4.2     2018-01-22 CRAN (R 3.5.1)
 tidyr               0.8.1     2018-05-18 CRAN (R 3.5.1)
 tidyselect          0.2.4     2018-02-26 CRAN (R 3.5.1)
 tools               3.5.1     2018-07-02 local         
 unitted           * 0.2.9     2018-05-09 local         
 utils             * 3.5.1     2018-07-02 local         
 withr               2.1.2     2018-03-15 CRAN (R 3.5.1)
 XML                 3.98-1.16 2018-08-19 CRAN (R 3.5.1)
 yaml                2.2.0     2018-07-25 CRAN (R 3.5.1)
aappling-usgs commented 5 years ago

See this issue + answer for a way to fix K to a single number. Basically, you should pick a distribution for K600_daily_meanlog that is very narrow (tiny meanlog_sdlog) and centered (meanlog_meanlog) on 66/d. Definitions of K600_daily_meanlog_meanlog and K600_daily_meanlog_sdlog are available at ?specs (online at https://rdrr.io/github/USGS-R/streamMetabolizer/man/specs.html)

A useful discussion of the relationships among various forms of k, k2, K, K600, etc. can be found in Raymond et al. 2012 (https://doi.org/10.1215/21573689-1597669).

samlbickley commented 5 years ago

Thank you, Allison! That cleared things up for me.

What if I don't want to set K to a single number, but still used previously measured values to estimate K? Is that possible or is that exactly what I've done?

aappling-usgs commented 5 years ago

Hi Sam, it's a bit tricker to set K to multiple numbers, but it can be done. If you have distinct Q on each day or are willing to spoof your Q inputs, you can set up a very strong prior with a multi-node K~Q relationship relating your measured K values to the daily mean Q values on those days. This would mean using a Kb model, i.e., pool_K600='binned'. Spoofing Q should be fine because Q, unlike depth, is only used to as a predictor of K if it's used at all.

samlbickley commented 5 years ago

Thank you, Alison. You've been very helpful.

samlbickley commented 5 years ago

I think I closed this issue prematurely.

I am ignorant of what a multi-node relationship is. There doesn't appear to be any type of relationship between my K and Q values, however. Without a strong relationship between the two, is it possible to develop this multi-node relationship? Are there any example codes of setting up the K ~ Q relationship?

aappling-usgs commented 5 years ago

By multi-node relationship I meant a piecewise linear relationship between K and a variable named Q or discharge, which is what's provided by a b_Kb*_... model (pool_K600='binned*' in mm_name()). See https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017JG004140 figure 5b for some examples.

Rather than passing in actual discharge values, you can pass in arbitrary values. For example, you can set Q to 1 for day 1, 2 for day 2, etc. in your data_daily that you're passing to metab().

You can specify as many values of K600_lnQ_nodes_centers as you want in specs(). Set one value for every value of Q - in the above example, this would be log(1:n) where n is the number of days you're modeling.

You can set a tight prior on the value of K600 at each lnQ_node_center. Set K600_lnQ_nodes_meanlog to a vector of values as long as K600_lnQ_nodes_centers where the values are the logs of your measured/estimated values of K600 on each day.

Set a very small value for K600_lnQ_nodes_sdlog to encourage the model to stick tight to the values of K600_lnQ_nodes_meanlog you set.

You might need to relax K600_lnQ_nodediffs_sdlog (make it a higher value); this could involve some experimentation to see if it matters, or you could just make it really high from the outset.

Use pool_K600='binned_sdzero' if you want to [pretty nearly] fix the daily K600 values at the values you provide.

See ?specs for more information on each of the above specifications/parameters.

I've never done this before (my focus has always been on learning rather than fixing K) and don't have time to work up example code, but if you can get the above instructions to work (I really think they will), please do share your code here so that others can benefit. Thanks, Alison

samlbickley commented 5 years ago

I've attached my code, lnK and daily_Q files, out model outputs.

First, thank you for you help. I was able to get the model to run, but the estimated DO is quite wonky and I'm getting negative GPP estimates.

I've got low-GPP sites, and the model gives negative GPP for nearly all of my streams. I was told by AJ Reisinger that negative GPP on all dates probably indicates ER estimates are not reliable and that there's just not enough GPP to model metabolism. We have measured K values from previous work, so I fixed K600 to the mean of measured K600 (thank you for your help!) and was able to get positive GPP and ER rates that are within the range previously measured. If I've correctly followed your instructions and am still getting negative GPP and basically unmodellable days, should I just stick with my fixed k models, which is giving me reasonable estimates?

library(streamMetabolizer)

setwd("C:/Users/sam/Google Drive/School/Working/METABOLISM/streamMetabolizer/INPUT/")

data<-read.csv("sb4_spring2018.csv", stringsAsFactors = FALSE)
data$solar.time<- as.POSIXct(data$solar.time, tz='America/New_York')
lubridate::tz(data$solar.time)
data$solar.time <- streamMetabolizer::calc_solar_time(data$solar.time, longitude=-85) #FBMI longitude

calc_light(data$solar.time, 32, -85)
library(unitted)
data$light <- calc_light(u(data$solar.time), u(32, 'degN'), u(-85, 'degE'), u(3004, 'umol m^-2 s^-1'), attach.units=TRUE)

Q_daily<-read.csv("Q_daily.csv", stringsAsFactors = FALSE)
#Rather than passing in actual discharge values, you can pass in arbitrary values. 
#For example, you can set Q to 1 for day 1, 2 for day 2, etc. in your data_daily that you're passing to metab().
Q_daily$date <- as.Date(Q_daily$date,format="%Y-%m-%d")

lnk_input<-read.csv("sb4_lnK.csv", stringsAsFactors = FALSE)
#You can set a tight prior on the value of K600 at each lnQ_node_center. 
#Set K600_lnQ_nodes_meanlog to a vector of values as long as K600_lnQ_nodes_centers 
#where the values are the logs of your measured/estimated values of K600 on each day.
lnk<-lnk_input[,'HEADER']
#make dataframe a vector

bayes_name <- mm_name(type='bayes', pool_K600="binned", err_obs_iid=TRUE, err_proc_iid=TRUE)
bayes_specs <- specs(bayes_name, K600_lnQ_nodes_centers = log(1:10), K600_lnQ_nodes_meanlog = lnK, K600_lnQ_nodes_sdlog = 0.00001, K600_lnQ_nodediffs_sdlog = 1000) 

mm <- metab(bayes_specs, data=data, data_daily=Q_daily)

Priors and outputs

aappling-usgs commented 5 years ago

Thanks for sharing this code, Sam.

I'm only seeing 9 values in sb4_lnK.csv, 8 of which get read by read.csv as non-header rows, but K600_lnQ_nodes_centers has length 10...is it possible that file is out of date relative to your code, or vice versa?

The K600.daily predictions don't look like a great match to exp(lnK), though maybe it's because I'm not looking at the right lnK values...is that what you found, too? That said, the match seems close enough that I'm surprised that you're getting negative GPP with this model but positive GPP (significantly above 0?) with a Bayesian model. Hmm.

If by "my fixed K models" you mean 2-parameter MLE models, I think it's plausible that those would work better for you. A major advantage of the Bayesian models is usually their ability to help you estimate K, so if you already have those numbers, the advantages relative to MLE models drop off substantially. (I do still expect some advantage because the Bayesian models can be state-space, including both process error and observation error, but K may well be the bigger issue for your data.) I would just caution against basing that decision entirely on whether your GPP values are coming out as you hope or not...that doesn't come across as especially sound science, and as AJ said, you need some signal (GPP relative to K) to pull truth out of a timeseries. If you have very high confidence in your outside estimates of K600, that's a better reason to use a fixed-K model than that the GPP is coming out positive.

aappling-usgs commented 5 years ago

Also, are you sure K is as high as you say, and in the right units? Those DO predictions are abysmal, and I'm surprised the model couldn't at least find a better fit to the data even with high uncertainty in estimated GPP and ER values...

Also, you may need to run these Bayesian models for a longer warmup period; it looks from your traceplots like they don't all converge until later in the saved-steps period.

samlbickley commented 5 years ago

1) There are 9 measured K values and I'm attempting to model 10 days of metabolism, so I assume this is the reason for the discrepancy. I have adjusted the number of modeled days to reflect the number of measured K values.

2) Could you explain the importance of exp(lnK)? I don't understand your comment.

3) My GPP estimates are not significantly above zero, The average GPP when I fixed K600 to a mean of measure K600 was 0.120825. I agree with your statement about sound science and not basing my decision on positive GPP alone, especially with GPP values so low.

4) My "fixed k model" is a Bayesian model. Should I run the "fixed" model as an MLE model? The "fixed k" model I ran is the following code:


library(streamMetabolizer)

setwd("C:/Users/slb0035/Google Drive/School/Working/METABOLISM/streamMetabolizer/INPUT/")

data<-read.csv("sb4_spring2018.csv", stringsAsFactors = FALSE)
data$solar.time<- as.POSIXct(data$solar.time, tz='America/New_York')
lubridate::tz(data$solar.time)
data$solar.time <- streamMetabolizer::calc_solar_time(data$solar.time, longitude=-85) #FBMI longitude

calc_light(data$solar.time, 32, -85)
library(unitted)
data$light <- calc_light(u(data$solar.time), u(32, 'degN'), u(-85, 'degE'), u(3004, 'umol m^-2 s^-1'), attach.units=TRUE)

bayes_name <- mm_name(type='bayes', pool_K600="normal", err_obs_iid=TRUE, err_proc_iid=TRUE)
bayes_specs <- specs(bayes_name, K600_daily_meanlog_meanlog=log(3.477262235), K600_daily_meanlog_sdlog=0.00001) 
#the above sets k600 to fixed value and sdlog keeps variation small

mm <- metab(bayes_specs, data=data)

5) I'm confident in my measured rates of reaeration rates, but I haven't had anyone check my calculations yet . Here are my K to K600 calculations, based on Raymond et al. 2012 and the formulas referenced therein.

6) I doubled burn-in and saved-steps and got better MCMC convergence, but the DO predictions are just as terrible.

How you would you recommend going forward with such low-productivity streams? With GPP so close to zero (both positive and negative), can I say anything about metabolism in these systems with any confidence?