DOI-USGS / streamMetabolizer

streamMetabolizer uses inverse modeling to estimate aquatic metabolism (photosynthesis and respiration) from time series data on dissolved oxygen, water temperature, depth, and light.
http://usgs-r.github.io/streamMetabolizer/
Other
37 stars 22 forks source link

KvQ: use log-log consistently throughout package #276

Closed aappling-usgs closed 7 years ago

aappling-usgs commented 8 years ago

We have the choice of relating K to Q as K ~ Q, ln(K) ~ ln(Q), K ~ ln(Q), or something else. We know that K600 is always >0 and can be in the tens or hundreds, and preliminary model runs with ln(K) ~ ln(Q) haven't looked bad, and @cyack's models use ln(K) ~ ln(Q). So at least for now, I'd like to use ln(K) ~ ln(Q), and I'd like to use them in all models for consistency across the package. This requires:

Caveats

I have some hesitations about going log-log:

Precedents

Fig 3 of Hall et al. 2012 has K600 (d-1) vs discharge (m3 s-1) on linear x and y axes, but this is just for a sample with narrow range for a single river.

Table 2 of Raymond et al. 2012 has k600 (m d-1) as functions of intercepts plus [discharge (m3 s-1) raised to small exponents] multiplied by velocity, slope, and/or depth also raised to exponents.

For lakes, Figs 2 & 3 of Vachon and Prairie 2013 have k600 (cm h^-1) as a linear function of wind speed (m s^-1) and as a linear function of log maximum wave height (m). they cite 6 other papers that also use linear functions to relate k600 to wind speed. Different systems, yes, but again we have k600 (m d-1) showing distributions that aren't being logged before modeling.

For large rivers and estuaries, Fig 2 of Raymond and Cole 2001 relates k600 (cm h^-1) to wind speed (m s^-1) as log(k600) ~ a + b*U.

@cyack suggested lognormal distributions for K within discharge bins, where the Q bins are partitioned on a log scale, with each bin's K also lognormally distributed relative to the previous bin's K #224.

streamMetabolizer's metab_Kmodel uses log-log relationships by default: https://github.com/USGS-R/streamMetabolizer/blob/develop/R/specs.R#L334 . One benefit is that this relationship ensures K is positive, though for Stan models it's possible to ensure that in other ways, e.g., by declaring K600_daily to have a lower bound of 0.

Citations

Hall, R. O., T. A. Kennedy, and E. J. Rosi-Marshall. 2012. Air–water oxygen exchange in a large whitewater river. Limnology and Oceanography: Fluids and Environments 2:1–11.

Raymond, P. A., and J. J. Cole. 2001. Gas Exchange in Rivers and Estuaries: Choosing a Gas Transfer Velocity. Estuaries 24:312–317.

Raymond, P. A., C. J. Zappa, D. Butman, T. L. Bott, J. Potter, P. Mulholland, A. E. Laursen, W. H. McDowell, and D. Newbold. 2012. Scaling the gas transfer velocity and hydraulic geometry in streams and small rivers. Limnology and Oceanography: Fluids and Environments 2:41–53.

Vachon, D., and Y. T. Prairie. 2013. The ecosystem size and shape dependence of gas transfer velocity versus wind speed relationships in lakes. Canadian Journal of Fisheries and Aquatic Sciences 70:1757–1764.

aappling-usgs commented 8 years ago

How many of the distributions involved should be lognormal? For example, in Kn models, should K600_daily_mu be lognormally distributed with K600_daily_mu_meanlog, K600_daily_mu_sdlog, AND should K600_daily be lognormally distributed around K600_daily_mu? Or just one? In Kl models, how should the intercept and slope be distributed?

Bringing some data to this, using unpooled MLE estimates of K600 for a random sampling of 20 of the powstreams sites.

library(mda.streams)
library(dplyr)

login_sb()
mmrs <- list_metab_runs() # use this to choose the metab_run to use in list_metab_models below
mms <- parse_metab_model_name(list_metab_models("151125 0.0.13 PRK_initial")) %>% tibble::rownames_to_column('name')
qs <- list_sites('dischdaily_calcDMean')
goodmms <- filter(mms, site %in% qs)

# download and package 20 sites worth of K estimates and Q data
mmobjs <- lapply(sample(goodmms$name, 20), get_metab_model, update_sb = F)
mmdf <- lapply(mmobjs, function(mm) {
  Qs <- get_ts(c('sitedate_calcLon','dischdaily_calcDMean'), mm@info$config$site) %>% v() %>%
    select(date=sitedate, dischdaily)
  v(mm@fit) %>%
    select(date, K600) %>%
    mutate(site=mm@info$config$site) %>%
    left_join(Qs, by='date')
}) %>% bind_rows()

# normal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + xlim(0,100) + guides(fill=FALSE, color=FALSE)

image

# lognormal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + scale_x_log10() + guides(fill=FALSE, color=FALSE)

image

Within sites, from these plots I'm thinking we could reasonably go either way: the values I probably believe are plausibly normal, while when you include the high (outlier?) values the distributions look plausibly lognormal.

And across sites?

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))

> mmstats
# A tibble: 20 × 9
            site   K600_mean      K600_sd K600_meanlog K600_sdlog    shapiro_p shapiro_p_log       skew   skew_log
           <chr>       <dbl>        <dbl>        <dbl>      <dbl>        <dbl>         <dbl>      <dbl>      <dbl>
1  nwis_01465798    7.198422     3.635844    1.8902919  0.3891734 6.071134e-12  1.227479e-04  3.3942038  0.6028886
2  nwis_01567000    8.182843     4.580731    1.7330760  1.3464897 2.387805e-01  2.500129e-05  0.5045585 -2.8620618
3  nwis_01645704   28.235169   283.970179    2.1160539  0.8291748 1.806542e-68  6.043007e-44 23.4001159  2.3439929
4  nwis_01646000   27.763019   208.421871    2.0116432  0.8945612 1.371768e-51  1.809392e-33 16.8783864  2.8305405
5  nwis_02156500  182.478068  4554.267582    1.7065460  1.0452377 1.238946e-79  9.788410e-53 27.3355427 -0.5406503
6  nwis_02336410   35.651939   335.589733    2.0858670  0.9103225 7.466696e-74  7.891511e-51 20.4001725  2.3595593
7  nwis_03539778  313.003488  6383.821676    2.0255417  1.2863783 4.902065e-43  1.079111e-16 21.4687177  0.7552151
8  nwis_04085108  117.249270   526.769261    1.3458049  2.0521171 7.992872e-21  2.382718e-11  6.2568599  1.5946011
9  nwis_04119400   30.590006   673.183472    0.7518339  1.1555543 1.391868e-54  6.895044e-23 28.4898009  0.9262468
10 nwis_04121944   20.890242   364.332030    2.0945655  0.5689432 2.125662e-74  2.028055e-53 40.5541868  2.2284317
11 nwis_04124000    5.000193    52.990446    0.5104455  0.9884796 1.398933e-73  2.176552e-43 21.0873063 -0.2255121
12 nwis_04176500   25.232869   156.022403    2.3627595  1.0241817 1.094426e-58  2.510850e-23 18.5562361  0.1564391
13 nwis_04200500  178.584426  4345.392044    2.3723908  1.2118335 1.208287e-58  3.046893e-36 31.4260765 -0.8737954
14 nwis_04208000  221.752804  4057.181260    1.2866989  1.3089821 8.547916e-58  7.651414e-39 26.9860100  3.2115262
15 nwis_05435950   11.713657   161.328172    1.7622591  0.5301671 2.058073e-69  1.712207e-46 32.4087404  1.2260452
16 nwis_06893970  509.131063 11718.171833    1.7755621  1.5579263 3.651980e-48  6.420996e-23 24.8492486  1.3568746
17 nwis_07061270   48.934465   328.175577    2.8060060  0.9313518 1.032592e-75  2.273741e-43 16.6282929  1.3638972
18 nwis_07241000 1735.503764 15041.843264    3.0261689  2.0386724 1.246902e-37  6.589762e-15 10.4683673  1.4776155
19 nwis_08374550    6.251070    43.870864    1.1445970  0.7040463 4.092336e-46  1.091311e-26 17.9850127  0.6775595
20 nwis_08446500    5.933354    28.079614    1.4747791  0.5463225 2.581376e-57  5.252483e-25 29.8624879  1.2086812

# a little more on within-site normality: usually non-normal by the shapiro-wilk normality test.
# if you believe that all the MLE-based K estimates are believable, of course...which you don't...
> table(mmstats$shapiro_p < 0.05)
##FALSE  TRUE 
##    1    19 

mmstats %>% select(-site) %>% summarize_all(median)
### A tibble: 1 × 5
##  K600_mean  K600_sd K600_meanlog K600_sdlog    shapiro_p
##      <dbl>    <dbl>        <dbl>      <dbl>        <dbl>
##1  29.41259 331.8827     1.832927   1.006331 1.718084e-57

library(cowplot)
cowplot::plot_grid(
  ggplot(mmstats, aes(x=K600_mean)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_sd)) + geom_density(fill='chartreuse3'),
  ggplot(mmstats, aes(x=K600_meanlog)) + geom_density(fill='navy'),
  ggplot(mmstats, aes(x=K600_sdlog)) + geom_density(fill='navy'))

image

Across sites, I think it's reasonable to say that the site averages for K600 are lognormally distributed.

aappling-usgs commented 8 years ago

What about KvQ? Here are plots from the MLE-based estimates of K, which we believe to be pretty weak:

ggplot(filter(mmdf, !is.na(K600) & !is.na(dischdaily)), aes(x=dischdaily, y=K600, color=site)) + 
  geom_point() + scale_x_log10() + scale_y_log10() +
  facet_wrap(~ site, scales='free') + guides(color=FALSE)

image

And for another randomly selected set of 20 sites, re-running code from the previous comment: image

And the above second set of 20 sites, plotted with non-logged K: image

The log-log version (previous) does as good or better a job of bringing out a relationship, IMO.

aappling-usgs commented 8 years ago

What priors could we use for a linear relationship between lnK and lnQ?

library(broom)
library(tidyr)
kq <- mmdf %>% 
  filter(!is.na(K600) & K600 > 0 & !is.na(dischdaily)) %>%
  group_by(site) %>%
  do(broom::tidy(lm(log(K600) ~ log(dischdaily), data=.))) %>%
  mutate(term = ifelse(term=='(Intercept)', 'intercept', 'slope')) %>%
  select(site, term, estimate) %>%
  spread(term, estimate)

ggplot(kq, aes(x=intercept, y=slope, color=site)) + geom_point()

image

plot_grid(
  ggplot(kq, aes(x=intercept)) + geom_density(fill='purple'),
  ggplot(kq, aes(x=slope)) + geom_density(fill='royalblue'))

image

These look reasonably normal to me, reasonably centered on intercept=2, slope=0.

aappling-usgs commented 8 years ago

The same plots for a new sample of 100 sites: image image

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838
aappling-usgs commented 8 years ago

Setting some default priors in specs.

np and Kn

Because

# compute stats
mmstats <- mmdf %>%
  group_by(site) %>%
  summarize(
    K600_median = median(K600, na.rm=TRUE),
    K600_mean = mean(K600, na.rm=TRUE),
    K600_sd = sd(K600, na.rm=TRUE),
    K600_medianlog = median(log(K600[K600 > 0]), na.rm=TRUE),
    K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
    K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
    shapiro_p = shapiro.test(K600)$p.value,
    shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
    skew = e1071::skewness(K600, na.rm=TRUE, type=2),
    skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats %>% select(-site) %>% summarize_all(mean)
# A tibble: 1 × 10
  K600_median K600_mean  K600_sd K600_medianlog K600_meanlog K600_sdlog   shapiro_p shapiro_p_log     skew skew_log
        <dbl>     <dbl>    <dbl>          <dbl>        <dbl>      <dbl>       <dbl>         <dbl>    <dbl>    <dbl>
1    6.738051  189.8936 2651.921       1.700344     1.770609   1.223438 0.002388875   0.001197968 19.92564 1.234337

we set

  # hyperparameters for non-hierarchical K600
  K600_daily_meanlog = log(6),
  K600_daily_sdlog = 1,

  # hyperparameters for hierarchical K600 - normal
  K600_daily_meanlog_mu = log(6),
  K600_daily_meanlog_sigma = 1,

Kn, Kl, and Kb

Because pooling ought to reduce the distance between K600_daily and K600_daily_pred, and because we've already selected

  K600_daily_sdlog = 1,

we set

  K600_daily_sdlog_scale = 0.3,

Kl

Because

> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838

we set

  # hyperparameters for hierarchical K600 - linear. defaults should be
  # reasonably constrained, not too wide
  lnK600_lnQ_intercept_mu = 2,
  lnK600_lnQ_intercept_sigma = 2.4,
  lnK600_lnQ_slope_mu = 0,
  lnK600_lnQ_slope_sigma = 0.5,

Kb

Because of the above and

ggplot(mmdf, aes(x=dischdaily, color=site)) + geom_density() + guides(color=FALSE) + scale_x_log10()

image we set

  # hyperparameters for hierarchical K600 - binned. -3:3 will catch most
  # streams to rivers as a first cut, though users should still modify
  lnK600_lnQ_nodes_centers = -3:3, # the x=lnQ values for the nodes
  lnK600_lnQ_nodes_meanlog = rep(log(6), length(K600_daily_lnQ_nodes)), # distribs for the y=K600 values of the nodes
  lnK600_lnQ_nodes_sdlog = rep(1, length(K600_daily_lnQ_nodes)),

and because

> mean(kq$slope)
[1] 0.05752507

we set

  lnK600_lnQ_nodediffs_sdlog = 0.05, # for centers 1 apart; for centers 0.2 apart, use sdlog=0.01
aappling-usgs commented 7 years ago

looks good. ran tests and vignettes, found no specs needing renaming.