Closed aappling-usgs closed 7 years ago
How many of the distributions involved should be lognormal? For example, in Kn models, should K600_daily_mu be lognormally distributed with K600_daily_mu_meanlog, K600_daily_mu_sdlog, AND should K600_daily be lognormally distributed around K600_daily_mu? Or just one? In Kl models, how should the intercept and slope be distributed?
Bringing some data to this, using unpooled MLE estimates of K600 for a random sampling of 20 of the powstreams sites.
library(mda.streams)
library(dplyr)
login_sb()
mmrs <- list_metab_runs() # use this to choose the metab_run to use in list_metab_models below
mms <- parse_metab_model_name(list_metab_models("151125 0.0.13 PRK_initial")) %>% tibble::rownames_to_column('name')
qs <- list_sites('dischdaily_calcDMean')
goodmms <- filter(mms, site %in% qs)
# download and package 20 sites worth of K estimates and Q data
mmobjs <- lapply(sample(goodmms$name, 20), get_metab_model, update_sb = F)
mmdf <- lapply(mmobjs, function(mm) {
Qs <- get_ts(c('sitedate_calcLon','dischdaily_calcDMean'), mm@info$config$site) %>% v() %>%
select(date=sitedate, dischdaily)
v(mm@fit) %>%
select(date, K600) %>%
mutate(site=mm@info$config$site) %>%
left_join(Qs, by='date')
}) %>% bind_rows()
# normal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + xlim(0,100) + guides(fill=FALSE, color=FALSE)
# lognormal version
ggplot(mmdf, aes(x=K600, fill=site, color=site)) + geom_density() + facet_wrap(~ site, scales='free') + scale_x_log10() + guides(fill=FALSE, color=FALSE)
Within sites, from these plots I'm thinking we could reasonably go either way: the values I probably believe are plausibly normal, while when you include the high (outlier?) values the distributions look plausibly lognormal.
And across sites?
# compute stats
mmstats <- mmdf %>%
group_by(site) %>%
summarize(
K600_mean = mean(K600, na.rm=TRUE),
K600_sd = sd(K600, na.rm=TRUE),
K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
shapiro_p = shapiro.test(K600)$p.value,
shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
skew = e1071::skewness(K600, na.rm=TRUE, type=2),
skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats
# A tibble: 20 × 9
site K600_mean K600_sd K600_meanlog K600_sdlog shapiro_p shapiro_p_log skew skew_log
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 nwis_01465798 7.198422 3.635844 1.8902919 0.3891734 6.071134e-12 1.227479e-04 3.3942038 0.6028886
2 nwis_01567000 8.182843 4.580731 1.7330760 1.3464897 2.387805e-01 2.500129e-05 0.5045585 -2.8620618
3 nwis_01645704 28.235169 283.970179 2.1160539 0.8291748 1.806542e-68 6.043007e-44 23.4001159 2.3439929
4 nwis_01646000 27.763019 208.421871 2.0116432 0.8945612 1.371768e-51 1.809392e-33 16.8783864 2.8305405
5 nwis_02156500 182.478068 4554.267582 1.7065460 1.0452377 1.238946e-79 9.788410e-53 27.3355427 -0.5406503
6 nwis_02336410 35.651939 335.589733 2.0858670 0.9103225 7.466696e-74 7.891511e-51 20.4001725 2.3595593
7 nwis_03539778 313.003488 6383.821676 2.0255417 1.2863783 4.902065e-43 1.079111e-16 21.4687177 0.7552151
8 nwis_04085108 117.249270 526.769261 1.3458049 2.0521171 7.992872e-21 2.382718e-11 6.2568599 1.5946011
9 nwis_04119400 30.590006 673.183472 0.7518339 1.1555543 1.391868e-54 6.895044e-23 28.4898009 0.9262468
10 nwis_04121944 20.890242 364.332030 2.0945655 0.5689432 2.125662e-74 2.028055e-53 40.5541868 2.2284317
11 nwis_04124000 5.000193 52.990446 0.5104455 0.9884796 1.398933e-73 2.176552e-43 21.0873063 -0.2255121
12 nwis_04176500 25.232869 156.022403 2.3627595 1.0241817 1.094426e-58 2.510850e-23 18.5562361 0.1564391
13 nwis_04200500 178.584426 4345.392044 2.3723908 1.2118335 1.208287e-58 3.046893e-36 31.4260765 -0.8737954
14 nwis_04208000 221.752804 4057.181260 1.2866989 1.3089821 8.547916e-58 7.651414e-39 26.9860100 3.2115262
15 nwis_05435950 11.713657 161.328172 1.7622591 0.5301671 2.058073e-69 1.712207e-46 32.4087404 1.2260452
16 nwis_06893970 509.131063 11718.171833 1.7755621 1.5579263 3.651980e-48 6.420996e-23 24.8492486 1.3568746
17 nwis_07061270 48.934465 328.175577 2.8060060 0.9313518 1.032592e-75 2.273741e-43 16.6282929 1.3638972
18 nwis_07241000 1735.503764 15041.843264 3.0261689 2.0386724 1.246902e-37 6.589762e-15 10.4683673 1.4776155
19 nwis_08374550 6.251070 43.870864 1.1445970 0.7040463 4.092336e-46 1.091311e-26 17.9850127 0.6775595
20 nwis_08446500 5.933354 28.079614 1.4747791 0.5463225 2.581376e-57 5.252483e-25 29.8624879 1.2086812
# a little more on within-site normality: usually non-normal by the shapiro-wilk normality test.
# if you believe that all the MLE-based K estimates are believable, of course...which you don't...
> table(mmstats$shapiro_p < 0.05)
##FALSE TRUE
## 1 19
mmstats %>% select(-site) %>% summarize_all(median)
### A tibble: 1 × 5
## K600_mean K600_sd K600_meanlog K600_sdlog shapiro_p
## <dbl> <dbl> <dbl> <dbl> <dbl>
##1 29.41259 331.8827 1.832927 1.006331 1.718084e-57
library(cowplot)
cowplot::plot_grid(
ggplot(mmstats, aes(x=K600_mean)) + geom_density(fill='chartreuse3'),
ggplot(mmstats, aes(x=K600_sd)) + geom_density(fill='chartreuse3'),
ggplot(mmstats, aes(x=K600_meanlog)) + geom_density(fill='navy'),
ggplot(mmstats, aes(x=K600_sdlog)) + geom_density(fill='navy'))
Across sites, I think it's reasonable to say that the site averages for K600 are lognormally distributed.
What about KvQ? Here are plots from the MLE-based estimates of K, which we believe to be pretty weak:
ggplot(filter(mmdf, !is.na(K600) & !is.na(dischdaily)), aes(x=dischdaily, y=K600, color=site)) +
geom_point() + scale_x_log10() + scale_y_log10() +
facet_wrap(~ site, scales='free') + guides(color=FALSE)
And for another randomly selected set of 20 sites, re-running code from the previous comment:
And the above second set of 20 sites, plotted with non-logged K:
The log-log version (previous) does as good or better a job of bringing out a relationship, IMO.
What priors could we use for a linear relationship between lnK and lnQ?
library(broom)
library(tidyr)
kq <- mmdf %>%
filter(!is.na(K600) & K600 > 0 & !is.na(dischdaily)) %>%
group_by(site) %>%
do(broom::tidy(lm(log(K600) ~ log(dischdaily), data=.))) %>%
mutate(term = ifelse(term=='(Intercept)', 'intercept', 'slope')) %>%
select(site, term, estimate) %>%
spread(term, estimate)
ggplot(kq, aes(x=intercept, y=slope, color=site)) + geom_point()
plot_grid(
ggplot(kq, aes(x=intercept)) + geom_density(fill='purple'),
ggplot(kq, aes(x=slope)) + geom_density(fill='royalblue'))
These look reasonably normal to me, reasonably centered on intercept=2, slope=0.
The same plots for a new sample of 100 sites:
> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838
Setting some default priors in specs.
Because
# compute stats
mmstats <- mmdf %>%
group_by(site) %>%
summarize(
K600_median = median(K600, na.rm=TRUE),
K600_mean = mean(K600, na.rm=TRUE),
K600_sd = sd(K600, na.rm=TRUE),
K600_medianlog = median(log(K600[K600 > 0]), na.rm=TRUE),
K600_meanlog = mean(log(K600[K600 > 0]), na.rm=TRUE),
K600_sdlog = sd(log(K600[K600 > 0]), na.rm=TRUE),
shapiro_p = shapiro.test(K600)$p.value,
shapiro_p_log = shapiro.test(log(K600[K600 > 0]))$p.value,
skew = e1071::skewness(K600, na.rm=TRUE, type=2),
skew_log = e1071::skewness(log(K600[K600 > 0]), na.rm=TRUE, type=2))
> mmstats %>% select(-site) %>% summarize_all(mean)
# A tibble: 1 × 10
K600_median K600_mean K600_sd K600_medianlog K600_meanlog K600_sdlog shapiro_p shapiro_p_log skew skew_log
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6.738051 189.8936 2651.921 1.700344 1.770609 1.223438 0.002388875 0.001197968 19.92564 1.234337
we set
# hyperparameters for non-hierarchical K600
K600_daily_meanlog = log(6),
K600_daily_sdlog = 1,
# hyperparameters for hierarchical K600 - normal
K600_daily_meanlog_mu = log(6),
K600_daily_meanlog_sigma = 1,
Because pooling ought to reduce the distance between K600_daily and K600_daily_pred, and because we've already selected
K600_daily_sdlog = 1,
we set
K600_daily_sdlog_scale = 0.3,
Because
> mean(kq$intercept)
[1] 1.904081
> sd(kq$intercept)
[1] 2.366613
> mean(kq$slope)
[1] 0.05752507
> sd(kq$slope)
[1] 0.5173838
we set
# hyperparameters for hierarchical K600 - linear. defaults should be
# reasonably constrained, not too wide
lnK600_lnQ_intercept_mu = 2,
lnK600_lnQ_intercept_sigma = 2.4,
lnK600_lnQ_slope_mu = 0,
lnK600_lnQ_slope_sigma = 0.5,
Because of the above and
ggplot(mmdf, aes(x=dischdaily, color=site)) + geom_density() + guides(color=FALSE) + scale_x_log10()
we set
# hyperparameters for hierarchical K600 - binned. -3:3 will catch most
# streams to rivers as a first cut, though users should still modify
lnK600_lnQ_nodes_centers = -3:3, # the x=lnQ values for the nodes
lnK600_lnQ_nodes_meanlog = rep(log(6), length(K600_daily_lnQ_nodes)), # distribs for the y=K600 values of the nodes
lnK600_lnQ_nodes_sdlog = rep(1, length(K600_daily_lnQ_nodes)),
and because
> mean(kq$slope)
[1] 0.05752507
we set
lnK600_lnQ_nodediffs_sdlog = 0.05, # for centers 1 apart; for centers 0.2 apart, use sdlog=0.01
looks good. ran tests and vignettes, found no specs needing renaming.
We have the choice of relating K to Q as K ~ Q, ln(K) ~ ln(Q), K ~ ln(Q), or something else. We know that K600 is always >0 and can be in the tens or hundreds, and preliminary model runs with ln(K) ~ ln(Q) haven't looked bad, and @cyack's models use ln(K) ~ ln(Q). So at least for now, I'd like to use ln(K) ~ ln(Q), and I'd like to use them in all models for consistency across the package. This requires:
b_Kb
models with ln-ln, renaming parameters as necessary to indicate distribsplot_distribs
to reflect ln-ln relationships forb_Kb
priorsb_Kb
priors to put inspecs()
b_np
and all otherb
models)plot_distribs
to reflect lognormal distribution of K600_dailyb_Kl
models to ln-lnplot_distribs
to reflect ln-ln relationships forb_Kb
priorsb_Kl
priors to put inspecs()
b_Kn
models to ln-lnplot_distribs
to reflect ln-ln relationships forb_Kb
priorsb_Kn
priors to put inspecs()
Caveats
I have some hesitations about going log-log:
Precedents
Fig 3 of Hall et al. 2012 has K600 (d-1) vs discharge (m3 s-1) on linear x and y axes, but this is just for a sample with narrow range for a single river.
Table 2 of Raymond et al. 2012 has k600 (m d-1) as functions of intercepts plus [discharge (m3 s-1) raised to small exponents] multiplied by velocity, slope, and/or depth also raised to exponents.
For lakes, Figs 2 & 3 of Vachon and Prairie 2013 have k600 (cm h^-1) as a linear function of wind speed (m s^-1) and as a linear function of log maximum wave height (m). they cite 6 other papers that also use linear functions to relate k600 to wind speed. Different systems, yes, but again we have k600 (m d-1) showing distributions that aren't being logged before modeling.
For large rivers and estuaries, Fig 2 of Raymond and Cole 2001 relates k600 (cm h^-1) to wind speed (m s^-1) as log(k600) ~ a + b*U.
@cyack suggested lognormal distributions for K within discharge bins, where the Q bins are partitioned on a log scale, with each bin's K also lognormally distributed relative to the previous bin's K #224.
streamMetabolizer's
metab_Kmodel
uses log-log relationships by default: https://github.com/USGS-R/streamMetabolizer/blob/develop/R/specs.R#L334 . One benefit is that this relationship ensures K is positive, though for Stan models it's possible to ensure that in other ways, e.g., by declaring K600_daily to have a lower bound of 0.Citations
Hall, R. O., T. A. Kennedy, and E. J. Rosi-Marshall. 2012. Air–water oxygen exchange in a large whitewater river. Limnology and Oceanography: Fluids and Environments 2:1–11.
Raymond, P. A., and J. J. Cole. 2001. Gas Exchange in Rivers and Estuaries: Choosing a Gas Transfer Velocity. Estuaries 24:312–317.
Raymond, P. A., C. J. Zappa, D. Butman, T. L. Bott, J. Potter, P. Mulholland, A. E. Laursen, W. H. McDowell, and D. Newbold. 2012. Scaling the gas transfer velocity and hydraulic geometry in streams and small rivers. Limnology and Oceanography: Fluids and Environments 2:41–53.
Vachon, D., and Y. T. Prairie. 2013. The ecosystem size and shape dependence of gas transfer velocity versus wind speed relationships in lakes. Canadian Journal of Fisheries and Aquatic Sciences 70:1757–1764.