atsa-es / MARSS

Multivariate Autoregressive State-Space Modeling with R
https://atsa-es.github.io/MARSS/
GNU General Public License v3.0
51 stars 28 forks source link

Compare DFA models with spatial structure #125

Open slarge opened 3 years ago

slarge commented 3 years ago

I want to identify common trends in environmental conditions in estuaries across the Northeast US. Each estuary has about 4 stations where conditions are measured monthly. These stations are very close to each other and I want to test different hypotheses of spatial structure (similar to ATSA chapter 7.7. I think that I have set up the Z matrices in a way that could test this, but I am not very confident! I tried to create an example from the Lake WA data. From the code (below), the "two distinct sites" hypothesis seems to be supported. Is there a better way to nest subpopulations within a DFA framework? Thanks in advance for any feedback.

` library(MARSS) library(dplyr) library(tidyr)

load the data (there are 3 datasets contained here)

data(lakeWAplankton, package = "MARSS")

we want lakeWAplanktonTrans, which has been transformed

so the 0s are replaced with NAs and the data z-scored

all_dat <- lakeWAplanktonTrans %>% as_tibble() %>% filter(Year >= 1980, Year <= 1989) %>% mutate(date = as.Date(paste0(Year, "-", Month, "-1"))) %>% select(date, Cryptomonas, Diatoms, Greens, Unicells, Other.algae) %>% pivot_longer(cols = c("Cryptomonas", "Diatoms", "Greens", "Unicells", "Other.algae"))

Create a new variable "site" with some additional error

site_dat <- bind_rows(all_dat %>% mutate(site = "A"), all_dat %>% group_by(name) %>% rowwise() %>% mutate(site = "B", value = value + rnorm(1, mean = 0, sd = 1)))

site_wide <- site_dat %>% mutate(namesite = paste0(name, "", site)) %>% select(-name, -site) %>% pivot_wider(names_from = date, values_from = value)

site_names <- site_wide[[1]] site_wide <- as.matrix(site_wide[, -1]) row.names(site_wide) <- site_names N_ts <- dim(site_wide)[1]

H1 two distinct sites

Z_vals <- list("z11", 0, 0, "z11", "z12", 0, "z11", "z12", "z13", "z11", "z12", "z13", "z11", "z12", "z13", "z21", "z22", "z23", "z21", "z22", "z23", "z21", "z22", "z22", "z21", "z22", "z23", "z21", "z22", "z23") Z1 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

H2 one site

Z_vals <- list("z011", 0, 0, "z021", "z022", 0, "z031", "z032", "z033", "z041", "z042", "z043", "z051", "z052", "z053", "z061", "z062", "z063", "z071", "z072", "z073", "z081", "z082", "z083", "z091", "z092", "z093", "z101", "z102", "z103") Z2 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

aa <- "zero" ## 'aa' is the offset/scaling

DD <- "zero" # matrix(0,mm,1) ## 'DD' and 'd' are for covariates dd <- "zero" # matrix(0,1,wk_last) ## 'RR' is var-cov matrix for obs errors RR <- "diagonal and unequal" mm <- 3 ## number of processes BB <- "identity" # diag(mm) ## 'BB' is identity: 1's along the diagonal & 0's elsewhere uu <- "zero" # matrix(0, mm, 1) ## 'uu' is a column vector of 0's CC <- "zero" # matrix(0, mm, 1) ## 'CC' and 'cc' are for covariates cc <- "zero" # matrix(0, 1, wk_last) QQ <- "identity" # diag(mm) ## 'QQ' is identity

list with specifications for model vectors/matrices

mod_list <- list(Z = Z1, A = aa, D = DD, d = dd, R = RR, B = BB, U = uu, C = CC, c = cc, Q = QQ) init_list <- list(x0 = matrix(rep(0, mm), mm, 1)) con_list <- list(maxit = 3000, allow.degen = TRUE)

fit MARSS H1

dfa_1 <- MARSS(y = site_wide, model = mod_list, inits = init_list, form = "dfa", control = con_list)

fit MARSS H2

mod_list$Z <- Z2 dfa_2 <- MARSS(y = site_wide, model = mod_list, inits = init_list, control = con_list)

dfa_1$AICc ## 3308.385 dfa_2$AICc ## 3320.462`

eeholmes commented 3 years ago

Hi Scott,

First, as you discovered (because you got it to run), you have to leave off form="dfa". You don't need it since you fully specified the model with mod_list.

Second, your H1 Z matrix is close, but not quite. You want the the Z matrix to look like so Z Z So identical Z's to be stacked on top of each other since the 2 sites are responding identically to the 3 trends.

So

Z_vals <- list(
"z11", 0, 0,
"z21", "z22", 0,
"z31", "z32", "z33",
"z41", "z42", "z43",
"z51", "z52", "z53",
"z11", 0, 0,
"z21", "z22", 0,
"z31", "z32", "z32",
"z41", "z42", "z43",
"z51", "z52", "z53")
Z1 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

or you could make it like so since the Z_vals will be recycled as needed to fill rows 6:10.

Z_vals <- list(
"z11", 0, 0,
"z21", "z22", 0,
"z31", "z32", "z33",
"z41", "z42", "z43",
"z51", "z52", "z53")
Z1 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

Third, you need to demean the data for a DFA, esp since you set A="zero" and U="zero" (correctly). Those 2 lead to a zero-mean model so you need your data to be zero-mean also. In this context, I think you don't want to scale the data (rescale variance to 1.

site_wide <- zscore(site_wide, mean.only=TRUE)

The base R scale() function does the same thing but applies the scaling over columns while the MARSS function zscore() applies it over rows.

Fourth, this is not crucial, but DFA models take a long time to converge. I often do a BFGS polish to make sure I got to the maximum. It didn't make much of a difference in this case.

Fifth, you added white noise with sd=1. That is huge relative to the variance in the data and won't let you figure out if the model is working. H1 should be chosen but with sd=1, it won't since the site B data is basically all noise. I changed it to sd=sqrt(0.2).

Here is my code. With the lower sd and set.seed(123)

dfa_1b$AICc ## 2446.082
dfa_2b$AICc ## 2465.291
library(MARSS)
library(dplyr)
library(tidyr)

set.seed(123)

#load the data (there are 3 datasets contained here)

data(lakeWAplankton, package = "MARSS")

#we want lakeWAplanktonTrans, which has been transformed so the 0s are replaced with NAs and the data z-scored

all_dat <- lakeWAplanktonTrans %>%
  as_tibble() %>%
  filter(Year >= 1980,
         Year <= 1989) %>%
  mutate(date = as.Date(paste0(Year, "-", Month, "-1"))) %>%
  select(date, Cryptomonas, Diatoms, Greens, Unicells,
         Other.algae) %>%
  pivot_longer(cols = c("Cryptomonas", "Diatoms", "Greens", "Unicells",
                        "Other.algae"))

# Create a new variable "site" with some additional error

site_dat <- bind_rows(all_dat %>%
                        mutate(site = "A"),
                      all_dat %>%
                        group_by(name) %>%
                        rowwise() %>% mutate(site = "B",
                                             value = value + rnorm(1, mean = 0, sd = sqrt(0.2))))

site_wide <- site_dat %>%
  mutate(name_site = paste0(name, "_", site)) %>%
  select(-name, -site) %>%
  pivot_wider(names_from = date, values_from = value)

site_names <- site_wide[[1]]
site_wide <- as.matrix(site_wide[, -1])
row.names(site_wide) <- site_names
N_ts <- dim(site_wide)[1]

# H1 two distinct sites

Z_vals <- list(
  "z11", 0, 0,
  "z21", "z22", 0,
  "z31", "z32", "z33",
  "z41", "z42", "z43",
  "z51", "z52", "z53")
Z1 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

# H2 one site

Z_vals <- list("z011", 0, 0,
               "z021", "z022", 0,
               "z031", "z032", "z033",
               "z041", "z042", "z043",
               "z051", "z052", "z053",
               "z061", "z062", "z063",
               "z071", "z072", "z073",
               "z081", "z082", "z083",
               "z091", "z092", "z093",
               "z101", "z102", "z103")
Z2 <- matrix(Z_vals, nrow = N_ts, ncol = 3, byrow = TRUE)

aa <- "zero" ## 'aa' is the offset/scaling

DD <- "zero" # matrix(0,mm,1) ## 'DD' and 'd' are for covariates
dd <- "zero" # matrix(0,1,wk_last) ## 'RR' is var-cov matrix for obs errors
RR <- "diagonal and unequal"
mm <- 3 ## number of processes
BB <- "identity" # diag(mm) ## 'BB' is identity: 1's along the diagonal & 0's elsewhere
uu <- "zero" # matrix(0, mm, 1) ## 'uu' is a column vector of 0's
CC <- "zero" # matrix(0, mm, 1) ## 'CC' and 'cc' are for covariates
cc <- "zero" # matrix(0, 1, wk_last)
QQ <- "identity" # diag(mm) ## 'QQ' is identity

# list with specifications for model vectors/matrices

mod_list <- list(Z = Z1, A = aa, D = DD, d = dd, R = RR, B = BB,
                 U = uu, C = CC, c = cc, Q = QQ)
init_list <- list(x0 = matrix(rep(0, mm), mm, 1))
con_list <- list(maxit = 3000, allow.degen = TRUE)

site_wide <- zscore(site_wide, mean.only = TRUE)

# fit MARSS H1

dfa_1 <- MARSS(y = site_wide, model = mod_list, inits = init_list, 
               control = con_list)
dfa_1b <- MARSS(y = site_wide, model = mod_list, inits = dfa_1, method="BFGS")

# fit MARSS H2

mod_list$Z <- Z2
dfa_2 <- MARSS(y = site_wide, model = mod_list, inits = init_list,
               control = con_list)
dfa_2b <- MARSS(y = site_wide, model = mod_list, inits = dfa_2, method="BFGS")

dfa_1b$AICc ## 2446.082
dfa_2b$AICc ## 2465.291
slarge commented 3 years ago

Eli, Thank you so much for your thoughtful reply! I want to make sure I understand how the Z matrix should be set up to compare whether there is spatial structure (i.e., plankton from site A are responding differently than site B) or not (i.e., all plankton are responding independently regardless of site). You suggest to set up the Z matrix such that "identical Z's to be stacked on top of each other since the 2 sites are responding identically to the 3 trends". If site A are row 1:5, and site B are row 6:10, to me, this suggests that row1 and row6 have the same Z structure, row2 and row7, etc; which are from different sites. However, from the output of the DFA models in your example, it seems to support the 2-site hypothesis. Thank you for clarifying!

eeholmes commented 3 years ago

Scott,

The simulated data have H1 structure since dat in rows 6:10 are just data in rows 1:5 + error. That's the structure of your H1 model. Two sites with the same response to Z (since rows 6:10 are duplicates of rows 1:5).

I see you have H1 labeled as "two distinct sites". They are 2 sites that respond the same to Z. So like different sites within one lake ecosystem. For example, the unicells in the lake are z_1x_1 + z_2x_2 + z_3*x_3 (hidden pop state trajectory). Both sites are measuring that with different amounts of observation error.

You have H2 labeled as one site. I would label H2 as 2 sites that respond differently to the environment (the x's trajectories). So the unicell population trajectory that site A is tracking is different (different z's) than site B.

The selected model is H1 (correctly) .

dfa_1b$AICc ## 2446.082
dfa_2b$AICc ## 2465.291

Note the mean R estimates seem ok. The site B estimates should be on average 0.2 greater than site A:

mean(coef(dfa_1b)$R[6:10]-coef(dfa_1b)$R[1:5]) # 0.1814692

Eli