fate-ewi / bayesdfa

Bayesian DFA with Stan
28 stars 13 forks source link

Specification of Variance-Covariance Matrices for Process and Observation Models in bayesdfa #8

Open isabellaghement opened 4 years ago

isabellaghement commented 4 years ago

Hoping Eric or Sean will see this and answer some of the questions I have...

I just can't tell what the function fit_dfa() in the bayesdfa package assumes as defaults for:

  1. The variance-covariance matrix R of the observation model errors vt;
  2. The variance-covariance matrix Q of the process model errors et.

Is it safe to assume that fit_dfa() imposes as a default:

• Multivariate normality on the observation model errors vt; • Univariate normality (for one common trend) or multivariate normality (for more than one common trend) on the process model errors; • A diagonal form for variance-covariance matrix R, with unequal variance on the diagonal and zeroes everywhere else; • A diagonal form for variance-covariance matrix Q, with unequal variance on the diagonal and zeroes everywhere else?

Is there a way to force fit_dfa() to accept other types of variance-covariance matrices (e.g., both R and Q unstructured; or R unstructured but Q diagonal)?

Also, is there a way to extract the estimated versions for R and Q to check what they look like?

In reading the help file for fit_dfa(), I found a varIndx argument which seems to be useful in specifying the variances of the observation model errors. If only the variances can be specified, does that mean that fit_dfa() implicitly assumes a diagonal matrix for the variance-covariance matrix of the observation model errors? Is there a similar index for controlling the variances of the variance-covariance matrix of the process model errors? Or are they assumed to be different by default?

Also, I don't quite get it from the help file for fit_dfa() how varIndx is supposed to work. If we have 4 response time series and varIndx = c(1,1,1,1), I understand that we are assuming the same residual variance for each time series. If varIndx = c(1,2,3,4), I also understand that we are assuming different residual variances for the four time series. But it's not clear to me how to specify varIndx if we would like the first two time series to have the same residual variance and the last two time series to have the same residual variance (but different from that of the first two time series).

The same questions apply more or less to the function find_dfa_trends(). In this function:

  1. Is a t-distribution assumed by default for both the observation and process model errors? Can we over-ride this assumption by setting nu_fixed=101? But then why does this assumption seem to be over-written by the option compare_normal = TRUE?

  2. Are the variance-covariance matrices of the errors associated with these models both assumed to be diagonal?

  3. Is the control exercised trough the variance= option applicable only to the observation error variance, in which case we can control whether its diagonal elements are all equal (variance = "equal") or unequal (variance = "unequal")?

  4. What if we wanted to incorporate a restriction on the observation error variances similar to that mentioned above (e.g., first two time series share one residual variance; last two time series share another residual variance)? Can this be achieved and how?

The function find_dfa_trends() produces output like this:

model num_trends looic cor error converge 1 3 3 122.1905 equal student-t FALSE 2 6 3 131.9730 equal normal FALSE 3 5 2 167.5646 equal normal TRUE 4 2 2 167.8493 equal student-t TRUE 5 4 1 191.6754 equal normal TRUE 6 1 1 193.9669 equal student-t TRUE

Does the cor heading refer to the variance = "equal" option controlling the variances of the observation model errors? Why not label this heading variance to avoid confusion, especially if the variance-covariance matrix of the observation model errors is assumed (?) to be diagonal so the input time series will actually be uncorrelated?

Does the error heading refer to the distribution of the observation model errors? But then what is the assumed distribution of the process model errors? Is it the same as the one for the corresponding process model errors (e.g., if observation model errors come from a multivariate t distribution, then process model errors are assumed to also come from a multivariate t; if observation model errors come from a multivariate normal distribution, then process model errors are assumed to also come from a multivariate normal distribution)?

Thanks very much!

seananderson commented 4 years ago

Below are some quick answers. @ericward-noaa, please correct me if I've made any mistakes here.

First, you can always dig into the model itself: https://github.com/fate-ewi/bayesdfa/blob/master/src/stan_files/dfa.stan

Although unfortunately the symbols used in the code don't always match those in the paper.

That gets called by: https://github.com/fate-ewi/bayesdfa/blob/master/R/fit_dfa.R

I just can't tell what the function fit_dfa() in the bayesdfa package assumes as defaults for:

The variance-covariance matrix R of the observation model errors vt; The variance-covariance matrix Q of the process model errors et. Is it safe to assume that fit_dfa() imposes as a default:

• Multivariate normality on the observation model errors vt; • Univariate normality (for one common trend) or multivariate normality (for more than one common trend) on the process model errors; • A diagonal form for variance-covariance matrix R, with unequal variance on the diagonal and zeroes everywhere else; • A diagonal form for variance-covariance matrix Q, with unequal variance on the diagonal and zeroes everywhere else?

The observation model errors are assumed to be independent and normal or correlated as MVN with an estimated covariance matrix. https://github.com/fate-ewi/bayesdfa/blob/6799ac9be46658277682f883c83cd3d2dc855bdd/src/stan_files/dfa.stan#L237-L248

The process model errors are assumed to be independent across time series with normal or Student-t deviations with an SD/scale of 1. They are random walks or have AR/MA components. https://github.com/fate-ewi/bayesdfa/blob/6799ac9be46658277682f883c83cd3d2dc855bdd/src/stan_files/dfa.stan#L185-L212

Is there a way to force fit_dfa() to accept other types of variance-covariance matrices (e.g., both R and Q unstructured; or R unstructured but Q diagonal)?

Not presently beyond that described above.

Also, is there a way to extract the estimated versions for R and Q to check what they look like?

Yes, although for book keeping reasons the observation error standard deviations don't come out as matrices. The Stan model is in obj$model where obj is output from fit_dfa(). The observation sigmas are in sigma.

s <- sim_dfa(num_trends = 1, num_years = 20, num_ts = 3)
m <- fit_dfa(y = s$y_sim, iter = 500, chains = 1, num_trends = 2, varIndx = c(1, 2, 3))
m$model
sigmas <- rstan::extract(m$model)$sigma
head(sigmas)
iterations      [,1]      [,2]       [,3]
      [1,] 0.1509345 0.3810271 0.04269370
      [2,] 0.3303992 0.4431337 0.03873101
      [3,] 0.3303663 0.2910783 0.18046498
      [4,] 0.3600734 0.4567858 0.03880281
      [5,] 0.4929840 0.5119158 0.11838449
      [6,] 0.3622429 0.3970342 0.09009145

The correlation matrix, if estimated, is available as Omega in the Stan model output.

In reading the help file for fit_dfa(), I found a varIndx argument which seems to be useful in specifying the variances of the observation model errors. If only the variances can be specified, does that mean that fit_dfa() implicitly assumes a diagonal matrix for the variance-covariance matrix of the observation model errors? Is there a similar index for controlling the variances of the variance-covariance matrix of the process model errors? Or are they assumed to be different by default?

The process deviations are assumed to be Student-t(df, 0, 1) or Normal(0, 1) and be independent across time series. https://github.com/fate-ewi/bayesdfa/blob/6799ac9be46658277682f883c83cd3d2dc855bdd/src/stan_files/dfa.stan#L187-L212

Also, I don't quite get it from the help file for fit_dfa() how varIndx is supposed to work. If we have 4 response time series and varIndx = c(1,1,1,1), I understand that we are assuming the same residual variance for each time series. If varIndx = c(1,2,3,4), I also understand that we are assuming different residual variances for the four time series. But it's not clear to me how to specify varIndx if we would like the first two time series to have the same residual variance and the last two time series to have the same residual variance (but different from that of the first two time series).

varIndx = c(1, 1, 2, 2)

The same questions apply more or less to the function find_dfa_trends(). In this function: Is a t-distribution assumed by default for both the observation and process model errors? Can we over-ride this assumption by setting nu_fixed=101? But then why does this assumption seem to be over-written by the option compare_normal = TRUE?

Looks like it assumes a t-distribution by default for the process model. Perhaps we should change that. Observation model is always normal or MVN. May be simplest just to fit a small number of fit_dfa() models yourself and compare. You don't necessarily need the complexity of the find trends function.

Are the variance-covariance matrices of the errors associated with these models both assumed to be diagonal?

Process or observation error? See my answers above.

Is the control exercised trough the variance= option applicable only to the observation error variance, in which case we can control whether its diagonal elements are all equal (variance = "equal") or unequal (variance = "unequal")?

This just controls varIndx. Controls whether they get a bunch of 1s or a sequence of different numbers.

What if we wanted to incorporate a restriction on the observation error variances similar to that mentioned above (e.g., first two time series share one residual variance; last two time series share another residual variance)? Can this be achieved and how?

The function find_dfa_trends() produces output like this:

model num_trends looic cor error converge 1 3 3 122.1905 equal student-t FALSE 2 6 3 131.9730 equal normal FALSE 3 5 2 167.5646 equal normal TRUE 4 2 2 167.8493 equal student-t TRUE 5 4 1 191.6754 equal normal TRUE 6 1 1 193.9669 equal student-t TRUE

Does the cor heading refer to the variance = "equal" option controlling the variances of the observation model errors? Why not label this heading variance to avoid confusion, especially if the variance-covariance matrix of the observation model errors is assumed (?) to be diagonal so the input time series will actually be uncorrelated?

Yes. Probably should be called variance. Again, may be simplest to just use fit_dfa().

Does the error heading refer to the distribution of the observation model errors? But then what is the assumed distribution of the process model errors? Is it the same as the one for the corresponding process model errors (e.g., if observation model errors come from a multivariate t distribution, then process model errors are assumed to also come from a multivariate t; if observation model errors come from a multivariate normal distribution, then process model errors are assumed to also come from a multivariate normal distribution)?

No, the process deviations. Observation errors are normal or MVN depending on est_correlation.

seananderson commented 4 years ago

I just fixed the est_correlation = TRUE bug, although that will only be fixed here, not on CRAN yet.

The correlation matrix is reconstructed in the generated quantities block and is available in the Stan model output as Omega.

ericward-noaa commented 4 years ago

@seananderson did a great job answering these. I think we have a couple tiny bugs to fix.

A couple other points:

isabellaghement commented 4 years ago

Yes, excellent answers from @seananderson and helpful clarifications from @ericward-nooa. Thank you very much to both of you!

So if est_correlation = FALSE, then observation errors have a diagonal variance-covariance matrix with $sigma^2$ on the diagonal? If est_correlation = TRUE, then observation errors have an unstructured variance-covariance matrix? (Normality would be assumed in both of these cases.) I don't understand the bug involved here - what is the effect of this bug on the CRAN version of bayesdfa?

I played with the example in the bayesdfa vignette (the vignette without covariates) and found that the looic preferred the model with 3 trends even though the simulated data assumed 2 common underlying trends. So the third trend was artificial. When I increased the number of trends to 10, the looic worked as expected (though not for all choices of random seed). It would be interesting to conduct simulations on the propensity of bayesdfa versus MARSS to find non-existent trends. This makes me wonder if the vignette example should eventually be changed so that it produces the expected number of trends. Or perhaps a word of caution could be added in the vignette that trends must have a meaningful biological interpretation or else they could be artificial.

For a small number of trends, it would also be good to have control of the size of the plotting symbol when showing the fitted values produced by the model for each time series. The default option size = 0.5 works great for 10 series, say, but not so great for the 4 series example in the vignette.

The bayesdfa package is so much easier to use than MARSS! 😊

seananderson commented 4 years ago

So if est_correlation = FALSE, then observation errors have a diagonal variance-covariance matrix with $sigma^2$ on the diagonal? If est_correlation = TRUE, then observation errors have an unstructured variance-covariance matrix?

Yes. And the sigmas can be shared, independent, or grouped.

I don't understand the bug involved here - what is the effect of this bug on the CRAN version of bayesdfa?

For me, the est_correlation = TRUE would generate a bunch of Stan errors. I'm not sure if this is dependent on the version of Stan or not. The error was just that an object was initiated but never given a value. Maybe that didn't used to create an error.

I played with the example in the bayesdfa vignette (the vignette without covariates) and found that the looic preferred the model with 3 trends even though the simulated data assumed 2 common underlying trends. So the third trend was artificial. When I increased the number of trends to 10, the looic worked as expected (though not for all choices of random seed). It would be interesting to conduct simulations on the propensity of bayesdfa versus MARSS to find non-existent trends. This makes me wonder if the vignette example should eventually be changed so that it produces the expected number of trends. Or perhaps a word of caution could be added in the vignette that trends must have a meaningful biological interpretation or else they could be artificial.

I'm surprised you got a model with 10 trends to fit! Keep in mind that LOOIC is stochastic. It may take a lot of samples to converge on a consistent answer. There's also uncertainty around LOOIC and there's a way with the looic package to make a comparison of 2 LOOIC values. Neither AIC nor LOOIC is really ideal for comparing timeseries models, but we aren't going to solve that here.

For a small number of trends, it would also be good to have control of the size of the plotting symbol when showing the fitted values produced by the model for each time series. The default option size = 0.5 works great for 10 series, say, but not so great for the 4 series example in the vignette.

Good idea. I'll add an issue.