ISA-tools / stato

This is the development repository for the STATistics Ontology (STATO). For more information and demonstration on the ontology content, please visit its website:
http://stato-ontology.org/

Dependence structure #28

Closed nicholst closed 5 years ago

nicholst commented 9 years ago

I was hoping to find some terms for describing dependence found in longitudinal, repeated measures or time series data. The reason is that over in NIDM we've found the need to itemise the types of dependence assumptions on the data. In our issue #176 we've defined a Noise Dependence term with values

If you wanted to take the lead from SAS's PROC MIXED, they define 23 different covariance structures! (Our "SeriallyCorrelatedNoise" corresponds to SAS's "Toeplitz").

Anyway, I'm happy to propose some terms, but wanted some suggestions on how exhaustive I should try to be.

PS: Searching STATO for "dependent" brought me to "Repeated Measures ANOVA" STATO_0000260, which seems quite opinionated in its description, concluding:

Repeated measure ANOVA use in case of unbalanced design is discouraged as it leads to violation of conditions of applicability.

With various modern tools (e.g. R's lme4 or nlme) imbalanced data is not a problem, and nlme in particular can accommodate dependence assumptions other than sphericity. A more modern entry might go

Repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit. Repeated measure ANOVA is sensitive to departure from normality (evaluation using Bartlett's test), more so when there are unbalanced groups (i.e. different sizes of sample populations).

... dropping the reference to sphericity.

proccaserra commented 9 years ago

@cmaumet, @nicholst: the definition for 'repeated measure anova' is now updated. thx! Now I'd like to follow up on the notion of 'dependence structure'. First, I like the change from 'noise' to 'error'. Then, a question: is 'dependence structure' restricted to the 'error' variable only, or should it be made more generic, like 'variable dependence structure'? Another question: why use the name 'error dependence' to refer to 'error dependence structure'? How about having classes such as:

- dependence structure
  - independent structure
  - serially correlated structure
  - compound symmetric structure
  - arbitrarily correlated structure

all of which would be used to 'denote' a variable in general, an error in a more specific way.

Along the same lines (as discussed in https://github.com/incf-nidash/nidm/pull/193), instead of having 'errorHomogenous || Heterogeneous Variance', why not rely on the STATO 'homoskedasticity hypothesis' set to true/false?

nicholst commented 9 years ago

@proccaserra, thanks for the repeated measures update.

I was just thinking about our "error" dependence assumptions this morning, and they are a little problematic when looking at statistical modelling as a whole. Here's the deal...

Within the scope of brain imaging and the "mass univariate" model, we generally only consider linear models, i.e. a general linear model (GLM),

$$Y = X\beta + \varepsilon$$

where Y is the data, X is a matrix of regressors, beta is a vector of fixed and unknown parameters and epsilon is the random error. In this setting, saying

$$\varepsilon \sim N(0, \sigma^2 I)$$

is the same thing as saying

$$Y \sim N(X\beta, \sigma^2 I)$$

In fact, this equivalent data/noise specification even holds for a more general case, that of non-linear models with additive errors

$$Y = g(X; \beta) + \varepsilon$$

where g() is some function that maps predictors X and parameters beta to a prediction of the data (think of fitting an arbitrary curve to data).

So! In NIDM, we have happily specified all of our distributional and dependence assumptions in terms of the error, which we can do because (1) we have a model with additive error and (2) we have only considered fixed effects models (no randomness in beta).

MORE GENERALLY, this won't always be true. Let's look at the work-horse of clinical trials, binary response data. That requires logistic regression which takes the form

So what is the assumption on the data Y now? That they are independent Bernoulli trials with success probability given by the logistic function

What is the error here? It is inextricably a part of the data. (Note there's no epsilon).
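For reference, the standard textbook form of logistic regression can be written as below. This is a sketch in conventional notation: the subscript $i$, the row vector $x_i$ (row $i$ of X) and the symbol $p_i$ are assumptions introduced here, not notation taken from the comment.

$$Y_i \mid x_i \sim \mathrm{Bernoulli}(p_i) \ \text{independently}, \qquad p_i = \frac{1}{1 + \exp(-x_i \beta)}, \qquad \log\frac{p_i}{1 - p_i} = x_i \beta$$

Because $\operatorname{Var}(Y_i) = p_i (1 - p_i)$ is fixed by the mean, there is no separate additive error term to which a distribution or dependence structure could be attached.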

Why did I opt for "error" instead of "data"? Because users often get very confused when you say "I have to assume my data is independently distributed". They reply, "But you have a linear regression of (e.g.) child height with year, and the data for year=5 is certainly more similar to data for year=4 than year=1... so how can it be independent?!" Of course, the issue is that the data is heterogeneous, but the error assumptions are (almost always) homogeneous. (The technical term is 'The data are conditionally independent'.) Hence I prefer to talk about the assumptions on the error (of an additive-error model).
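The height-versus-year confusion can be illustrated with a small simulation. This is only a sketch: the intercept, slope, error standard deviation, sample size and the helper `lag1_autocorr` are all made-up assumptions, not values from NIDM or STATO. It shows that neighbouring observations look strongly related in the raw data, while the residuals from the fitted line do not.

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(v):
    """Sample lag-1 autocorrelation of a sequence (hypothetical helper)."""
    v = np.asarray(v, dtype=float) - np.mean(v)
    return float(np.sum(v[:-1] * v[1:]) / np.sum(v * v))

# Simulated data: height grows linearly with year, with independent Gaussian errors.
year = np.linspace(1.0, 10.0, 200)
error = rng.normal(0.0, 3.0, size=year.size)   # independent, homogeneous errors
height = 80.0 + 6.0 * year + error             # heterogeneous data (mean changes with year)

# Ordinary least squares fit of height on year.
X = np.column_stack([np.ones_like(year), year])
beta_hat, *_ = np.linalg.lstsq(X, height, rcond=None)
residual = height - X @ beta_hat

raw_r = lag1_autocorr(height)    # dominated by the shared trend
res_r = lag1_autocorr(residual)  # close to zero once the trend is removed

print(f"lag-1 autocorrelation of raw heights: {raw_r:.2f}")
print(f"lag-1 autocorrelation of residuals:   {res_r:.2f}")
```

The similarity between year=4 and year=5 comes entirely from the mean structure; conditional on the regressors, the observations are independent.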

So! We need to proceed cautiously, because it could be argued that logistic regression is used way more in science as a whole than the General Linear Model (i.e. additive, Gaussian error).

I fear that to be totally general we need to talk about data assumptions. OR, we need to clarify that we're talking about error assumptions for additive models.

CC: @cmaumet

nicholst commented 9 years ago

@proccaserra @cmaumet - I've made various edits to my previous comment to fix markdown errors; be sure to view the current version on GitHub and not the original email.

nicholst commented 9 years ago

I forgot to add, I like this structure proposed by @proccaserra:

- dependence structure
  - independent structure
  - serially correlated structure
  - compound symmetric structure
  - arbitrarily correlated structure
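To make the four proposed classes concrete, here is a minimal NumPy sketch of the covariance matrix each one implies. The function names and parameters (`sigma2`, `rho`, `acov`) are hypothetical illustrations, not STATO, NIDM or SAS identifiers; "serially correlated" is taken to mean a Toeplitz structure, as in the first comment of this thread.

```python
import numpy as np

def independent(n, sigma2=1.0):
    """Independence: zero covariance between every pair of measurements."""
    return sigma2 * np.eye(n)

def compound_symmetric(n, sigma2=1.0, rho=0.5):
    """Compound symmetry: a single common correlation rho for all pairs."""
    return sigma2 * ((1.0 - rho) * np.eye(n) + rho * np.ones((n, n)))

def serially_correlated(acov):
    """Toeplitz: covariance depends only on the lag |i - j|.

    acov[k] is the covariance between measurements k steps apart.
    """
    acov = np.asarray(acov, dtype=float)
    n = acov.size
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return acov[lags]

def arbitrarily_correlated(a):
    """Unstructured: no restriction beyond being a valid covariance matrix.

    Built here as A @ A.T, which is always symmetric positive semi-definite.
    """
    a = np.asarray(a, dtype=float)
    return a @ a.T

ind = independent(3)
cs = compound_symmetric(3, sigma2=2.0, rho=0.25)
toep = serially_correlated([1.0, 0.5, 0.25])
unstr = arbitrarily_correlated([[1.0, 0.0, 0.0],
                                [0.3, 1.0, 0.0],
                                [-0.2, 0.7, 1.0]])
```

Independence and compound symmetry are both special cases of the Toeplitz form, and all three are special cases of the unstructured form, which is one argument for arranging them as subclasses of a common 'dependence structure' class.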

proccaserra commented 9 years ago

@nicholst, @cmaumet: I'd like to pick up from your last sentence. Indeed, working from the different types of models, defining what an additive model is seems a worthwhile track. It would require keeping an eye on the compatibility with other statistical models and assessing the 'portability' of the entities we'll manipulate. Possibly relevant: http://www.gamlss.org/

nicholst commented 9 years ago

@proccaserra, OK, that's good. For us in NIDM, we can generally say we're working within the scope of the general linear model. (Confusingly, most of modern statistics uses the abbreviation "GLM" to refer to Generalised Linear Models, which is a--um--generalisation of the general linear model to account for distributions other than Gaussian.)

About http://www.gamlss.org/, indeed, that's of the ilk of the g(X;beta) example above... GAM = g(X;beta) has an additive form, \sum_j g(X_j)*beta_j; LSS: L=Location is determined by g(X;beta), while scale and shape are aspects of the (additive) error, its distribution can be something other than Gaussian. (I hadn't seen GAMLSS before... it's the most over-arching generalisation I've seen so far.)

cmaumet commented 9 years ago

@proccaserra, @nicholst, thank you for starting up the discussion.

I like the idea of defining the models we are working on. @proccaserra, would that help if we come up with a definition for "additive model" and "general linear model" as a starting point?

I also like the idea of using "dependence structure" associated with an "error model" instead of "error dependence".

Just for the record, you will find below an example of nidm:ErrorModel entity:

niiri:error_model_id a prov:Entity , nidm:ErrorModel ;
    nidm:hasErrorDistribution nidm:GaussianDistribution ;
    nidm:errorVarianceHomogeneous "true"^^xsd:boolean ;
    nidm:varianceSpatialModel nidm:SpatiallyLocal ;
    nidm:hasErrorDependence nidm:IndependentError ;
    nidm:dependenceSpatialModel nidm:SpatiallyLocal .

proccaserra commented 9 years ago

Hello @cmaumet , @nicholst .

We will be pushing additions to STATO shortly and wanted to touch base and follow up on our discussion. Also, looking at the recent commits, we could see that STATO definitions were reused, but why not use the STATO entities directly? Was there any resolution regarding the discussion about the identification scheme that we may have missed? Best

cmaumet commented 9 years ago

Hello @proccaserra,

Thanks for the update! I am not sure which recent commits you are referring to... But, just to clarify, we had a discussion in December within the NIDM group and agreed to give alphanumeric identifiers a try. This means that we will be able to re-use STATO terms more easily, but I still need to work on the implementation. I am planning to work on this in the next few days.

It would be great if we could get the "dependence structure" terms in STATO (so that we do not have to create them in NIDM) before your next push. Could you let us know how you would like to proceed? Would it help if we provide a first definition for each of the dependence structure terms we are using?

(I also have a couple of other terms in NIDM that I think would better "live" in STATO, I will open separate issues for those.)

nicholst commented 9 years ago

+1 for @cmaumet's update and request for dependence terms.

proccaserra commented 9 years ago

@cmaumet, brilliant news. Thanks for the quick reply. WRT dependence structure, it would be a great help indeed if you could list the elements you need (if you have textual definitions, even better). We have so far used the SAS material provided by @nicholst in this issue, but the more information, the merrier. Best.

nicholst commented 9 years ago

@proccaserra I'm happy to hear you've adopted the SAS terms, if only because of their encyclopaedic completeness. That said, I'd still like to go through them. Where is the current version with these included? I couldn't find it on GitHub.

nicholst commented 9 years ago

@proccaserra: With @cmaumet & others, we're getting ready to make a release candidate for http://github.com/incf-nidash/nidm and are trying to re-use & reference as many STATO terms as possible. I was just about to link to the SAS-derived covariance terms, but realised some were missing. The one I was looking for is in SAS as Unstructured; can you add that to STATO? A definition would be something like "A covariance structure where no restrictions are made on the covariance between any pair of measurements".

Also, we model 'the state of no covariance', i.e. an independence assumption. I see there is a STATO term for a test of independence, but not an assumption of independence (I know we've already had the discussion about hypotheses vs. assumptions #39). Either as part of the collection of covariance structures, or as a stand-alone "assumption", the concept of independence should be in there somewhere.

Here are two definitions: As a covariance structure, "A covariance structure specifying that all pairs of measurements have zero correlation". As a more general statement, "Independence is an assumption on multivariate data that asserts the probability distribution function of the data is factorable into one term per data element. Informally, observing one data element gives you no information about any other data element."
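The "factorable into one term per data element" definition can be checked numerically for the Gaussian case. The sketch below uses made-up numbers and hand-written helper functions (`mvn_logpdf`, `normal_logpdf`, both hypothetical names): with a diagonal covariance the joint log-density equals the sum of the univariate log-densities, and adding a non-zero covariance breaks that equality.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, written out explicitly."""
    x = np.asarray(x, dtype=float)
    d = x - np.asarray(mean, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return float(-0.5 * (x.size * np.log(2.0 * np.pi) + logdet + quad))

def normal_logpdf(x, mean, var):
    """Element-wise log-density of univariate Gaussians."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([0.3, -1.2, 2.0])
mean = np.zeros(3)
variances = np.array([1.0, 2.0, 0.5])

# Sum of log marginals = log of the product of one term per data element.
sum_of_marginals = float(np.sum(normal_logpdf(x, mean, variances)))

# Independent case: diagonal covariance, so the joint density factorises.
joint_independent = mvn_logpdf(x, mean, np.diag(variances))

# Dependent case: same variances, but elements 0 and 1 now covary.
cov_dependent = np.diag(variances)
cov_dependent[0, 1] = cov_dependent[1, 0] = 0.8
joint_dependent = mvn_logpdf(x, mean, cov_dependent)

factorises_when_independent = bool(np.isclose(joint_independent, sum_of_marginals))
factorises_when_dependent = bool(np.isclose(joint_dependent, sum_of_marginals))
```

For Gaussian data the two definitions coincide, since zero covariance implies independence; for other distributions zero correlation is the weaker statement, which is a reason to keep the covariance-structure term and the general assumption as separate entries.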

Also, we model "exchangeability", which is another fundamental statistical concept that feels *like* a "compound symmetric" covariance structure (all pairs of correlations equal), but actually is more like independence, as it refers to the whole distribution (not just covariance). It could be defined as: "Exchangeable: An assumption on multivariate data that asserts the probability distribution function of the data is invariant with respect to permutation of the data. It is a generalisation of the compound symmetry covariance structure. See http://en.wikipedia.org/wiki/Exchangeable_random_variables"
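The permutation-invariance in this definition can be sketched numerically for a Gaussian with equal means. The values, the permutation and the helper `mvn_logpdf` below are illustrative assumptions only. Under a compound symmetric covariance the density is unchanged when the data are reordered; under a serially correlated (here AR(1)-type Toeplitz) covariance it is not.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, written out explicitly."""
    x = np.asarray(x, dtype=float)
    d = x - np.asarray(mean, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return float(-0.5 * (x.size * np.log(2.0 * np.pi) + logdet + quad))

n, rho = 3, 0.5
mean = np.zeros(n)                      # equal means for every element
x = np.array([0.3, -1.2, 2.0])
x_permuted = x[[2, 0, 1]]               # the same values in a different order

# Compound symmetry: every pair of elements has the same correlation.
cov_cs = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

# Serial correlation: correlation decays with the lag |i - j|.
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
cov_serial = rho ** lags

cs_invariant = bool(np.isclose(mvn_logpdf(x, mean, cov_cs),
                               mvn_logpdf(x_permuted, mean, cov_cs)))
serial_invariant = bool(np.isclose(mvn_logpdf(x, mean, cov_serial),
                                   mvn_logpdf(x_permuted, mean, cov_serial)))
```

This only covers the Gaussian case, where exchangeability reduces to equal means plus compound symmetry; the general definition constrains the whole distribution, which is why it is proposed as a separate term rather than as another covariance structure.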