StructuralEquationModels / StructuralEquationModels.jl

A fast and flexible Structural Equation Modelling Framework
https://structuralequationmodels.github.io/StructuralEquationModels.jl/dev/
MIT License

Variables/Parameters/Observations terminology & API Cleanup #199

Open alyst opened 6 months ago

alyst commented 6 months ago

As a part of #193 I already made some changes, so I wanted to get the feedback from maintainers about it. Plus, there are a few other changes in the same direction that I can integrate into #193, so I wanted to mention them here too.

  1. Parameters. Sometimes they are called parameters, sometimes identifiers (in the ParTable). I propose to change it into param (intuitively understandable, but still short):
    • param in the ParTable
    • params() to get the vector of parameters
    • nparams() to get the number of parameters (called n_par() now)
  2. Variables. Sometimes called vars, sometimes colnames, sometimes nodes. Observed variables are sometimes called observed, sometimes manifested. I propose to consolidate into vars (short, but intuitive), which could be observed (more intuitive than manifested) or latent:
    • vars() to get the vector of variables from ParTable, RAMMatrices (matching the order of A columns)
    • nvars() to get the number of variables
    • observed_vars() to get the observed variables, matching the order of rows/cols in obs_cov and the rows of RAMMatrices.F. Alternatively, it could be obs_vars(), which would match obs_cov() and obs_mean() (if observed_vars is chosen, then obs_cov also needs to be renamed to observed_cov for consistency).
    • nobserved_vars() to get the number of observed vars (replaces n_man, which in this short form is a little bit confusing).
    • latent_var_indices()/observed_var_indices() to get the indices of vars() that match the latent/observed variables (the i-th index of observed_var_indices() is for the i-th variable of observed_vars())
    • latent_vars() is a shortcut to vars()[latent_var_indices()]
    • Also, in case of missing data, I propose to use measured/missing terms (now it uses observed/missing, but observed clashes with observed/latent), and nmeasured_vars()/nmissing_vars() to get their counts
  3. Observations. Also referred to as rows. To disambiguate from observed_vars, I propose to refer to them as samples (row is confusing because SEM operates on so many matrices).
    • samples to access the individual samples (sometimes referred to as rows or rowwise).
    • nsamples() is the number of samples (n_obs() now)
  4. Relations (between the variables, i.e. <- or <->). Currently the ParTable stores them in the param_type column, which is confusing, because sometimes it is constant.

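
To make items 1 and 2 concrete, here is a minimal sketch of how the proposed accessors could relate to each other. ToySpec and its fields are hypothetical stand-ins invented for illustration; they are not the package's ParTable or RAMMatrices, and the method bodies are assumptions rather than the actual implementation.

```julia
# Toy stand-in for a model specification (hypothetical, not a package type).
struct ToySpec
    params::Vector{Symbol}         # parameter labels
    vars::Vector{Symbol}           # all variables, e.g. in the column order of A
    observed_vars::Vector{Symbol}  # observed variables, e.g. in the row order of F
end

# Parameters
params(spec::ToySpec) = spec.params
nparams(spec::ToySpec) = length(params(spec))

# Variables
vars(spec::ToySpec) = spec.vars
nvars(spec::ToySpec) = length(vars(spec))
observed_vars(spec::ToySpec) = spec.observed_vars
nobserved_vars(spec::ToySpec) = length(observed_vars(spec))

# The i-th index locates the i-th element of observed_vars(spec) inside vars(spec).
observed_var_indices(spec::ToySpec) =
    [findfirst(==(v), vars(spec)) for v in observed_vars(spec)]

# Latent variables are the ones that are not observed, kept in vars() order.
latent_var_indices(spec::ToySpec) =
    [i for (i, v) in enumerate(vars(spec)) if v ∉ observed_vars(spec)]
latent_vars(spec::ToySpec) = vars(spec)[latent_var_indices(spec)]

# Example: one latent factor f1 measured by x1 and x2.
spec = ToySpec([:l1, :l2, :theta], [:x1, :f1, :x2], [:x2, :x1])
```

In this sketch, vars(spec)[observed_var_indices(spec)] reproduces observed_vars(spec) even when the two orderings differ, which is the invariant described for observed_var_indices() above.
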
Maximilian-Stefan-Ernst commented 6 months ago

I think it's very nice to unify this, we only have to decide which names to land on.

  1. I never liked identifier, we should definitely get rid of it. I personally prefer full words over abbreviations, so I would go for parameter, parameters() and nparameters(). I also think this is more in line with general Julia style, reading the docs "avoid abbreviation ... as it becomes difficult to remember whether and how particular words are abbreviated.".
  2. Here the no-abbreviation rule is already annoying, because the names get quite long... so I would vote that we abbreviate variables to vars and covariance to cov, because I believe everyone will understand these, and everything else is written out (meaning we have to change obs_cov to observed_cov, but I think that's actually nice because it may not be immediately obvious that obs refers to observed).
    • n_man is very confusing indeed^^
    • I agree that we should not use observed in the context of missing to avoid confusion.
  3. Very nice change.
  4. So you would rename parameter_type to relation?

@aaronpeikert @brandmaier I think it's important to have these things in line with how the community usually refers to them - so maybe you can also have a look. Andreas, do you like the missing / measured naming or would you use something different?

See also #158 #159.

aaronpeikert commented 6 months ago

I quite like the suggestions. What do you think about the idea to deal with abbreviations by consistently declaring both abbreviation and fully spelled out version, e.g.:

const cov = covariance

I don't like to use observed/latent (because of e.g. observed_cov), and would prefer manifest/latent.

alyst commented 6 months ago

> consistently declaring both abbreviation and fully spelled out version, e.g.: const cov = covariance

cov exists in StatsBase.jl. I personally don't think there's a need to be more explicit than standard Julia packages. Aliases may create confusion: which one do you use in the examples, which one do you actually use in the package code, etc. For the users it also may not be 100% obvious which is an alias of which. One has to look it up in the docs, but then it is just as easy to document the short version right away.
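
As a small illustration of what such an alias does in Julia (a sketch only: the name covariance is hypothetical and not defined by any package mentioned here, and the direction is reversed relative to the quoted example because cov already exists in the Statistics standard library):

```julia
using Statistics  # standard library; provides cov

# Hypothetical long-form alias for the existing short name.
const covariance = cov

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

# Both names are bound to the very same function object ...
same_object = covariance === cov
# ... so they necessarily return identical results.
same_result = covariance(x, y) == cov(x, y)
# The alias still reports the original name, so printed output and
# stack traces mention cov even where the code says covariance.
alias_name = nameof(covariance)
```

The alias costs nothing at runtime, but the last line hints at the documentation issue raised above: the function only ever identifies itself under one of the two names.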

> I don't like to use observed/latent (because of e.g. observed_cov), and would prefer manifest/latent.

How do you define observed_cov? Is it the covariations based on the classical formula using only the observed data matrix without missing values?

SemObservedMissing uses multivariate normal EM to calculate the implicit covariations between the manifested variables. AFAIU these covariations are not exposed through the public API right now, but I think it makes perfect sense to expose them, be it as observed_cov or manifested_cov. Actually, in my staging branch I have modified SemObservedMissing so that it can be used in SemML the same way as SemObservedData (SemML relies on the data object implementing the SemData interface), and it provides the EM-based covariations via observed_cov.

And, potentially, there could be other methods for estimating the covariations of the observed (manifested) variables. So I think it makes sense to have a generic SemData method that provides these covariations (and another method for the means).
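
A rough sketch of what such a generic interface could look like. All type names, field names, and the observed_mean accessor below are hypothetical stand-ins chosen for illustration; the real SemData subtypes may be structured quite differently, and no EM step is implemented here (the second type just stores estimates computed elsewhere).

```julia
using Statistics  # standard library; provides cov and mean

# Hypothetical stand-in for an abstract data interface.
abstract type ToySemData end

# Complete data: samples in rows, observed variables in columns.
struct ToyCompleteData <: ToySemData
    data::Matrix{Float64}
end

# Estimates obtained some other way, e.g. by an EM run for missing data.
struct ToyPrecomputed <: ToySemData
    covmat::Matrix{Float64}
    meanvec::Vector{Float64}
end

# Generic interface: every data type provides the same two accessors.
observed_cov(d::ToyCompleteData) = cov(d.data)                   # classical sample covariance
observed_mean(d::ToyCompleteData) = vec(mean(d.data; dims = 1))  # column means
observed_cov(d::ToyPrecomputed) = d.covmat
observed_mean(d::ToyPrecomputed) = d.meanvec

# A consumer (think of a loss function) relies only on the interface.
total_variance(d::ToySemData) =
    sum(observed_cov(d)[i, i] for i in axes(observed_cov(d), 1))

complete = ToyCompleteData([1.0 2.0; 3.0 6.0; 5.0 10.0])
precomputed = ToyPrecomputed(observed_cov(complete), observed_mean(complete))
```

Here total_variance works unchanged for both data types, which is the point of a shared accessor: how the covariations were estimated stays an implementation detail of each type.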

One solution is to have both observed_cov() and manifested_cov() -- but I expect that observed_cov() will have very limited usage both internally and externally.

aaronpeikert commented 6 months ago

I agree cov is pretty clear. I'm just saying that if we really want to use an abbreviation, we could define an alias for the long form, but of course we should always use the short form. But we don't have to, just a thought.

> How do you define observed_cov? Is it the covariations based on the classical formula using only the observed data matrix without missing values?

Yes. For me it is about disambiguating it from the model-implied covariance, with "observed" in the sense that it is what the data provide. So we would have manifest/latent and observed/implied. However, your point about how we deal with missing data, using the implied covariance of the unrestricted model as the "observed" cov, blurs this line.

Maximilian-Stefan-Ernst commented 6 months ago

I'm also not so happy with the aliases because I think it will create too much confusion...

I think "observed covariance matrix" is kind of the standard SEM textbook lingo in normal ML estimation (without missing data), so I generally prefer observed_cov over manifest_cov, but I would be happy with either. I believe whether we call the EM-based covariance observed_cov should be decided when you do the PR with those changes, because I would have to look at the code first to see how SemData etc. is implemented.