Does Step Mix Work with Time Series Data?

Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.

https://stepmix.readthedocs.io/en/latest/index.html

MIT License

Does Step Mix Work with Time Series Data? #62

Open imessien opened 2 months ago

imessien commented 2 months ago

Hi, love this project! Does StepMix work with time series/longitudinal data? I have a dataset of longitudinal trajectories with scattered missing observations, and I'm building a two-step hybrid model.

In the first step, the repeated measurements serve as indicators for a latent profile analysis (LPA). You conduct a separate LPA for each variable and plot the conditional means from each LPA over time. The plot shows the estimated mean of each variable within a latent LPA class at each measurement point, and latent class membership is interpreted from the trajectory of these mean estimates over time. This identifies two classes for each variable.

In the second stage, the estimated class memberships for each variable (the class with the highest posterior probability from the first-stage LPA) serve as indicators for a latent class analysis (LCA). In this case, there are five categorical indicator variables from the LPAs. You can then plot the conditional probabilities from the LCA, which represent the likelihood of belonging to a particular LPA trajectory class given LCA class membership. The meaning of the LCA classes is determined by the composition of LPA trajectory classes within each LCA class. The second-stage LCA does not have a time-dependent class-switching problem, because its indicators are measure-specific LPA trajectory profiles. It is not like running a separate LCA at each time point.

I'm currently comparing whether wide format or long format works better. Also, according to the documentation, are gaussian_nan (first snippet) and continuous_nan (second snippet) essentially the same in StepMix? Should I be using something else? Are there any other hyperparameters that would be useful here?

sachaMorin commented 2 months ago

You may be interested in the discussions here and here.

In short, StepMix does not directly support time series data, but RMLCA is possible.

I'm not sure I follow the specifics of your plan, but what you are suggesting sounds like a 3-step estimator with hard assignments.

Measurement data (X): you would need to structure your n respondents with k repeated measurements as an n x k data frame (i.e., wide format).
Structural data (Y): you would need to structure your n respondents with m integer-encoded categorical indicators as an n x m data frame (also wide format).

Then your estimator for two latent classes would look something like:

from stepmix.stepmix import StepMix

# Three-step estimation with hard (modal) class assignments
model = StepMix(n_components=2, measurement="continuous_nan", structural="categorical", n_steps=3, assignment="modal")
model.fit(X, Y)

continuous_nan is an alias for gaussian_diag_nan. It's a good place to start.

imessien commented 2 months ago

I see now. My original hyperparameters were:

measurement="gaussian_nan", 
structural="categorical", 
n_steps=3, 
verbose=1,
max_iter=100000, #Run it more times because had an 
random_state=123,
n_init=10,
#assignment='soft', # preserves the uncertainty in class membership and its continuous nature
abs_tol=1e-6, # Doesn't work without this because it's not converging correctly

Sorry, let me be clearer about my problem. For the LPA model, I'm using longitudinal data from ROSMAP (the Religious Orders Study and Rush Memory and Aging Project) to analyze cognitive and physical decline trajectories in older adults. ROSMAP collects extensive data on cognitive function, physical performance, and various biomarkers to investigate factors associated with aging and neurodegenerative diseases. It has a lot of missing observations. I want to use an LPA model to create high and low latent components.

The data doesn't like continuous_nan at all, so I used gaussian_nan and it worked, thanks! I'll play around with the hyperparameters some more, but this is working now.

How do I get around this error:

RuntimeWarning: invalid value encountered in divide means[:, i] /= resp_sums[i]

What about incorporating my demographic and baseline values (structural data) as covariates rather than as structural categorical variables? Doesn't setting them up as structural categorical just force them to act like covariates?

Appreciate your help.

FelixLaliberte commented 2 months ago

Hello,

Could you please provide more details on how and when the error occurs?

Regarding your question about covariate scales, please note that you should set structural = 'covariate' for all types of covariates. Additionally, covariates should be treated as continuous variables, which means you need to create dummies for binary or categorical covariates.

Also, please be aware that in its current version, StepMix does not support missing values on covariates.
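As a rough sketch of the preparation described above, assuming pandas is available: the covariates can be dummy-coded and rows with missing covariates filtered out before fitting. The column names (`age`, `sex`, `education`) and values are made up for illustration, and the commented StepMix call only reuses the `structural="covariate"` option mentioned above. `X` stands for a measurement data frame that is not defined here.

```python
import numpy as np
import pandas as pd

# Hypothetical covariate table: one row per respondent (wide format).
covariates = pd.DataFrame(
    {
        "age": [71.0, 80.5, np.nan, 77.2],
        "sex": ["F", "M", "F", "M"],
        "education": ["low", "high", "mid", "low"],
    }
)

# Dummy-code the categorical columns; drop_first removes one redundant level each.
Z = pd.get_dummies(
    covariates, columns=["sex", "education"], drop_first=True, dtype=float
)

# Covariates may not contain missing values, so keep only complete rows.
complete = Z.notna().all(axis=1)
Z_complete = Z.loc[complete]

print(Z_complete)

# The measurement data would need the same row filter so both inputs align:
# from stepmix.stepmix import StepMix
# model = StepMix(n_components=2, measurement="continuous_nan",
#                 structural="covariate", n_steps=1)
# model.fit(X.loc[complete], Z_complete)
```

Dropping incomplete rows is only one option. Imputing the covariates beforehand would keep more respondents, at the cost of an extra modeling choice.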

imessien commented 1 month ago

I resolved my initial issue with StepMix by keeping my longitudinal data in long format rather than wide format. The wide format was problematic because each instance had a different number of observed time points. With so much missing data represented as NaN, the wide format breaks the model and the plotting function, while the long format simply omits those data points. All the missing values in the wide format caused the "RuntimeWarning: invalid value encountered in divide" error. However, this only worked for the measurement data, because the structural data has to be in wide format. How can I match up the long-format measurement data with my categorical structural data? Specifically, I'm getting a MemoryError during matrix multiplication:

 MemoryError: Unable to allocate 12.3 TiB for an array with shape (4249, 399238996) and data type float64

This error suggests that StepMix is trying to create an extremely large array, likely due to how the long-format measurement data is being combined with the structural data. I'm looking for advice on how to efficiently handle this combination of longitudinal measurement data and categorical structural data in StepMix. Any ideas?

FelixLaliberte commented 1 month ago

Hello,

StepMix does not support long-format data. You should always use a wide-format dataset when working with StepMix. How many measurement time points do you have? It might be preferable to group the data (e.g., by weeks or months) over time (in wide format) to minimize missing data.

That being said, StepMix may not be the most suitable tool for this type of data. There could be specialized packages for this type of analysis, but we are not aware of any Python packages that handle it.
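The suggestion above to group measurements over time and work in wide format could look roughly like the following, assuming pandas is available. The column names (`id`, `years_since_baseline`, `score`), the values, and the one-year bin width are all hypothetical, so the binning rule would need to be adapted to the actual visit schedule.

```python
import pandas as pd

# Hypothetical long-format data: one row per (respondent, visit).
long_df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 3],
        "years_since_baseline": [0.0, 1.1, 2.9, 0.2, 2.1, 1.0],
        "score": [28.0, 27.0, 25.0, 30.0, 29.0, 22.0],
    }
)

# Group irregular visit times into coarse bins (here: nearest whole year).
long_df["time_bin"] = long_df["years_since_baseline"].round().astype(int)

# One row per respondent, one column per time bin.
# Visits falling in the same bin are averaged; unobserved cells become NaN.
X = long_df.pivot_table(
    index="id", columns="time_bin", values="score", aggfunc="mean"
)
X.columns = [f"score_t{t}" for t in X.columns]

print(X)
print(X.isna().mean())  # share of missing values per time bin
```

Coarser bins give fewer columns and fewer NaN cells, but they also blur the timing of each measurement, so the bin width is a trade-off rather than a fixed rule.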

sachaMorin commented 1 month ago

Regarding the huge matrix: StepMix internally uses an explicit one-hot encoding in the computations. This means the size of the matrix will scale with the number of categories as well as the number of variables.

Your categories should be integer-encoded starting at 0 (e.g., 0, 1, 2 if you have 3 classes). The size of your array suggests either a bug in the way you represent categories or a very high number of categorical variables. If the number of variables is the issue, Felix's suggestion to change the time resolution may help.
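To make the two points above concrete, here is a small sketch, assuming pandas is available. It first checks that the shape in the MemoryError really corresponds to about 12.3 TiB of float64 values, then re-encodes hypothetical structural columns (`site`, `education`) as integers starting at 0. The one-hot width of "largest code plus one" is an assumption about how the encoding scales, not a statement about StepMix internals.

```python
import pandas as pd

# The array shape and dtype reported in the MemoryError above.
rows, cols, bytes_per_float64 = 4249, 399_238_996, 8
tib = rows * cols * bytes_per_float64 / 2**40
print(f"{tib:.1f} TiB")

# Hypothetical structural data whose raw labels are not compact codes.
Y_raw = pd.DataFrame(
    {
        "site": [1001, 2040, 1001, 3077],
        "education": ["low", "high", "mid", "low"],
    }
)

# Re-encode every column as integers 0, 1, ..., n_categories - 1.
# Note: pd.factorize marks missing values with -1.
Y = Y_raw.apply(lambda col: pd.factorize(col, sort=True)[0])

# Assumption: a column's one-hot width is its largest code plus one.
width_raw = int(Y_raw["site"].max()) + 1
width_encoded = int(Y["site"].max()) + 1
print(width_raw, width_encoded)
```

Under that assumption, a single column holding raw identifiers or large numeric labels would be enough to inflate the encoded matrix by orders of magnitude, which is why compact codes matter.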

imessien commented 1 month ago

Thank you for the feedback. I have 7 to 10 measurement points for each instance, but they all differ because they're based on time to event for each instance, which creates a ton of missing values. My question is: how does it work, then, if I have the data structured in long format? I understand it won't work with categorical structural indicators, but how does the initial model work with it? Also, this is across the 5 different variables I'm measuring. I think