StevenGolovkine / FDApy

Python Package for Functional Data Analysis
https://fdapy.readthedocs.io/en/latest/
MIT License

Add FPCA support for data with missing values #10

Closed Wieske closed 9 months ago

Wieske commented 1 year ago

Issue Description

The current package does not support missing values (encoded as NaN) in the data when using UFPCA or MFPCA.

Issue Details

I tried to use the MFPCA function of this package to extract features from longitudinal survival outcomes, similar to this paper: [1]. In survival data the observations have different lengths (because samples are only collected until a certain event happens), so the DenseFunctionalData representation contains a lot of missing values, which are encoded as NaN. The paper [1] uses the MFPCA R package [2], which accepts missing values encoded as NA. Unfortunately, the current version of the FDApy package does not support this.

Proposed Solution

I have already looked through a part of the code and found a few places where I think support for missing values in the data can be easily introduced:

  1. In the mean calculation use np.nanmean instead of np.mean
  2. In the covariance estimation ignore missing values in the dot product and divide by the number of observations for each point, similar to this code snippet from [2]:
      for (i in seq_len(I)) {
        obs.points = which(!is.na(Y[i, ]))
        cov.count[obs.points, obs.points] = cov.count[obs.points, obs.points] + 1
        cov.sum[obs.points, obs.points] = cov.sum[obs.points, obs.points] + tcrossprod(Y.tilde[i, obs.points])
      }
  3. In the scores calculation of UFPCA, calculate the scores separately for each observation, using only the x and y values that are not missing for that observation
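
To make items 1 and 2 concrete, here is a minimal NumPy sketch of a NaN-aware mean and a pairwise-complete covariance. It mirrors the R loop above but is not FDApy code: the function name `nan_mean_and_covariance` and the assumed layout of `Y` (one curve per row, one sampling point per column) are hypothetical and chosen only for illustration.

```python
import numpy as np


def nan_mean_and_covariance(Y):
    """Pointwise mean and pairwise-complete covariance of curves with NaNs.

    Y is assumed to have shape (n_obs, n_points), with NaN marking a
    missing value (hypothetical layout, not the FDApy data model).
    """
    Y = np.asarray(Y, dtype=float)
    observed = ~np.isnan(Y)  # True where a value was recorded

    # Item 1: ignore NaNs when averaging over curves.
    mean = np.nanmean(Y, axis=0)

    # Item 2: missing entries contribute 0 to the sum of cross-products ...
    centered = np.where(observed, Y - mean, 0.0)
    cov_sum = centered.T @ centered

    # ... and each entry is divided by the number of curves observed
    # at both sampling points.
    obs_float = observed.astype(float)
    counts = obs_float.T @ obs_float
    with np.errstate(invalid="ignore", divide="ignore"):
        cov = np.where(counts > 0, cov_sum / counts, np.nan)
    return mean, cov, counts
```

Entries of `cov` for pairs of points that no curve observes together are left as NaN, so a caller would still have to smooth or impute them before an eigendecomposition.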

With these three changes I could calculate the scores without errors or NaN values, but I have not done further testing yet.
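
The per-observation score calculation from item 3 could look roughly like the sketch below, which integrates each centred curve against the eigenfunctions over its observed points only, using the trapezoidal rule. The function name `scores_observed_only` and all argument names are hypothetical, and the eigenfunctions are assumed to be already estimated on the common grid `argvals`.

```python
import numpy as np


def scores_observed_only(Y, mean, eigenfunctions, argvals):
    """Scores by numerical integration over the observed points of each curve.

    Assumed shapes (hypothetical, for illustration only):
    Y: (n_obs, n_points) with NaN for missing values,
    mean: (n_points,), eigenfunctions: (n_components, n_points),
    argvals: (n_points,), sorted in increasing order.
    """
    Y = np.asarray(Y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    eigenfunctions = np.asarray(eigenfunctions, dtype=float)
    argvals = np.asarray(argvals, dtype=float)

    scores = np.full((Y.shape[0], eigenfunctions.shape[0]), np.nan)
    for i in range(Y.shape[0]):
        obs = ~np.isnan(Y[i])
        if obs.sum() < 2:
            continue  # fewer than two points: nothing to integrate
        t = argvals[obs]
        residual = Y[i, obs] - mean[obs]
        integrand = eigenfunctions[:, obs] * residual  # (n_components, n_observed)
        # Trapezoidal rule over the observed sampling points only.
        widths = np.diff(t)
        scores[i] = np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * widths, axis=1)
    return scores
```

One caveat of this approach: a curve observed on only part of the domain is integrated over that part alone, so its scores are shrunk relative to a fully observed curve with the same shape.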

I don't have any experience contributing to packages (yet), but I can try to make a pull request to add this functionality if you think it would benefit more people. Let me know what you think of this issue and if you have any tips for the implementation.

References

[1] Functional survival forests for multivariate longitudinal outcomes: Dynamic prediction of Alzheimer’s disease progression
[2] MFPCA R version

StevenGolovkine commented 11 months ago

Hi,

Due to an implementation choice, the DenseFunctionalData class is not meant to represent functional data with missing values. For that, you should use the IrregularFunctionalData class.

However, the covariance estimation is still not implemented for IrregularFunctionalData (the estimation of the mean function should work). I think the covariance is the only missing piece to run a UFPCA for irregular functional data. You can then use the PACE algorithm to estimate the scores.
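
For readers unfamiliar with PACE (principal analysis by conditional expectation), the core score formula for one curve is xi_k = lambda_k * phi_k^T * Sigma_Y^{-1} * (y - mu), where everything is evaluated at that curve's own observation points and Sigma_Y is the covariance of the noisy observations. The NumPy sketch below illustrates that formula only; it is not the FDApy implementation, and the function name `pace_scores` and its arguments are hypothetical.

```python
import numpy as np


def pace_scores(y_obs, mean_obs, eigenvalues, eigenfunctions_obs, noise_variance):
    """Conditional-expectation scores for a single, irregularly sampled curve.

    Assumed shapes (hypothetical, for illustration only):
    y_obs, mean_obs: (m,) values at the m observed points of this curve,
    eigenvalues: (K,), eigenfunctions_obs: (K, m) evaluated at those points,
    noise_variance: scalar measurement-error variance.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    mean_obs = np.asarray(mean_obs, dtype=float)
    lam = np.asarray(eigenvalues, dtype=float)
    phi = np.asarray(eigenfunctions_obs, dtype=float)

    # Covariance of the noisy observations:
    # sum_k lam_k * phi_k phi_k^T + noise_variance * I
    sigma_y = (phi.T * lam) @ phi + noise_variance * np.eye(phi.shape[1])

    # xi_k = lam_k * phi_k^T Sigma_Y^{-1} (y - mu), for all k at once.
    residual = y_obs - mean_obs
    return lam * (phi @ np.linalg.solve(sigma_y, residual))
```

Because each curve only needs its own observation points, this works for curves of different lengths, provided the mean, covariance and noise variance have been estimated beforehand.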

If you want, you can create a pull request to implement the covariance estimation for IrregularFunctionalData.

StevenGolovkine commented 9 months ago

The implementation is done, and FPCA with IrregularFunctionalData objects should now work. I am closing the issue.