GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Fix regularized FPCA with FDataGrid #497

Closed · Ddelval closed 1 year ago

Ddelval commented 1 year ago

The issue

When applying FPCA with regularization to FDataGrid objects, the obtained components are orthogonal with respect to the usual inner product, which is not the inner product that should be used in the regularized setting. As specified in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 178), the components should not be orthogonal in the usual sense; instead, they should fulfill

$$ \int \xi_j(s) \xi_k(s) ds + \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

where $\xi_i$ are the loadings and $D^2$ is the second-order differential operator.

Steps to reproduce

The following code shows the problem:

import skfda
from skfda.misc.operators import LinearDifferentialOperator
from skfda.misc.regularization import L2Regularization
from skfda.preprocessing.dim_reduction import FPCA
from skfda.representation.basis import Fourier

X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
X = X[:300]
y = y[:300]

n_components = 4

fpca_regression = FPCA(
    n_components=n_components,
    regularization=L2Regularization(
        LinearDifferentialOperator(1),
        regularization_parameter=20,
    ),
)

fpca_regression.fit(X, y)

matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
print("Grid regularized:\n", matrix)

grid_points = X.grid_points
X = X.to_basis(Fourier(n_basis=10))

fpca_regression = FPCA(
    n_components=n_components,
    regularization=L2Regularization(
        LinearDifferentialOperator(1),
        regularization_parameter=0.25,
    ),
)
fpca_regression.fit(X, y)

matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
print("Basis regularized:\n", matrix)

Output:

Grid regularized:
 [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
 [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
 [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
 [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
Basis regularized:
 [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
 [-0.0180874   0.79784464 -0.06858017  0.07861877]
 [-0.01324601 -0.06858017  0.74805033  0.09757001]
 [ 0.0030897   0.07861877  0.09757001  0.70994851]]

In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

Current implementation

The relevant extract of the code is the following:

# Discretized identity "basis": one indicator function per grid point
basis = FDataGrid(
    data_matrix=np.identity(n_points_discretization),
    grid_points=X.grid_points,
)
# Penalty matrix of the chosen regularization, evaluated on that basis
regularization_matrix = compute_penalty_matrix(
    basis_iterable=(basis,),
    regularization_parameter=1,
    regularization=self.regularization,
)

basis_matrix = basis.data_matrix[..., 0]
if regularization_matrix is not None:
    basis_matrix += regularization_matrix

# Transform the data: X (I + P)^{-1}
fd_data = np.linalg.solve(
    basis_matrix.T,
    fd_data.T,
).T

# see docstring for more information
final_matrix = fd_data @ np.sqrt(weights_matrix)

# Plain multivariate PCA on the transformed data
pca = PCA(n_components=self.n_components)
pca.fit(final_matrix)
# Map the multivariate components back through W^{-1/2}
self.components_ = X.copy(
    data_matrix=np.transpose(
        np.linalg.solve(
            np.sqrt(weights_matrix),
            np.transpose(pca.components_),
        ),
    ),
    sample_names=(None,) * self.n_components,
)

We can see that the code "matches" the TFG (bachelor's thesis) corresponding to this functionality. In this TFG, it is stated that the problem is equivalent to solving multivariate PCA on the following matrix:

$$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however, that the correct expression would have the inverse of the denominator multiplying the numerator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

$$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.
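For reference, writing that fraction with the inverse of the denominator applied from the right, as argued above, the intended matrix would read

$$ \mathbf{V}=\frac{1}{\sqrt{N}}\,\mathbf{X}\mathbf{W}^{1/2}\left(\mathbf{I}_M + \mathbf{P}_k\right)^{-1}. $$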

fda.usc implementation

The implementation in fda.usc is the following:

 if (lambda > 0) {
    if (is.vector(P)) {
      P <- P.penalty(tt, P)
    }
    dimp <- dim(P)
    if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
      stop("Incorrect matrix dimension P")
    }
    M <- solve(diag(J) + lambda * P)
    Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
  }

where `tt` are the grid points, `J` is the number of grid points, `lambda` is the regularization parameter, and `P` is the penalty matrix (or the vector of coefficients from which `P.penalty(tt, P)` builds it).

Therefore, the implementation in fda.usc is equivalent to the current one and also returns the identity matrix as the inner product matrix. A corresponding issue (moviedo5/fda.usc#6) has been opened in the fda.usc repository.
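To illustrate why both implementations report an identity inner product matrix, here is a minimal NumPy sketch of the transformation they apply, using toy data and a simple second-difference matrix as a stand-in for the real penalty matrix (all names here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, M, lam = 100, 50, 20.0            # samples, grid points, regularization parameter
X = rng.normal(size=(N, M))          # toy discretized data
D2 = np.diff(np.eye(M), 2, axis=0)   # second-difference operator (illustrative penalty)
P = D2.T @ D2                        # penalty matrix

# Both the current implementation and fda.usc compute X (I + lam * P)^{-1}
# and then run plain multivariate PCA on the result.
X_transformed = np.linalg.solve((np.eye(M) + lam * P).T, X.T).T

pca = PCA(n_components=4).fit(X_transformed)

# Plain PCA always returns Euclidean-orthonormal components, so the inner
# product matrix of the loadings is the identity regardless of lam.
print(pca.components_ @ pca.components_.T)   # ~ identity
```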

Proposed solution

Mathematical analysis

Non-regularized FPCA summary

In the usual PCA, the goal is to maximize the variance of the projections onto the principal components, subject to two restrictions. That is to say, maximize

$$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

subject to:

- $\boldsymbol\phi^\intercal \boldsymbol\phi = 1$ (unit norm), and
- $\boldsymbol\phi^\intercal \boldsymbol\phi_j = 0$ for every previously obtained component $\boldsymbol\phi_j$ (orthogonality).
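For completeness, under these constraints the maximization reduces to the standard eigenvalue problem for the sample covariance matrix:

$$ \frac{1}{N}\mathbf X^\intercal \mathbf X \,\boldsymbol\phi = a \boldsymbol\phi, $$

where $a$ is the corresponding eigenvalue.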

Regularized FPCA

The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalize the second derivative, we can define it as:

$$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

- $\|\phi\|^2 = 1$ (unit norm), and
- $\int \phi_i \phi_j + \lambda \int D^2\phi_i \, D^2\phi_j = 0$ for previously obtained components (orthogonality with respect to the penalized inner product).

For simplicity, we will ignore the quadrature for now and assume that the matrix of integration weights is the identity, $\mathbf{W}=\mathbf{Id}$. Then, we can make the following substitutions:

$$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of values of $\phi$ at the grid points, and $\mathbf X$ is the usual data matrix, that is, the functions evaluated at the grid points.

We can also rewrite the orthogonality restriction as:

$$ 0=\int \phi_i \phi_j + \lambda \int D^2\phi_i \, D^2\phi_j = \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i ^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi_i^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi_j $$

In light of these relationships, we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$ and the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Substituting, we obtain the following equation:

$$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

Then the constraints become:

$$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

$$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

Therefore, the problem is equivalent to the usual PCA problem with the following changes:

- the data matrix becomes $\mathbf X (\mathbf L^\intercal)^{-1}$, and
- the resulting components $\boldsymbol \psi$ must be transformed back through $\boldsymbol \phi = (\mathbf L^\intercal)^{-1} \boldsymbol \psi$.

Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can easily be fixed by normalizing them at the end.
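A minimal NumPy sketch of this Cholesky-based approach, assuming $\mathbf W = \mathbf{Id}$ and with `P2` playing the role of the already assembled $\mathbf P_2^\intercal \mathbf P_2$ above (illustrative names, not the actual code in the pull request):

```python
import numpy as np
from sklearn.decomposition import PCA

def regularized_grid_fpca(X, P2, lam, n_components):
    """Sketch of penalized grid FPCA via the Cholesky change of variable (W = Id)."""
    M = X.shape[1]
    L = np.linalg.cholesky(np.eye(M) + lam * P2)       # L L^T = Id + lam * P2
    X_tilde = np.linalg.solve(L, X.T).T                # transformed data: X (L^T)^{-1}
    psi = PCA(n_components=n_components).fit(X_tilde).components_
    phi = np.linalg.solve(L.T, psi.T).T                # undo change of variable: phi = (L^T)^{-1} psi
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)  # renormalize the loadings
    return phi
```

With `lam = 0` this reduces to plain multivariate PCA on the discretized data.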

Progress so far

This Cholesky decomposition approach has been implemented in the attached pull request. Note, however, that the code has not been optimized yet (it still uses `np.linalg.inv` instead of `np.linalg.solve`, for example), nor has it been tested extensively.

Regularization parameter effect

The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in the grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

import matplotlib.pyplot as plt

import skfda
from skfda.misc.operators import LinearDifferentialOperator
from skfda.misc.regularization import L2Regularization
from skfda.preprocessing.dim_reduction import FPCA
from skfda.representation.basis import Fourier

X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
X = X[:300]
y = y[:300]

def fit_plot(ax, title, regularization_parameter=0):
    fpca_regression = FPCA(
        n_components=4,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=regularization_parameter,
        ),
    )
    fpca_regression.fit(X, y)
    fpca_regression.components_.plot(axes=ax)
    ax.set_title(title)
    ax.set_xlabel("")
    ax.set_ylabel("")

fig, ax = plt.subplots(3, 2, figsize=(10, 10))

fit_plot(ax[0, 0], "Grid not regularized")
fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)

grid_points = X.grid_points
X = X.to_basis(Fourier(n_basis=10))

fit_plot(ax[1, 0], "Basis not regularized")
fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)

# Convert back to grid using the same points as before
X = X.to_grid(grid_points)

fit_plot(ax[2, 0], "Grid from basis not regularized")
fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)

plt.show()

[Figure: the six principal component plots produced by the code above.]

As can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

Ddelval commented 1 year ago

In the end, the solution has been to switch to a different eigenequation to calculate the standard grid FPCA. This other equation has the advantage that it can easily be modified to incorporate the penalization.

When we are still dealing with functions, the eigenequation that the principal components ($\xi$) must fulfill is $V\xi= a\xi$, where $V$ is the covariance operator and $a$ is the eigenvalue. However, there are several ways to discretize this equation.

One option, the one explained in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 165), is to estimate $V\xi \approx \mathbf V\mathbf W\hat{\xi}$, where $\mathbf V$ is the covariance matrix, $\mathbf W$ is the diagonal matrix of weights and $\hat{\xi}$ is the vector of discretized values of the principal component. Using this approximation, the eigenequation becomes $\mathbf V\mathbf W\hat{\xi} = a\hat{\xi}$. Then, applying the change of variable $\mathbf u = \mathbf W^{1/2}\hat{\xi}$, we can rewrite it as the symmetric eigenequation $\mathbf W^{1/2}\mathbf V \mathbf W^{1/2} \mathbf u = a \mathbf u$. This is the eigenequation that was being solved before.
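Spelled out, the substitution works as follows:

$$ \mathbf V\mathbf W\hat{\xi} = a\hat{\xi}, \quad \hat{\xi} = \mathbf W^{-1/2}\mathbf u \;\Longrightarrow\; \mathbf V\mathbf W^{1/2}\mathbf u = a\mathbf W^{-1/2}\mathbf u \;\Longrightarrow\; \mathbf W^{1/2}\mathbf V\mathbf W^{1/2}\mathbf u = a\mathbf u. $$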

Another option is to see that the variance of the projections (the quantity to maximize) has, up to the $1/N$ factor, the form

$$ \mathrm{var}\left(\int X\xi\right) \approx \hat \xi^\intercal \mathbf W \mathbf X^\intercal \mathbf X \mathbf W \hat \xi $$

and the unit norm restriction is $\hat \xi^\intercal \mathbf W \hat \xi = 1$. This can be rewritten as maximizing $\langle\hat \xi, \mathbf W \mathbf X^\intercal \mathbf X \mathbf W \hat\xi \rangle$ subject to $\langle\hat \xi, \mathbf W \hat \xi \rangle = 1$.

Then, using the properties in Appendix A5 of Ramsay and Silverman, this corresponds to the generalized eigenequation $\mathbf W \mathbf X^\intercal \mathbf X \mathbf W \mathbf u = a \mathbf W\mathbf u$, where $a$ is the eigenvalue and $\mathbf u$ is the desired eigenvector.

Now, consider the Cholesky decomposition $\mathbf L \mathbf L^\intercal = \mathbf W$ and the change of variable $\mathbf u = (\mathbf L^\intercal) ^{-1}\mathbf v$. Then, the eigenequation becomes $\mathbf L^{-1} \mathbf W \mathbf X^\intercal \mathbf X \mathbf W (\mathbf L^{-1})^\intercal \mathbf v = a \mathbf v$. This is the eigenequation that is being solved now. After noticing the symmetry, one can see that this problem is equivalent to the multivariate problem, but using the matrix $\mathbf X \mathbf W (\mathbf L^\intercal)^{-1}$ as the data matrix.

Then, when moving on to the penalized problem, as already stated in this issue, it suffices to change the decomposed matrix so that $\mathbf L \mathbf L^\intercal = \mathbf W + \mathbf P$, where $\mathbf P$ is the penalization matrix computed for the linear operator $P$; in other words, $\boldsymbol \phi^\intercal \mathbf P \boldsymbol \phi = \| P(\phi)\|^2$.
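A minimal NumPy sketch of this final formulation, with illustrative names and assuming a centered data matrix `X`, the quadrature weight matrix `W`, and the penalty matrix `P` (not the actual code in the pull request):

```python
import numpy as np

def regularized_grid_fpca_components(X, W, P, n_components):
    """Solve W X^T X W u = a (W + P) u through the Cholesky factor of W + P."""
    L = np.linalg.cholesky(W + P)               # L L^T = W + P
    data = np.linalg.solve(L, (X @ W).T).T      # multivariate data matrix: X W (L^T)^{-1}
    # Plain PCA on `data`, here via SVD: the rows of Vt are the eigenvectors v_j.
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    v = Vt[:n_components]
    u = np.linalg.solve(L.T, v.T).T             # undo the change of variable: u = (L^T)^{-1} v
    return u / np.linalg.norm(u, axis=1, keepdims=True)  # renormalize the loadings
```

With `P = 0` this recovers the unpenalized eigenequation above.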