GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Fix regularized FPCA with FDataGrid #497

Closed · Ddelval closed 1 year ago

Ddelval commented 1 year ago

The issue

When applying FPCA with regularization to FDataGrid objects, the obtained components are orthogonal with respect to the usual inner product, which is not the inner product that should be used in the regularized setting. As specified in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 178), the components should not be orthogonal in the usual sense; instead, they should fulfill

$$ \int \xi_j(s) \xi_k(s) ds + \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

where $\xi_i$ are the loadings and $D^2$ is the second-order differential operator.

Steps to reproduce

The following code shows the problem:

import skfda
from skfda.misc.operators import LinearDifferentialOperator
from skfda.misc.regularization import L2Regularization
from skfda.preprocessing.dim_reduction import FPCA
from skfda.representation.basis import Fourier

X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
X = X[:300]
y = y[:300]

n_components = 4

fpca_regression = FPCA(
    n_components=n_components,
    regularization=L2Regularization(
        LinearDifferentialOperator(1),
        regularization_parameter=20,
    ),
)

fpca_regression.fit(X, y)

matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
print("Grid regularized:\n", matrix)

grid_points = X.grid_points
X = X.to_basis(Fourier(n_basis=10))

fpca_regression = FPCA(
    n_components=n_components,
    regularization=L2Regularization(
        LinearDifferentialOperator(1),
        regularization_parameter=0.25,
    ),
)
fpca_regression.fit(X, y)

matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
print("Basis regularized:\n", matrix)

Output:

Grid regularized:
 [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
 [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
 [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
 [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
Basis regularized:
 [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
 [-0.0180874   0.79784464 -0.06858017  0.07861877]
 [-0.01324601 -0.06858017  0.74805033  0.09757001]
 [ 0.0030897   0.07861877  0.09757001  0.70994851]]

In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

Current implementation

The relevant extract of the code is the following:

# Discretized identity "basis": one indicator function per grid point
basis = FDataGrid(
    data_matrix=np.identity(n_points_discretization),
    grid_points=X.grid_points,
)
# Penalty matrix of the chosen regularization, evaluated on that basis
regularization_matrix = compute_penalty_matrix(
    basis_iterable=(basis,),
    regularization_parameter=1,
    regularization=self.regularization,
)

basis_matrix = basis.data_matrix[..., 0]
if regularization_matrix is not None:
    basis_matrix += regularization_matrix

# Transform the data: X (I + P)^{-1}
fd_data = np.linalg.solve(
    basis_matrix.T,
    fd_data.T,
).T

# see docstring for more information
final_matrix = fd_data @ np.sqrt(weights_matrix)

# Plain multivariate PCA on the transformed data
pca = PCA(n_components=self.n_components)
pca.fit(final_matrix)
# Map the multivariate components back through W^{-1/2}
self.components_ = X.copy(
    data_matrix=np.transpose(
        np.linalg.solve(
            np.sqrt(weights_matrix),
            np.transpose(pca.components_),
        ),
    ),
    sample_names=(None,) * self.n_components,
)

We can see that the code "matches" the TFG (bachelor's thesis) corresponding to this functionality. In this TFG, it is stated that the problem is equivalent to solving multivariate PCA on the following matrix:

$$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however, that the correct expression would have the inverse of the denominator multiplying the numerator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

$$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.
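For reference, writing that fraction with the inverse of the denominator applied from the right, as argued above, the intended matrix would read

$$ \mathbf{V}=\frac{1}{\sqrt{N}}\,\mathbf{X}\mathbf{W}^{1/2}\left(\mathbf{I}_M + \mathbf{P}_k\right)^{-1}. $$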

fda.usc implementation

The implementation in fda.usc is the following:

 if (lambda > 0) {
    if (is.vector(P)) {
      P <- P.penalty(tt, P)
    }
    dimp <- dim(P)
    if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
      stop("Incorrect matrix dimension P")
    }
    M <- solve(diag(J) + lambda * P)
    Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
  }

where `tt` are the grid points, `J` is the number of grid points, `lambda` is the regularization parameter, and `P` is the penalty matrix (or the vector of coefficients from which `P.penalty(tt, P)` builds it).

Therefore, the implementation in fda.usc is equivalent to the current one and also returns the identity matrix as the inner product matrix. A corresponding issue (moviedo5/fda.usc#6) has been opened in the fda.usc repository.
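To illustrate why both implementations report an identity inner product matrix, here is a minimal NumPy sketch of the transformation they apply, using toy data and a simple second-difference matrix as a stand-in for the real penalty matrix (all names here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, M, lam = 100, 50, 20.0            # samples, grid points, regularization parameter
X = rng.normal(size=(N, M))          # toy discretized data
D2 = np.diff(np.eye(M), 2, axis=0)   # second-difference operator (illustrative penalty)
P = D2.T @ D2                        # penalty matrix

# Both the current implementation and fda.usc compute X (I + lam * P)^{-1}
# and then run plain multivariate PCA on the result.
X_transformed = np.linalg.solve((np.eye(M) + lam * P).T, X.T).T

pca = PCA(n_components=4).fit(X_transformed)

# Plain PCA always returns Euclidean-orthonormal components, so the inner
# product matrix of the loadings is the identity regardless of lam.
print(pca.components_ @ pca.components_.T)   # ~ identity
```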

Proposed solution

Mathematical analysis

Non-regularized FPCA summary

In the usual PCA, the goal is to maximize the variance of the projections onto the principal components, subject to two restrictions. That is to say, maximize

$$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

subject to:

- $\boldsymbol\phi^\intercal \boldsymbol\phi = 1$ (unit norm), and
- $\boldsymbol\phi^\intercal \boldsymbol\phi_j = 0$ for every previously obtained component $\boldsymbol\phi_j$ (orthogonality).
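For completeness, under these constraints the maximization reduces to the standard eigenvalue problem for the sample covariance matrix:

$$ \frac{1}{N}\mathbf X^\intercal \mathbf X \,\boldsymbol\phi = a \boldsymbol\phi, $$

where $a$ is the corresponding eigenvalue.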

Regularized FPCA

The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalize the second derivative, we can define it as:

$$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

- $\|\phi\|^2 = 1$ (unit norm), and
- $\int \phi_i \phi_j + \lambda \int D^2\phi_i \, D^2\phi_j = 0$ for previously obtained components (orthogonality with respect to the penalized inner product).

For simplicity, we will ignore the quadrature for now and assume that the matrix of integration weights is the identity, $\mathbf{W}=\mathbf{Id}$. Then, we can make the following substitutions:

$$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of values of $\phi$ at the grid points, and $\mathbf X$ is the usual data matrix, that is, the functions evaluated at the grid points.

We can also rewrite the orthogonality restriction as:

$$ 0=\int \phi_i \phi_j + \lambda \int D^2\phi_i \, D^2\phi_j = \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i ^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi_i^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi_j $$

In light of these relationships, we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$ and the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Substituting, we obtain the following equation:

$$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

Then the constraints become:

$$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

$$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

Therefore, the problem is equivalent to the usual PCA problem with the following changes:

- the data matrix becomes $\mathbf X (\mathbf L^\intercal)^{-1}$, and
- the resulting components $\boldsymbol \psi$ must be transformed back through $\boldsymbol \phi = (\mathbf L^\intercal)^{-1} \boldsymbol \psi$.

Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can easily be fixed by normalizing them at the end.
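A minimal NumPy sketch of this Cholesky-based approach, assuming $\mathbf W = \mathbf{Id}$ and with `P2` playing the role of the already assembled $\mathbf P_2^\intercal \mathbf P_2$ above (illustrative names, not the actual code in the pull request):

```python
import numpy as np
from sklearn.decomposition import PCA

def regularized_grid_fpca(X, P2, lam, n_components):
    """Sketch of penalized grid FPCA via the Cholesky change of variable (W = Id)."""
    M = X.shape[1]
    L = np.linalg.cholesky(np.eye(M) + lam * P2)       # L L^T = Id + lam * P2
    X_tilde = np.linalg.solve(L, X.T).T                # transformed data: X (L^T)^{-1}
    psi = PCA(n_components=n_components).fit(X_tilde).components_
    phi = np.linalg.solve(L.T, psi.T).T                # undo change of variable: phi = (L^T)^{-1} psi
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)  # renormalize the loadings
    return phi
```

With `lam = 0` this reduces to plain multivariate PCA on the discretized data.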

Progress so far

This Cholesky decomposition approach has been implemented in the attached pull request. Note, however, that the code has not been optimized yet (it still uses `np.linalg.inv` instead of `np.linalg.solve`, for example), nor has it been tested extensively.

Regularization parameter effect

The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in the grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

import matplotlib.pyplot as plt

import skfda
from skfda.misc.operators import LinearDifferentialOperator
from skfda.misc.regularization import L2Regularization
from skfda.preprocessing.dim_reduction import FPCA
from skfda.representation.basis import Fourier

X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
X = X[:300]
y = y[:300]

def fit_plot(ax, title, regularization_parameter=0):
    fpca_regression = FPCA(
        n_components=4,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=regularization_parameter,
        ),
    )
    fpca_regression.fit(X, y)
    fpca_regression.components_.plot(axes=ax)
    ax.set_title(title)
    ax.set_xlabel("")
    ax.set_ylabel("")

fig, ax = plt.subplots(3, 2, figsize=(10, 10))

fit_plot(ax[0, 0], "Grid not regularized")
fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)

grid_points = X.grid_points
X = X.to_basis(Fourier(n_basis=10))

fit_plot(ax[1, 0], "Basis not regularized")
fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)

# Convert back to grid using the same points as before
X = X.to_grid(grid_points)

fit_plot(ax[2, 0], "Grid from basis not regularized")
fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)

plt.show()

[Figure: the six principal component plots produced by the code above.]

As can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

Ddelval commented 1 year ago

In the end, the solution has been to switch to a different eigenequation to calculate the standard grid FPCA. This other equation has the advantage that it can easily be modified to incorporate the penalization.

When we are still dealing with functions, the eigenequation that the principal components ($\xi$) must fulfill is $V\xi= a\xi$, where $V$ is the covariance operator and $a$ is the eigenvalue. However, there are several ways to discretize this equation.

One option, the one explained in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 165), is to estimate $V\xi \approx \mathbf V\mathbf W\hat{\xi}$, where $\mathbf V$ is the covariance matrix, $\mathbf W$ is the diagonal matrix of weights and $\hat{\xi}$ is the vector of discretized values of the principal component. Using this approximation, the eigenequation becomes $\mathbf V\mathbf W\hat{\xi} = a\hat{\xi}$. Then, applying the change of variable $\mathbf u = \mathbf W^{1/2}\hat{\xi}$, we can rewrite it as the symmetric eigenequation $\mathbf W^{1/2}\mathbf V \mathbf W^{1/2} \mathbf u = a \mathbf u$. This is the eigenequation that was being solved before.
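Spelled out, the substitution works as follows:

$$ \mathbf V\mathbf W\hat{\xi} = a\hat{\xi}, \quad \hat{\xi} = \mathbf W^{-1/2}\mathbf u \;\Longrightarrow\; \mathbf V\mathbf W^{1/2}\mathbf u = a\mathbf W^{-1/2}\mathbf u \;\Longrightarrow\; \mathbf W^{1/2}\mathbf V\mathbf W^{1/2}\mathbf u = a\mathbf u. $$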

Another option is to see that the variance of the projections (the quantity to maximize) has, up to the $1/N$ factor, the form

$$ \mathrm{var}\left(\int X\xi\right) \approx \hat \xi^\intercal \mathbf W \mathbf X^\intercal \mathbf X \mathbf W \hat \xi $$

and the unit norm restriction is $\hat \xi^\intercal \mathbf W \hat \xi = 1$. This can be rewritten as maximizing $\langle\hat \xi, \mathbf W \mathbf X^\intercal \mathbf X \mathbf W \hat\xi \rangle$ subject to $\langle\hat \xi, \mathbf W \hat \xi \rangle = 1$.

Then, using the properties in Appendix A5 of Ramsay and Silverman, this corresponds to the generalized eigenequation $\mathbf W \mathbf X^\intercal \mathbf X \mathbf W \mathbf u = a \mathbf W\mathbf u$, where $a$ is the eigenvalue and $\mathbf u$ is the desired eigenvector.

Now, consider the Cholesky decomposition $\mathbf L \mathbf L^\intercal = \mathbf W$ and the change of variable $\mathbf u = (\mathbf L^\intercal) ^{-1}\mathbf v$. Then, the eigenequation becomes $\mathbf L^{-1} \mathbf W \mathbf X^\intercal \mathbf X \mathbf W (\mathbf L^{-1})^\intercal \mathbf v = a \mathbf v$. This is the eigenequation that is being solved now. After noticing the symmetry, one can see that this problem is equivalent to the multivariate problem, but using the matrix $\mathbf X \mathbf W (\mathbf L^\intercal)^{-1}$ as the data matrix.

Then, when moving on to the penalized problem, as already stated in this issue, it suffices to change the decomposed matrix so that $\mathbf L \mathbf L^\intercal = \mathbf W + \mathbf P$, where $\mathbf P$ is the penalization matrix computed for the linear operator $P$; in other words, $\boldsymbol \phi^\intercal \mathbf P \boldsymbol \phi = \| P(\phi)\|^2$.
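A minimal NumPy sketch of this final formulation, with illustrative names and assuming a centered data matrix `X`, the quadrature weight matrix `W`, and the penalty matrix `P` (not the actual code in the pull request):

```python
import numpy as np

def regularized_grid_fpca_components(X, W, P, n_components):
    """Solve W X^T X W u = a (W + P) u through the Cholesky factor of W + P."""
    L = np.linalg.cholesky(W + P)               # L L^T = W + P
    data = np.linalg.solve(L, (X @ W).T).T      # multivariate data matrix: X W (L^T)^{-1}
    # Plain PCA on `data`, here via SVD: the rows of Vt are the eigenvectors v_j.
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    v = Vt[:n_components]
    u = np.linalg.solve(L.T, v.T).T             # undo the change of variable: u = (L^T)^{-1} v
    return u / np.linalg.norm(u, axis=1, keepdims=True)  # renormalize the loadings
```

With `P = 0` this recovers the unpenalized eigenequation above.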