dmey / synthia

📈 🐍 Multidimensional synthetic data generation with Copula and fPCA models in Python
https://dmey.github.io/synthia
MIT License

fPCA documentation #18

Closed: khinsen closed this issue 3 years ago

khinsen commented 3 years ago

Describe the bug

The documentation page on fPCA says:

PCA can be used to generate synthetic data for the high-dimensional vector $X$. For every instance $X_i$ in the data set, we compute the principal component scores $a_{i, 1}, \dots, a_{i, K}$. Because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated and we may treat them as independent.

The claim that "because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated" looks wrong to me. These scores are projections of the $X_i$ onto the elements of an orthonormal basis. That doesn't make them uncorrelated. There are lots of orthonormal bases one can project on, and for most of them the projections are not uncorrelated. You need some property of the distribution of $X$ to derive a zero correlation, for example a Gaussian distribution, for which the PCA basis yields approximately uncorrelated projections.

tnagler commented 3 years ago

This is indeed not phrased optimally. It's not just because the components are orthogonal, but because they are orthogonal eigenvectors of the covariance matrix. Here's a nice little proof.
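For reference, one standard form of the argument can be sketched as follows. The notation is an assumption made here for illustration, not taken from the documentation page: $\mu$ is the mean of $X$, $\Sigma$ its covariance matrix, $\lambda_k$ the eigenvalue belonging to $v_k$, and the scores are defined as $a_{i,k} = v_k^\top (X_i - \mu)$.

$$
\operatorname{Cov}(a_{i,j}, a_{i,k})
= v_j^\top \, \mathbb{E}\!\left[(X_i - \mu)(X_i - \mu)^\top\right] v_k
= v_j^\top \Sigma \, v_k
= \lambda_k \, v_j^\top v_k
= \begin{cases} \lambda_k, & j = k, \\ 0, & j \neq k. \end{cases}
$$

The third equality uses $\Sigma v_k = \lambda_k v_k$, and the last uses orthonormality of the $v_k$. Nothing in the chain depends on the shape of the distribution of $X$, only on $\Sigma$ being finite. The same steps go through with the sample mean and sample covariance matrix, in which case the sample scores are exactly uncorrelated.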

The scores are uncorrelated irrespective of the distribution of $X$ though. For Gaussian $X$ that just means they're also independent.
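A quick numerical check of both points might look like the sketch below. It uses plain NumPy rather than synthia, and the data set (linearly mixed exponential variables) and all variable names are made up for illustration. It compares the off-diagonal covariance of PCA scores with that of projections onto an arbitrary orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-Gaussian data: independent exponential variables,
# mixed linearly so that the columns of X are correlated.
n, d = 5000, 4
latent = rng.exponential(size=(n, d))
mixing = rng.normal(size=(d, d))
X = latent @ mixing.T

# Center the data and compute the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Columns of V are orthonormal eigenvectors of the sample covariance.
eigvals, V = np.linalg.eigh(cov)

# PCA scores: projections onto the eigenvectors.
scores = Xc @ V
score_cov = np.cov(scores, rowvar=False)
max_pca = np.abs(score_cov - np.diag(np.diag(score_cov))).max()

# Projections onto some other orthonormal basis (QR of a random matrix).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
proj_cov = np.cov(Xc @ Q, rowvar=False)
max_rand = np.abs(proj_cov - np.diag(np.diag(proj_cov))).max()

print("largest off-diagonal covariance, PCA basis:   ", max_pca)
print("largest off-diagonal covariance, random basis:", max_rand)
```

With the PCA basis the off-diagonal entries vanish up to floating-point error, even though the data are far from Gaussian. With the random basis they generally do not. Zero covariance says nothing about higher-order dependence, so this check does not show that the scores are independent.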

So maybe just:

The scores are uncorrelated by construction and we may treat them as independent.

?

khinsen commented 3 years ago

You are right, and your proposal looks good. Let me just nitpick a bit: "we may treat them as independent" makes some hidden assumption about the context you are working in. How about "we treat them as independent", which is merely a statement of what you do?
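To make "we treat them as independent" concrete, here is a minimal sketch of what that modelling choice can look like in practice. It is not synthia's implementation or API: the function name `pca_resample` is hypothetical, and drawing each score from its empirical marginal is an assumption chosen for brevity.

```python
import numpy as np


def pca_resample(X, n_samples, rng):
    """Generate synthetic rows by resampling each PCA score independently.

    Hypothetical helper for illustration only; not part of synthia.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Columns of V are orthonormal eigenvectors of the sample covariance.
    _, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    scores = Xc @ V
    n, k = scores.shape
    # Draw a separate row index for every (sample, component) pair, so the
    # components of one synthetic row come from unrelated original rows.
    idx = rng.integers(0, n, size=(n_samples, k))
    new_scores = scores[idx, np.arange(k)]
    # Map the resampled scores back to the original coordinates.
    return new_scores @ V.T + mean


rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 3))
X = rng.exponential(size=(5000, 3)) @ mixing.T
synth = pca_resample(X, 20000, rng)

cov_X = np.cov(X, rowvar=False)
cov_synth = np.cov(synth, rowvar=False)
rel_err = np.abs(cov_synth - cov_X).max() / np.abs(cov_X).max()
print("relative covariance error:", rel_err)
```

Because the scores are uncorrelated, independent resampling preserves the covariance structure of $X$ up to sampling noise. Any dependence between scores beyond correlation is discarded, which is exactly the assumption that the wording "we treat them as independent" makes explicit.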

tnagler commented 3 years ago

Agreed!