Cumsum of explained_variance_ratio_ is greater than 1

ND-edward commented 2 years ago

I would like to perform PCA with n_components = 0.9

So I first used StandardScaler() from sklearn to standardize the values, which gave the values below:

[ 1.33865216  1.80350169  1.90692518  1.40305228]
 [ 0.98050987  0.68720789  0.33371191  0.67278892]
 [ 0.95059432  1.10958933  0.47673129  0.85535476]
 [-0.20264719 -0.54976631 -0.23836565 -0.81816542]
 [-1.01921185 -1.63589    -0.52440442 -1.85270517]
 [ 0.89958047 -0.03687457  0.90578946  0.79449948]
 [-1.16715811 -0.85146734 -1.23950137 -0.81816542]
 [-0.3867463   0.05363574 -0.09534626  0.33808489]]

Then I used CustomPCA (from advanced_pca import CustomPCA) to perform PCA with varimax rotation. Below is the code:

varimax_pca = CustomPCA(n_components=n_components, rotation='varimax', random_state=9527)

However, I found something strange: the cumsum of explained_variance_ratio_ is greater than 1.

pca_var_ratio = varimax_pca.fit(Z).explained_variance_ratio_
print(pca_var_ratio)
>>>[0.57124482 1.09019268]

Is this a bug? Is it normal for the cumsum of the explained variance ratio to be greater than 1? Thanks!

alfredsasko commented 2 years ago

Hi @ND-edward, n_components can only be a positive integer, as the aim of PCA is dimensionality and multicollinearity reduction. So in practice you reduce x features to y components, where y < x. Using this approach you should not get an explained variance ratio > 1.
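For comparison, scikit-learn's own PCA does accept a float between 0 and 1 for n_components when the full SVD solver is used, meaning "keep enough components to reach that share of the variance". Whether CustomPCA supports the same convention is not confirmed in this thread. A minimal sketch, assuming the rows posted in the first comment are used as Z (they may be only part of the full dataset, so the numbers will not necessarily match the ones reported here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows copied from the first comment in this thread (possibly an excerpt
# of the full standardized dataset).
Z = np.array([
    [ 1.33865216,  1.80350169,  1.90692518,  1.40305228],
    [ 0.98050987,  0.68720789,  0.33371191,  0.67278892],
    [ 0.95059432,  1.10958933,  0.47673129,  0.85535476],
    [-0.20264719, -0.54976631, -0.23836565, -0.81816542],
    [-1.01921185, -1.63589,    -0.52440442, -1.85270517],
    [ 0.89958047, -0.03687457,  0.90578946,  0.79449948],
    [-1.16715811, -0.85146734, -1.23950137, -0.81816542],
    [-0.3867463,   0.05363574, -0.09534626,  0.33808489],
])

# A float in (0, 1) asks scikit-learn's PCA to keep the smallest number of
# components whose cumulative explained variance ratio exceeds that value.
pca = PCA(n_components=0.9, svd_solver="full")
pca.fit(Z)

ratios = pca.explained_variance_ratio_
cumulative = ratios.cumsum()
print(ratios)
print(cumulative)  # for plain (unrotated) PCA this never exceeds 1
```

With unrotated PCA the ratios are eigenvalues divided by the total variance, so their cumulative sum is bounded by 1 by construction.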

ND-edward commented 2 years ago

@alfredsasko So if I have a dataset with 4 features, I should set n_components to an integer less than or equal to 3, rather than 4 or a decimal such as 0.9 or 0.8?

ND-edward commented 2 years ago

@alfredsasko I still get a cumsum of the explained variance ratio > 1. I set n_components = 3 for a dataset with 4 features, so the code is varimax_pca = CustomPCA(n_components=3, rotation='varimax', random_state=9527).

The resulting explained variance ratios are below:

pca_var_ratio = varimax_pca.fit(Z).explained_variance_ratio_
print(pca_var_ratio)
>[0.56433356 0.51035311 0.03468567]
print(pca_var_ratio.cumsum())
>[0.56433356 1.07468667 1.10937234]

The cumsum even reaches 1.1, which does not seem to make any sense.

alfredsasko commented 2 years ago

@ND-edward check this post: the variance ratio can be > 1 if varimax rotation is applied. https://www.researchgate.net/post/Why-doesnt-SPSS-show-the-of-variance-after-rotation

ND-edward commented 2 years ago

@alfredsasko Is it because the factors overlap after the rotation and become correlated, so that the sum of explained variance is > 1?

ND-edward commented 2 years ago

And what should I do to present the explained variance ratio of each component after the rotation? Would it be reasonable to use normal PCA (no rotation) to find the explained variance ratio, given that the explained ratio should be the same whether the components are rotated or not?

For example, I would first use the rotated PCA to discover which features contribute most to each component, e.g. feature a contributes most to the 1st component according to its coefficient. Then I would use the non-rotated PCA to find the explained variance ratio of the 1st component, say 60%. Hence, we could conclude that the feature explains 60% of the variance.

Would it work?

alfredsasko commented 2 years ago

@ND-edward indeed there might be a bug, as varimax rotation is orthogonal and the sum of eigenvalues (expressed as explained variance ratios) should not be > 1. So I advise you to do as you proposed: run normal PCA, log the % of explained variance, and run varimax rotation to interpret the factors. Note that the equality holds only for the sum of eigenvalues, not for individual values, as those change with varimax rotation. Feel free to look into the bug and submit a change request.
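The point about the sum being preserved can be checked numerically. The sketch below is not CustomPCA's implementation; it is a NumPy-only illustration under two assumptions: a common textbook varimax iteration (the helper `varimax` is hypothetical), and the explained variance of a rotated component defined as the sum of its squared loadings. Because the rotation matrix is orthogonal, the total of the squared loadings is unchanged, so the rotated ratios redistribute across components but keep the same sum. The data are the rows posted in the first comment, which may be only part of the full dataset.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Hypothetical helper: textbook varimax via repeated SVD.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        grad = loadings.T @ (rotated ** 3 - rotated @ col_ss / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt  # product of orthogonal factors stays orthogonal
        new_criterion = s.sum()
        if criterion != 0 and new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

# Rows copied from the first comment (possibly an excerpt of the dataset).
Z = np.array([
    [ 1.33865216,  1.80350169,  1.90692518,  1.40305228],
    [ 0.98050987,  0.68720789,  0.33371191,  0.67278892],
    [ 0.95059432,  1.10958933,  0.47673129,  0.85535476],
    [-0.20264719, -0.54976631, -0.23836565, -0.81816542],
    [-1.01921185, -1.63589,    -0.52440442, -1.85270517],
    [ 0.89958047, -0.03687457,  0.90578946,  0.79449948],
    [-1.16715811, -0.85146734, -1.23950137, -0.81816542],
    [-0.3867463,   0.05363574, -0.09534626,  0.33808489],
])

# Plain PCA through the SVD of the centered data.
Zc = Z - Z.mean(axis=0)
_, S, Vt = np.linalg.svd(Zc, full_matrices=False)
eigvals = S ** 2 / (Z.shape[0] - 1)
total_var = eigvals.sum()

k = 3  # same number of components as in the second experiment above
loadings = Vt[:k].T * np.sqrt(eigvals[:k])  # shape (features, k)
unrot_ratio = eigvals[:k] / total_var

# Rotate, then measure each component by its sum of squared loadings.
rotated, R = varimax(loadings)
rot_ratio = (rotated ** 2).sum(axis=0) / total_var

print(unrot_ratio, unrot_ratio.sum())
print(rot_ratio, rot_ratio.sum())  # individual values differ, sum matches
```

Under these assumptions, a cumulative ratio above 1 after an orthogonal rotation would point to a normalization issue in how the rotated variances are computed rather than to a property of varimax itself.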

alfredsasko / advanced-principle-component-analysis

Cumsum of explained_variance_ratio_ is greater than 1 #5