MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA
https://maxhalford.github.io/prince
MIT License

Difference in FAMD eigenvalues by prince and PCAmixdata/FactoMineR? #74

Closed · nchelaru closed this issue 1 year ago

nchelaru commented 5 years ago

Hello!

First of all, great job on the package! :)

I'm just starting to learn about FAMD and have been trying to do the analysis in both R and Python. Strangely, while I get identical results on a dataset with the two R packages available for FAMD, PCAmixdata and FactoMineR, I get quite different eigenvalues from prince. I think I must be accessing the wrong attribute, as the more downstream analyses done with prince do give the same results as the two R packages.

For example, this is the code I am using with FactoMineR:

## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df, 
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE, 
                 ncp=25)

## Inspect principal components
get_eigenvalue(res.famd)

And these are the results I am getting:

eigenvalue variance.percent cumulative.variance.percent
Dim.1 4.50988153612814 19.6081805918615 19.6081805918615
Dim.2 3.12384033884342 13.5819145167105 33.190095108572
Dim.3 1.82860777021443 7.95046856614967 41.1405636747217
Dim.4 1.17979883732575 5.12956016228587 46.2701238370076
Dim.5 1.04993667220825 4.56494205307936 50.8350658900869
Dim.6 1.01660230193411 4.42001000840917 55.2550758984961
Dim.7 1.00407432481249 4.36554054266298 59.6206164411591
Dim.8 0.985272969616078 4.2837955200699 63.904411961229
Dim.9 0.90429878988165 3.93173386905065 67.8361458302796
Dim.10 0.84681410468683 3.68180045516013 71.5179462854397

With prince:

## Import libraries
import pandas as pd
import prince
import pprint

## Import data
df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Instantiate FAMD object
famd = prince.FAMD(
     n_components=25,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',       ## Can be 'auto', 'sklearn' or 'fbpca'
     random_state=42)

## Fit FAMD object to data 
famd = famd.fit(df.drop('Churn', axis=1)) ## Exclude target variable "Churn"

## Inspect principal dimensions
pp = pprint.PrettyPrinter()
pp.pprint(famd.explained_inertia_) 

I am getting very different numbers:

[0.5374656498067355,
 0.08801861905276565,
 0.057284992015226376,
 0.03937333799033976,
 0.03127878671244274,
 0.027912264693630378,
 0.02470305613207065,
 0.020807289598498833,
 0.018937227436470073,
 0.018005390670320004,
 0.01656022218673026,
 0.015976762563960006,
 0.014945073037668155,
 0.013999462067402076,
 0.013763382061419126,
 0.013589921877364007,
 0.012208282136007768,
 0.011979370577465807,
 0.011339881479543259,
 0.0070026572256823745,
 0.004847835526330159,
 4.80377794662604e-07,
 5.374980569743592e-08,
 1.0243263922726394e-09,
 5.366068243995469e-33]

I'm sure that I am just calling the wrong thing, but I can't seem to find what I should be using to get the same results as FactoMineR.
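
In case it helps, this is how I have been lining the prince numbers up against the variance.percent column from get_eigenvalue() (just a sketch; as far as I can tell explained_inertia_ holds proportions, so I convert them to percentages first):

## Convert prince's inertia proportions into the same percentage
## layout as FactoMineR's get_eigenvalue() output
summary = pd.DataFrame({'variance.percent': [p * 100 for p in famd.explained_inertia_]})
summary['cumulative.variance.percent'] = summary['variance.percent'].cumsum()
print(summary.head(10))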

Any help will be greatly appreciated! :)

MaxHalford commented 5 years ago

Hey! I'll look into this as soon as I can.

ghost commented 5 years ago

Difference in results on the iris dataset (this time including the target variable in the analysis):

With R:

Call:
FAMD(base = iris, ncp = 3) 

Eigenvalues
                      Dim.1  Dim.2  Dim.3
Variance              3.870  1.342  0.592
% of var.            64.503 22.370  9.862
Cumulative % of var. 64.503 86.873 96.735

I tried the same with Prince without normalization:

import numpy as np
import pandas as pd
import prince
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]))
X.iloc[:, -1] = X.iloc[:, -1].astype("str")  ## Treat the target as categorical
famd = prince.FAMD(n_components=3, n_iter=10)
famd.fit(X)
print(famd.explained_inertia_)

[0.33736997081392395, 0.3314854899776377, 0.33114453920843834]

This looks like a data normalization/scaling issue: all principal components essentially depict the three levels of the categorical target variable ("species"), i.e. practically no variance is explained by the continuous features!
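
One way to sanity-check this from the user side is to standardize the continuous columns manually before fitting, which yields far more plausible proportions. A minimal sketch, assuming sklearn's StandardScaler and the same 0.7.x prince API as above:

import numpy as np
import pandas as pd
import prince
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]))
X.iloc[:, -1] = X.iloc[:, -1].astype("str")

## Standardize the four continuous columns ourselves, since FAMD
## does not appear to rescale them internally
num_cols = X.columns[:-1]
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

famd = prince.FAMD(n_components=3, n_iter=10)
famd.fit(X)
print(famd.explained_inertia_)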

To confirm this at the library level, if one sets the arguments rescale_with_mean=True and rescale_with_std=True in the super().__init__ call in mfa.py, i.e. in the global PCA, the results look better:

print(famd.explained_inertia_)

[0.6412568596845379, 0.2592472748780183, 0.09949586543744376]
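
For reference, the edit amounts to roughly the following inside mfa.py; the surrounding signature is approximate (0.7.x layout), so treat this as a sketch of the idea rather than a drop-in patch:

## In prince/mfa.py, where MFA.__init__ forwards to the global PCA:
super().__init__(
    rescale_with_mean=True,   ## was False
    rescale_with_std=True,    ## was False
    n_components=n_components,
    n_iter=n_iter,
    copy=copy,
    check_input=check_input,
    random_state=random_state,
    engine=engine
)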

What do you think is going on @MaxHalford ?

MaxHalford commented 1 year ago

Hello there 👋

I apologise for not answering earlier; I had stopped maintaining Prince for a while. However, I have just refactored the entire codebase, which should have fixed many bugs.

I don't have the time and energy to check whether this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version, i.e. version 0.8.0 and onwards.
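
For what it's worth, the rewritten API exposes the eigenvalues directly, so the comparison with FactoMineR's get_eigenvalue() should now be immediate. Something along these lines, sketched against the 0.8+ API and untested on your exact data:

import pandas as pd
import prince

df = pd.read_csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

famd = prince.FAMD(n_components=25, random_state=42)
famd = famd.fit(df.drop('Churn', axis=1))

## One table with the eigenvalue, % of variance and cumulative %
## per component, mirroring FactoMineR's get_eigenvalue()
print(famd.eigenvalues_summary)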