MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA
https://maxhalford.github.io/prince

Error while applying .transform() #117

Closed · nico695 closed this issue 1 year ago

nico695 commented 3 years ago

Same error as documented in #56.

I tried downgrading to version 0.7.0 via the repository linked in that thread, but I still get the same dimensionality error.

Here is the code:

import numpy as np
import pandas as pd

X_n = pd.DataFrame(data=np.random.rand(10000, 2), columns=list('AB'))
X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10000, 4), replace=True), columns=list('CDEF'))
X = pd.concat([X_n, X_c], axis=1)

from prince import FAMD

famd = FAMD(n_components=6, n_iter=100)
famd.fit(X)

famd.transform(X.iloc[1:10, :])

I get the same error in both version 0.7.0 and 0.7.1:

ValueError: shapes (9,20) and (22,6) not aligned: 20 (dim 1) != 22 (dim 0)

christophe-williams commented 3 years ago

I've run into this issue a few times, and it looks like it comes from how dummies are generated in _build_X_global. When the dataset you are transforming does not contain examples of every categorical level present in the original (fitted) dataset, the resulting dummified dataset has fewer columns (in this case, 20 rather than 22).
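
Here is a minimal sketch of that behaviour, using pd.get_dummies directly rather than prince's internal _build_X_global (so the exact mechanism is an assumption, but the column-count mismatch it illustrates is the same):

import numpy as np
import pandas as pd

X_c = pd.DataFrame(
    np.random.choice(list('abcde'), size=(10000, 4), replace=True),
    columns=list('CDEF'),
)

# Dummifying the full data gives one column per level: 4 columns x 5 levels = 20
full_dummies = pd.get_dummies(X_c)

# A small slice may be missing some levels, so it gets fewer dummy columns
subset_dummies = pd.get_dummies(X_c.iloc[1:10, :])

print(full_dummies.shape[1], subset_dummies.shape[1])  # may print e.g. 20 vs 18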

The suggested fix for this (and for #56 and #116) is to store the dummified columns in the FAMD and MFA models. If a new dataset being transformed only contains a subset of the categorical levels, its dummified dataset would then still have the right number of columns, with one or more columns being all zeros. If a new dataset contains previously unseen categorical levels, it should probably raise an error.
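
A rough sketch of what that could look like, with hypothetical helper names (this is not prince's API, just an illustration of the idea):

import pandas as pd

def fit_dummies(X_cat):
    # Dummify the training categoricals and remember the resulting columns
    dummies = pd.get_dummies(X_cat)
    return dummies, dummies.columns

def transform_dummies(X_cat, fitted_columns):
    # Dummify new data and align it to the columns seen at fit time
    dummies = pd.get_dummies(X_cat)
    unseen = dummies.columns.difference(fitted_columns)
    if len(unseen) > 0:
        raise ValueError(f'Unseen categorical levels: {list(unseen)}')
    # Levels missing from the new data become all-zero columns,
    # so the shape matches what the fitted model expects
    return dummies.reindex(columns=fitted_columns, fill_value=0)

Storing the fitted columns at fit time and reindexing in transform would make the dummified data (9, 22) rather than (9, 20), so it lines up with the (22, 6) projection matrix in the error above.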

sibmike commented 3 years ago

I had the same issue, so I had to make sure my train, validation, and test sets contain examples of all the categorical levels before fitting MCA, and drop the columns where they don't:

keep = []
for clmn in X_train_cat.columns:
    # Keep a column only if the same set of levels appears in every split
    train_cats = set(X_train_cat[clmn].unique())
    val_cats = set(X_val_cat[clmn].unique())
    test_cats = set(X_test_cat[clmn].unique())
    keep.append(train_cats == val_cats == test_cats)

keep_columns = X_train_cat.columns[keep]
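
As a quick sketch of how the mask can then be applied (assuming the same split variable names as above):

# Keep only the categorical columns whose level sets agree across the splits
X_train_cat = X_train_cat[keep_columns]
X_val_cat = X_val_cat[keep_columns]
X_test_cat = X_test_cat[keep_columns]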

But that's obviously an awkward temporary workaround just to make things run. The dummy-matrix fix @christophe-williams suggested would be nice to have.

MaxHalford commented 1 year ago

Hello there 👋

I apologise for not answering earlier; I was no longer maintaining Prince at the time. However, I have just refactored the entire codebase, and the refactoring should have fixed many bugs.

I don't have the time and energy to check whether this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version, that is, version 0.8.0 and onwards.