MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA
https://maxhalford.github.io/prince
MIT License

MemoryError issue #15

Closed GoingMyWay closed 6 years ago

GoingMyWay commented 6 years ago

My machine has 120 GB of RAM, of which 40 GB are free for the MCA computation.

The DataFrame has a shape of (1244210, 37), and I have already processed it with the pandas get_dummies() function.

I want to get 10 components, but I get a MemoryError:

>>> mca_result = prince.MCA(X_MCA, n_components=10)
MemoryError                               Traceback (most recent call last)
<ipython-input-20-ee2308cc121f> in <module>()
----> 1 mca_result = prince.MCA(X_MCA, n_components=10)

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/mca.py in __init__(self, dataframe, n_components, use_benzecri_rates, plotter)
     43             dataframe=pd.get_dummies(dataframe),
     44             n_components=n_components,
---> 45             plotter=plotter
     46         )
     47 

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in __init__(self, dataframe, n_components, plotter)
     26         self._set_plotter(plotter_name=plotter)
     27 
---> 28         self._compute_svd()
     29 
     30     def _compute_svd(self):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in _compute_svd(self)
     29 
     30     def _compute_svd(self):
---> 31         self.svd = SVD(X=self.standardized_residuals, k=self.n_components)
     32 
     33     def _set_plotter(self, plotter_name):

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in standardized_residuals(self)
    123         """
    124         residuals = (self.P - self.expected_frequencies).values
--> 125         return self.row_masses.dot(residuals).dot(self.column_masses)
    126 
    127     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/prince/ca.py in row_masses(self)
     99             represents the weight of the matching row; the non-diagonal cells are equal to 0.
    100         """
--> 101         return np.diag(1 / np.sqrt(self.row_sums))
    102 
    103     @property

/home/libertatis/anaconda3/lib/python3.6/site-packages/numpy/lib/twodim_base.py in diag(v, k)
    247     if len(s) == 1:
    248         n = s[0]+abs(k)
--> 249         res = zeros((n, n), v.dtype)
    250         if k >= 0:
    251             i = k

MemoryError: 

I have 40 GB of memory free, and I can apply PCA to the same DataFrame without problems. How can I solve this?

I found a similar issue on this problem: https://github.com/esafak/mca/issues/15

MaxHalford commented 6 years ago

Hey,

Can you please try the latest version of Prince (0.3.0) with copy=False? It should be more efficient.

Regards.

MaxHalford commented 6 years ago

I'm closing this, but feel free to reopen it if it's still an issue. The MCA class now uses sparse diagonalization so it shouldn't be an issue anymore.
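
For scale: the first traceback fails inside np.diag, which builds a dense n x n array where n is the number of rows. The sketch below estimates the size of that array and shows how a sparse diagonal (or plain broadcasting) gives the same scaling without it. It uses scipy.sparse.diags as a stand-in and is not Prince's actual implementation; the helper name dense_diag_bytes and the toy sizes are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse


def dense_diag_bytes(n, itemsize=8):
    """Bytes needed for a dense n x n array (float64 by default)."""
    return n * n * itemsize


# Row count from the original report: np.diag would need an n x n float64 array.
n_rows = 1_244_210
print(f"dense diagonal: {dense_diag_bytes(n_rows) / 1024**4:.1f} TiB")

# Toy data (hypothetical sizes) to compare the dense and sparse routes.
rng = np.random.default_rng(0)
row_sums = rng.integers(1, 38, size=1_000).astype(float)
residuals = rng.random((1_000, 5))
weights = 1 / np.sqrt(row_sums)

dense_scaled = np.diag(weights) @ residuals        # allocates 1000 x 1000 values
sparse_scaled = sparse.diags(weights) @ residuals  # stores only the 1000 diagonal values
broadcast_scaled = weights[:, None] * residuals    # same result without any matrix

print(np.allclose(dense_scaled, sparse_scaled))
print(np.allclose(dense_scaled, broadcast_scaled))
```

With more than a million rows the dense diagonal alone is in the terabyte range, far beyond the 40 GB available, which is why the dense route cannot work at this size regardless of n_components.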

abdoulsn commented 4 years ago

Same error here. Hello, I ran the code below on data of shape (645000, 2) in a Jupyter notebook and got this error:

import prince
mca = prince.MCA(n_components=2, engine='sklearn', copy=False, n_iter=3)
mca = mca.fit(data_cat) 
mca = mca.transform(data_cat)

Error

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-68-529e888de2c5> in <module>
      1 import prince
      2 mca = prince.MCA(n_components=2, engine='sklearn', copy=False, n_iter=3)
----> 3 mca = mca.fit(data_cat)
      4 mca = mca.transform(data_cat)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\mca.py in fit(self, X, y)
     25 
     26         # Apply CA to the indicator matrix
---> 27         super().fit(one_hot)
     28 
     29         # Compute the total inertia

~\AppData\Local\Continuum\anaconda3\lib\site-packages\prince\ca.py in fit(self, X, y)
     43 
     44         # Compute the correspondence matrix which contains the relative frequencies
---> 45         X = X / np.sum(X)
     46 
     47         # Compute row and column masses

MemoryError: Unable to allocate 4.40 GiB for an array with shape (680558, 867) and data type float64

What's the problem?

MaxHalford commented 4 years ago

Not too sure what's going on there @abdoulsn. Would it be possible to access your dataset?

abdoulsn commented 4 years ago

No sorry, which information do you need?

MaxHalford commented 4 years ago

Well, I need a minimal working example to reproduce the error. It would be helpful if you could generate a toy dataset with the same characteristics as yours and reproduce the error with it.

abdoulsn commented 4 years ago

The column cardinalities are ('reseau', 146) and ('cdapet', 721). There are no missing values, and I used copy=False.
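
A toy dataset with these cardinalities could be generated roughly as follows. This is only a sketch: the helper make_toy, the generated category labels, and the uniform sampling are assumptions, and the real data may be distributed quite differently.

```python
import numpy as np
import pandas as pd


def make_toy(n_rows, cardinalities, seed=0):
    """Build a categorical DataFrame with the given number of levels per column."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, k in cardinalities.items():
        codes = rng.integers(0, k, size=n_rows)
        labels = [f"{name}_{i}" for i in range(k)]  # made-up level names
        columns[name] = pd.Categorical.from_codes(codes, categories=labels)
    return pd.DataFrame(columns)


# Use n_rows=680_558 to match the shape in the error message;
# a smaller value keeps this example light.
toy = make_toy(10_000, {"reseau": 146, "cdapet": 721})
one_hot = pd.get_dummies(toy)
print(toy.shape, one_hot.shape)
```

Because the columns are categorical, get_dummies creates one indicator column per declared level, so the indicator matrix has 146 + 721 = 867 columns, matching the second dimension in the reported error.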

abdoulsn commented 4 years ago

Something like this:

        reseau  cdapet
0       XX      7010Z
1       YY      2030Z
2       YY      4674B
3       XZ      6820B
4       YY_XX   6820A
...     ...     ...
680553  XX      6832A
680554  YY      4120A
680555  XX_WX   7820Z
680556  YZ      4941A
680557  WX      4669A

MaxHalford commented 4 years ago

OK, I just tried it on my laptop and didn't run into any issue. It might be because I have more RAM (16 GB) than you do. However, the line of code that raised your MemoryError is clearly not optimal, because it allocates a new array instead of modifying X in place. I have therefore changed it to X /= np.sum(X).
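
To illustrate the difference, here is a small NumPy sketch. It first checks the size quoted in the error message, then compares the out-of-place and in-place divisions on a stand-in array. The helper array_gib and the small array size are assumptions for illustration; the sketch uses a plain ndarray, so behaviour on a pandas DataFrame may differ.

```python
import numpy as np


def array_gib(shape, dtype):
    """Size in GiB of a dense array with the given shape and dtype."""
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize / 1024**3


# Shape and dtype taken from the error message above.
reported_gib = array_gib((680558, 867), np.float64)
print(f"{reported_gib:.2f} GiB")

# Small stand-in array (hypothetical size) to compare the two forms.
X = np.ones((1_000, 867), dtype=np.float64)
original = X

Y = X / np.sum(X)   # out-of-place: allocates a second array as large as X
X /= np.sum(X)      # in place: reuses the buffer X already owns

same_buffer = X is original
separate_copy = not np.shares_memory(original, Y)
print(same_buffer, separate_copy)
```

One caveat: in-place true division only works on a floating-point array. On an integer array, such as uint8 dummies, NumPy raises a casting error, so the data has to be converted to float first, and that conversion itself allocates a new array.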

abdoulsn commented 4 years ago

Let me clean my notebook memory. Thanks

abdoulsn commented 4 years ago

It's ok after restarting my notebook.

MaxHalford commented 4 years ago

Cool glad to hear it.

thomlennon commented 3 years ago

Hi everyone, I am trying to run an MCA on a dataset of 62649 rows x 4 columns, and I get the same problem as abdoulsn. I also use Jupyter Notebook, and my computer has 16384 MB of RAM. I received this error message: "MemoryError: Unable to allocate 3.40 GiB for an array with shape (58264, 62649) and data type uint8"

Can you help, please? Thank you in advance.

This is my code below:

thomlennon commented 3 years ago

df.describe

               Cust_no Risk_Rating        Date  _Nb_day
0      ARAR64757686100        High  1989-07-14      9.0
1      SHDH64757636547         Low  1978-06-28     23.0
2      AYZY33546757585      Medium  1999-09-15     44.0
3      QISS46575859494      Medium  2000-02-18     61.0
4      SODJ24253673838        high  2001-07-22     50.0
...                ...         ...         ...      ...
62644  DGDT28387374645      Medium  2002-10-03     61.0
62645  ARZU36464748484        High  1993-03-06    232.0
62646  ZRRF16263636353        High  1950-02-13    356.0
62647  ERER14253536373        High  1992-05-30    224.0
62648  ETRF53536353536      Medium  2002-10-14    984.0

[62649 rows x 4 columns]>

mca = prince.MCA(n_components=3, n_iter=3, copy=False, engine='sklearn')

MemoryError                               Traceback (most recent call last)
----> 1 mca.fit(df2)

~/.local/lib/python3.6/site-packages/prince/mca.py in fit(self, X, y)
     22 
     23         # One-hot encode the data
---> 24         one_hot = pd.get_dummies(X)
     25 
     26         # Apply CA to the indicator matrix

/opt/disk1/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    897             )
    898             with_dummies.append(dummy)
--> 899         result = concat(with_dummies, axis=1)
    900     else:
    901         result = _get_dummies_1d(

/opt/disk1/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    285     )
    286 
--> 287     return op.get_result()
    288 
    289 

/opt/disk1/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/concat.py in get_result(self)
    501 
    502             new_data = concatenate_block_managers(
--> 503                 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy,
    504             )
    505             if not self.copy:

/opt/disk1/anaconda3/lib/python3.6/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     58         values = b.values
     59         if copy:
---> 60             values = values.copy()
     61         else:
     62             values = values.view()

MemoryError: Unable to allocate 3.40 GiB for an array with shape (58264, 62649) and data type uint8
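
The shape in this last error message, (58264, 62649), suggests that the indicator matrix built by pd.get_dummies has 58264 columns for 62649 rows. That would be consistent with a nearly unique column, such as an identifier, being one-hot encoded, although this is an inference from the traceback and is not confirmed in the thread. A minimal sketch of checking column cardinality before fitting is below; the helper name, the toy data, and the threshold of 50 are all hypothetical.

```python
import pandas as pd


def distinct_counts(df):
    """Distinct values per column, largest first.

    For object or categorical columns this is the number of indicator
    columns pd.get_dummies would create; numeric columns are passed
    through by get_dummies unchanged.
    """
    return df.nunique().sort_values(ascending=False)


# Toy frame (made-up values): one identifier-like column, one real category.
toy = pd.DataFrame({
    "Cust_no": [f"ID{i:05d}" for i in range(1_000)],
    "Risk_Rating": ["High", "Low", "Medium", "Medium"] * 250,
})

counts = distinct_counts(toy)
print(counts)

# Keep only columns below an arbitrary cardinality threshold (assumption: 50).
low_card = [col for col in toy.columns if counts[col] <= 50]
one_hot = pd.get_dummies(toy[low_card])
print(one_hot.shape)
```

Dropping or binning identifier-like columns before calling fit keeps the indicator matrix small. Whether that is appropriate depends on what the analysis is meant to capture.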