AxeldeRomblay / MLBox

MLBox is a powerful Automated Machine Learning python library.
https://mlbox.readthedocs.io/en/latest/

Cleaning takes too long time on multi-cores cpu #40

Closed a1a2y3 closed 5 years ago

a1a2y3 commented 6 years ago

Cleaning takes 276 s for the house price dataset on an Intel E5-2683 v3. Since the E5-2683 v3 has 14 cores and 28 threads, I guess the problem may be caused by n_jobs=-1 here:

    if (self.verbose):
        print("cleaning data ...")

    df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns),
                   axis=1)

    df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns),
                   axis=1)

I don't know how to fix it. Maybe add an n_jobs argument to the Reader class? Looking forward to your response. Thank you.
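For illustration, here is a minimal sketch of what a configurable n_jobs could look like. It is not MLBox's actual code: the function names clean and to_float_if_possible are hypothetical, and to_float_if_possible is only a simplified stand-in for the library's convert_float_and_dates.

```python
import pandas as pd
from joblib import Parallel, delayed


def to_float_if_possible(col):
    """Hypothetical stand-in for MLBox's convert_float_and_dates."""
    try:
        return col.astype(float)
    except (ValueError, TypeError):
        # Leave the column untouched when it is not numeric.
        return col


def clean(df, n_jobs=1):
    """Clean each column, letting the caller choose the worker count.

    n_jobs=1 runs sequentially in the current process;
    n_jobs=-1 asks joblib to use all available cores.
    """
    cols = Parallel(n_jobs=n_jobs)(
        delayed(to_float_if_possible)(df[col]) for col in df.columns
    )
    return pd.concat(cols, axis=1)
```

Defaulting to n_jobs=1 is a design assumption of this sketch: for small datasets the cost of starting worker processes can outweigh the gain from parallelism.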

a1a2y3 commented 6 years ago

Drift_thresholder() has the same problem. It takes 1.38 s on a Kaggle kernel and 176 s on my PC with an E5-2683 v3 CPU.

AxeldeRomblay commented 6 years ago

Hum... sounds very weird! It takes only 2 sec on my computer (7 cores). Have you tried setting n_jobs = 1 and running again?

a1a2y3 commented 6 years ago

Thank you for the reply. I think joblib or multiprocessing causes this problem, and I am trying to solve it. I use Windows 10 + Anaconda + Python 3.6 + VS2015; might that conflict with joblib?

With n_jobs=1, it seems OK:

    reading csv : train.csv ...
    cleaning data ...
    CPU time: 0.22528505325317383 seconds
    reading csv : test.csv ...
    cleaning data ...
    CPU time: 0.1932668685913086 seconds

With n_jobs=2, it dies.

a1a2y3 commented 6 years ago

From http://pythonhosted.org/joblib/parallel.html#common-usage I found this: "Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel." ... "No code should run outside of the “if __name__ == ‘__main__’” blocks, only imports and definitions."

Problem solved.
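A minimal sketch of the guarded layout described in that quote, assuming a script that calls joblib directly. The names to_float and clean_all are hypothetical and only stand in for real cleaning code.

```python
from joblib import Parallel, delayed


def to_float(value):
    """Hypothetical stand-in for a per-column cleaning function."""
    try:
        return float(value)
    except ValueError:
        return None


def clean_all(values, n_jobs=-1):
    # Only a definition: nothing is spawned when the module is imported.
    return Parallel(n_jobs=n_jobs)(delayed(to_float)(v) for v in values)


if __name__ == "__main__":
    # On Windows, worker processes may re-import this module. The guard
    # keeps them from re-running this block and spawning more workers.
    print(clean_all(["1", "2.5", "oops"], n_jobs=2))
```

The same layout applies to a script that calls MLBox: keep imports and definitions at the top level and move the calls that trigger parallel work under the guard.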

AxeldeRomblay commented 6 years ago

Yes, this is what I was wondering. At the moment MLBox does not support Windows, but it will soon :) Thank you very much for reporting this issue!!

DarquesM commented 6 years ago

I've got the same issue. Where should I set n_jobs=1? mlbox.preprocessing.Reader does not have an "n_jobs" parameter.

AxeldeRomblay commented 6 years ago

Hello @DarquesM! The problem is due to Windows... At the moment, what you can do is set n_jobs=1 in the source code, replacing n_jobs=-1 in these two lines:

    df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_list)(df[col]) for col in df.columns), axis=1)

    df = pd.concat(Parallel(n_jobs=-1)(delayed(convert_float_and_dates)(df[col]) for col in df.columns), axis=1)

Otherwise, I will soon release a new version with separate classes for reading and cleaning...

AxeldeRomblay commented 5 years ago

Hello, thanks for reporting this issue. I will close it since this will be fixed in an upcoming release (probably MLBox 0.7.1).