EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.76k stars · 1.57k forks

use_dask=True and manual import method not working #764

Closed · GinoWoz1 closed this issue 6 years ago

GinoWoz1 commented 6 years ago


I cannot use multiple cores and therefore my jobs are running extremely slow.

Context of the issue

In 0.9.4, a fix was put in to allow use_dask=True or a manual import. Both methods return the error:

" File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 684, in fit self._update_top_pipeline()

File "C:\Users\jstnjc\Anaconda3\lib\site-packages\tpot\base.py", line 758, in _update_top_pipeline raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')

RuntimeError: A pipeline has not yet been optimized. Please call fit() first."

Process to reproduce the issue

(I've tested this on 3 different computers, including a cloud service.)

Install Anaconda 3.6 for Windows 64-bit, then:

pip install missingno
pip install these .whl files manually (needed for fancyimpute):
    ecos-2.0.5-cp36-cp36m-win_amd64.whl
    cvxpy-1.0.8-cp36-cp36m-win_amd64.whl
pip install fancyimpute
pip install rfpimp (used for my custom functions import file)
conda install py-xgboost
pip install tpot
pip install msgpack
pip install dask[delayed] dask-ml


With the above, execute the code below:

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
import math
warnings.filterwarnings('ignore')

url = 'https://github.com/GinoWoz1/AdvancedHousePrices/raw/master/'

X_train = pd.read_csv(url + 'train_tpot_issue.csv')
y_train = pd.read_csv(url + 'y_train_tpot_issue.csv', header=None)

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i, pred in enumerate(y_pred)]
    except:
        return float('inf')
    if not (y_true >= 0).all() and not (y_pred >= 0).all():
        raise ValueError("Mean Squared Logarithmic Error cannot be used when "
                         "targets contain negative values.")
    return (sum(terms_to_sum) * (1.0 / len(y_true))) ** 0.5

rmsle_loss = make_scorer(rmsle_loss, greater_is_better=False)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train[0])

Expected result

Expect the process to run and to use all cores.
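
(An aside, not in the original report: with use_dask=True, TPOT hands pipeline evaluations to whatever Dask scheduler is active, so one way to verify that all cores are engaged is to start a distributed client first and watch its dashboard. A minimal sketch; worker counts are illustrative:)

from dask.distributed import Client

# Start a local cluster before calling tpot.fit(); TPOT's evaluations
# will then be scheduled on these workers.
client = Client(n_workers=4, threads_per_worker=1)
print(client.dashboard_link)  # open this URL to watch per-worker CPU usage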

Current result

[screenshot of the error]

weixuanfu commented 6 years ago

Hmm, I tested this code in a fresh conda environment and the error was not reproduced. But I used an easier way to install fancyimpute, as in the commands below. Could you please build a conda environment for a test?

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml
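
(A quick way to confirm the environment resolved correctly; this is a suggested extra check, not part of the original instructions:)

# Verify that every package imports and print the versions that matter here
import fancyimpute, xgboost, tpot, dask, dask_ml
print(tpot.__version__, dask.__version__)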

Another suggestion about the custom scorer in your code: it may be more stable if the function returns inf rather than raising a ValueError, as in the example below:

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    except:
        return float('inf')
    if not (y_true >= 0).all() and not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0/len(y_true))) ** 0.5
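
(A quick sanity check of this version's behavior, assuming the definition above and its import math are in scope; values are illustrative:)

import numpy as np

y = np.array([1.0, 2.0, 3.0])
print(rmsle_loss(y, y))   # perfect predictions -> 0.0
print(rmsle_loss(y, -y))  # log of a non-positive value -> inf instead of an exception
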
GinoWoz1 commented 6 years ago

Thanks Weixuan. Quick question: how do I run the Python script from the conda environment? I am just used to opening the script on my desktop and running it there.

GinoWoz1 commented 6 years ago

Never mind on the Python script question; I was able to set it up on my laptop.

Any idea why this install process breaks the verbosity argument? Everything else seems to be working fine. Thanks a ton for your help.

Sincerely, Justin

weixuanfu commented 6 years ago

You're welcome. Do you mean there is no confirmation prompt during installation of packages via conda? If so, the -y flag in the commands is for that purpose.

GinoWoz1 commented 6 years ago

The progress bar doesn't show up.

weixuanfu commented 6 years ago

Hmm, I think the progress bar is just hard to spot among the many warning messages when use_dask=True, but it did show up in my test (stdout below).

We need to refine this warning-message behavior when use_dask=True.

 **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
D:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py:547: UserWarning: Multiprocessing-backed parallel loops cannot be nested below threads, setting n_jobs=1
  **self._backend_args)
Generation 1 - Current best internal CV score: -5.969518794583038e-15
Optimization Progress:   4%|█▉                                                | 101/2550 [01:08<51:22,  1.26s/pipeline]
GinoWoz1 commented 6 years ago

Thanks, no problem. I can live without it for now, as long as the periodic checkpoints are being saved. You can close this. Thanks again!
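
(For reference, the checkpoints mentioned here are controlled by TPOT's periodic_checkpoint_folder argument; a minimal sketch extending the call from the report, with an illustrative folder name:)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True,
                     periodic_checkpoint_folder='tpot_checkpoints')  # best pipeline saved here each generation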

GinoWoz1 commented 6 years ago

Hmm, after the first generation the same error came up in the virtual environment. Were you able to finish one generation and save a pipeline? I did exactly as you suggested with the virtual env.

[screenshot of the error]

weixuanfu commented 6 years ago

Hmm, did you also update rmsle_loss in your code? Could you please provide a random_state so I can reproduce the issue?

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    except:
        return float('inf')
    if not (y_true >= 0).all() and not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0/len(y_true))) ** 0.5
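
(For example, fixing the seed makes the failing run reproducible; it extends the call from the report, and 42 is an arbitrary choice:)

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True,
                     random_state=42)
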
GinoWoz1 commented 6 years ago

Thanks, I did. Sorry for the bother; it looks like a user error on my side with my virtual environment. I really hate to inconvenience you. I am going to do an overview of TPOT soon for some individuals in my area at a meetup, so this will help greatly! I'll make sure to give a shout-out to you and your team.

Sincerely, Justin


GinoWoz1 commented 6 years ago

Hey Weixuan,

With the exact same setup, I am now getting the error below. Any idea? I am unable to get TPOT to finish a single run.

[screenshot of the error]

conda create -n test_env python=3.6
activate test_env
pip install missingno
conda install -y -c anaconda ecos
conda install -y -c conda-forge lapack
conda install -y -c cvxgrp cvxpy
conda install -y -c cimcb fancyimpute
pip install rfpimp
conda install -y py-xgboost
pip install tpot msgpack dask[delayed] dask-ml

weixuanfu commented 6 years ago

Hmm, it seems like an xgboost API issue. I tried to reproduce this issue via the demo below, but the error didn't show up. I recently updated xgboost to 0.80 via conda install -c anaconda py-xgboost; maybe updating xgboost will help.

from sklearn.metrics import make_scorer
from tpot import TPOTRegressor
import warnings
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import math
warnings.filterwarnings('ignore')
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)

def rmsle_loss(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    try:
        terms_to_sum = [(math.log(y_pred[i] + 1) - math.log(y_true[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    except:
        return float('inf')
    if not (y_true >= 0).all() and not (y_pred >= 0).all():
        return float('inf')
    return (sum(terms_to_sum) * (1.0/len(y_true))) ** 0.5

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)
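
(To confirm which xgboost build an environment actually picked up, a quick check:)

import xgboost
print(xgboost.__version__)  # the test above used 0.80
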
GuillaumeLab commented 4 years ago

I got the same issue. I can't use a conda environment. Whenever I use use_dask=True, I get the following error:

tpot = TPOTRegressor(verbosity=3, scoring=rmsle_loss, generations=50, population_size=50,
                     offspring_size=50, max_eval_time_mins=10, warm_start=True, use_dask=True)
tpot.fit(X_train, y_train)

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

I have tried on an Azure Databricks cluster as well as on my local machine.

weixuanfu commented 4 years ago

@GuillaumeLab, which version of dask is installed in your environment?
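
(For example:)

import dask
print(dask.__version__)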

GuillaumeLab commented 4 years ago

dask 2.24.0.

Thanks for your answer. I also get another error message: distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting

I checked this thread: https://github.com/dask/distributed/issues/2297, and it does not really help solve the issue. TPOT works fine on a single device with no memory issue. Why would distributing it across several devices cause a memory issue?
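
(A note on that warning: each Dask worker gets its own memory budget, by default a fraction of the machine's total, so a workload that fits comfortably in one process can still exceed a single worker's limit. One common mitigation is to set an explicit per-worker limit when creating the client; a minimal sketch, values illustrative:)

from dask.distributed import Client

# Fewer workers, each with a larger memory budget
client = Client(n_workers=2, threads_per_worker=1, memory_limit="8GB")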