EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOTClassifier training causes Jupyter kernel to die and segmentation fault when running from script. #745

Open lpatruno opened 6 years ago

lpatruno commented 6 years ago

Hi,

I'm running into several issues when fitting instances of TPOTClassifier.

When I run training in a Jupyter notebook, the kernel dies after a few rounds of training. The dataset is quite small (<1000 instances, <30 features). I also notice that the kernel dies more quickly when I increase the population_size and generations arguments beyond 20. I'm setting n_jobs=1 as I've read other people have this same issue when that parameter is anything but 1. Here is the call:

tpot = TPOTClassifier(random_state=888, n_jobs=1,
                      generations=20, population_size=20,
                      verbosity=2, scoring='roc_auc')
tpot.fit(X_train, y_train)

I've also run the same code as a Python script. This results in a segmentation fault each time I run the script.

I've run the script while having top open in another bash shell, and the memory consumption of the process does not exceed 1% of the overall available memory, so I don't think it's a memory issue.

I'm running this code in a kubernetes pod with the following resources:

   Requests:
      cpu:        1
      memory:     1G

Here is the version of Python:

jovyan@jupyter-pod:~/work$ python --version
Python 3.6.6

Also, here is the result of calling pip freeze:

alembic==0.9.9
appdirs==1.4.3
asn1crypto==0.24.0
async-generator==1.10
attrs==18.1.0
Automat==0.0.0
backcall==0.1.0
beautifulsoup4==4.6.1
bleach==2.1.3
bokeh==0.12.16
certifi==2018.4.16
cffi==1.11.5
chardet==3.0.4
cloudpickle==0.5.3
conda==4.5.8
constantly==15.1.0
cryptography==2.2.1
cycler==0.10.0
Cython==0.28.5
dask==0.18.2
deap==1.2.2
decorator==4.3.0
dill==0.2.8.2
entrypoints==0.2.3
fastcache==1.0.2
gmpy2==2.0.8
h5py==2.7.1
html5lib==1.0.1
hyperlink==17.3.1
idna==2.7
imageio==2.3.0
incremental==17.5.0
ipykernel==4.8.2
ipython==6.5.0
ipython-genutils==0.2.0
ipywidgets==7.2.1
jedi==0.12.1
Jinja2==2.10
jsonschema==2.6.0
jupyter-client==5.2.3
jupyter-core==4.4.0
jupyterhub==0.9.1
jupyterlab==0.33.7
jupyterlab-launcher==0.11.2
kiwisolver==1.0.1
llvmlite==0.23.0
Mako==1.0.7
MarkupSafe==1.0
matplotlib==2.2.2
mistune==0.8.3
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
notebook==5.6.0
numba==0.38.1
numexpr==2.6.6
numpy==1.13.3
olefile==0.45.1
packaging==17.1
pamela==0.3.0
pandas==0.23.4
pandocfilters==1.4.2
parso==0.3.1
patsy==0.5.0
pexpect==4.6.0
pickleshare==0.7.4
Pillow==5.2.0
prometheus-client==0.3.0
prompt-toolkit==1.0.15
protobuf==3.5.2
ptyprocess==0.6.0
pyasn1==0.4.4
pyasn1-modules==0.2.1
pycosat==0.6.3
pycparser==2.18
pycurl==7.43.0.2
Pygments==2.2.0
pyOpenSSL==18.0.0
pyparsing==2.2.0
PySocks==1.6.8
python-dateutil==2.7.3
python-editor==1.0.3
python-oauth2==1.0.1
pytz==2018.5
PyWavelets==0.5.2
PyYAML==3.12
pyzmq==17.1.0
requests==2.19.1
rpy2==2.8.5
ruamel-yaml==0.15.44
scikit-image==0.14.0
scikit-learn==0.19.2
scipy==1.1.0
seaborn==0.9.0
Send2Trash==1.5.0
service-identity==17.0.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
SQLAlchemy==1.2.10
statsmodels==0.9.0
stopit==1.1.2
sympy==1.1.1
terminado==0.8.1
testpath==0.3.1
toolz==0.9.0
tornado==5.1
TPOT==0.9.3
tqdm==4.25.0
traitlets==4.3.2
Twisted==18.7.0
update-checker==0.16
urllib3==1.23
vincent==0.4.4
wcwidth==0.1.7
webencodings==0.5
widgetsnbextension==3.2.1
xgboost==0.80
xlrd==1.1.0
zope.interface==4.5.0

If there are any other tests I can perform to help debug, let me know!

weixuanfu commented 6 years ago

Hmm, that is weird. Could you please share the dataset, or code to generate a simulated dataset, so we can reproduce this issue?

lpatruno commented 6 years ago

Unfortunately, I cannot share the dataset as it is proprietary. However, here are some summary stats from one of the training sets (a synthetic stand-in with a similar shape is sketched after the table):

    count   mean    std min 25% 50% 75% max
col_0   410.0   0.314634    0.464937    0.000000    0.000000    0.000000    1.000000    1.000000
col_1   410.0   85.010912   52.331006   26.908183   49.232402   68.898860   100.652436  456.152245
col_2   410.0   68.839790   49.453209   8.106944    38.153472   52.422222   77.461285   422.319444
col_3   410.0   123.073782  73.341779   34.219294   72.132352   102.015683  156.204155  521.180868
col_4   410.0   16.171122   20.801288   0.013542    4.021528    10.857558   21.449815   229.918414
col_5   410.0   54.233992   55.214856   2.070602    20.907862   35.166267   65.538912   367.913113
col_6   410.0   38.062870   51.508699   0.011088    7.069893    20.134896   45.943032   353.163113
col_7   178.0   3.629213    2.958783    1.000000    1.250000    3.000000    5.000000    17.000000
col_8   410.0   1.621951    1.363516    0.000000    1.000000    1.000000    2.000000    9.000000
col_9   410.0   0.097561    0.297083    0.000000    0.000000    0.000000    0.000000    1.000000
col_10  410.0   0.546341    0.498456    0.000000    0.000000    1.000000    1.000000    1.000000
col_11  410.0   0.348780    0.477167    0.000000    0.000000    0.000000    1.000000    1.000000
col_12  410.0   0.014634    0.120230    0.000000    0.000000    0.000000    0.000000    1.000000
col_13  410.0   0.034146    0.181827    0.000000    0.000000    0.000000    0.000000    1.000000
col_14  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_15  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_16  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_17  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_18  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_19  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_20  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_21  410.0   0.036585    0.187971    0.000000    0.000000    0.000000    0.000000    1.000000
col_22  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_23  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_24  410.0   0.012195    0.109890    0.000000    0.000000    0.000000    0.000000    1.000000
col_25  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_26  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_27  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_28  410.0   0.039024    0.193890    0.000000    0.000000    0.000000    0.000000    1.000000
col_29  410.0   0.029268    0.168764    0.000000    0.000000    0.000000    0.000000    1.000000
col_30  410.0   0.009756    0.098410    0.000000    0.000000    0.000000    0.000000    1.000000
col_31  410.0   0.009756    0.098410    0.000000    0.000000    0.000000    0.000000    1.000000
col_32  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_33  410.0   0.248780    0.432834    0.000000    0.000000    0.000000    0.000000    1.000000
col_34  410.0   0.109756    0.312967    0.000000    0.000000    0.000000    0.000000    1.000000
col_35  410.0   0.009756    0.098410    0.000000    0.000000    0.000000    0.000000    1.000000
col_36  410.0   0.007317    0.085330    0.000000    0.000000    0.000000    0.000000    1.000000
col_37  410.0   0.017073    0.129702    0.000000    0.000000    0.000000    0.000000    1.000000
col_38  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_39  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_40  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_41  410.0   0.024390    0.154446    0.000000    0.000000    0.000000    0.000000    1.000000
col_42  410.0   0.097561    0.297083    0.000000    0.000000    0.000000    0.000000    1.000000
col_43  410.0   0.004878    0.069758    0.000000    0.000000    0.000000    0.000000    1.000000
col_44  410.0   0.002439    0.049386    0.000000    0.000000    0.000000    0.000000    1.000000
col_45  410.0   0.397561    0.489992    0.000000    0.000000    0.000000    1.000000    1.000000
col_46  410.0   0.014634    0.120230    0.000000    0.000000    0.000000    0.000000    1.000000
col_47  410.0   0.397561    0.489992    0.000000    0.000000    0.000000    1.000000    1.000000
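
For anyone trying to reproduce the issue without the proprietary data, here is a hedged sketch of a synthetic stand-in with roughly the same shape as the table above (410 samples, 48 features, binary target), using scikit-learn's make_classification; the values are of course not the real data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same shape as the summary stats above (410 x 48),
# not the actual proprietary dataset.
X, y = make_classification(n_samples=410, n_features=48, n_informative=10,
                           random_state=888)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=888)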
g-vega-cl commented 6 years ago

I have a similar issue. I'm working with Anaconda and the Spyder IDE; after TPOT runs a few generations I get a message saying that the kernel died (screenshot: 2018_08_22-spyderkerneldiedforgithub). I have tried this with PyCharm and the same happens, although PyCharm returns the following error: Process finished with exit code -1073741819 (0xC0000005). The environment I'm working in is shown in the attached screenshot (2018_08_22-pythonversionspyderforgithub).

I have worked with large (>10'000k datapoints) and small (<1000k datapoints) datasets. I'm attaching a copy of part of the dataset: 2018_08_22-XSLX_SnipForGithub.xlsx

Below is my code:

import numpy as np
import pandas as pd
import os
from tpot import TPOTRegressor

dataset = pd.read_csv(r'2018_08_22-XSLX_SnipForGithub.csv', index_col=0)

values = dataset
values = values.values

n_train_hours = 4000
train = values[:n_train_hours, ]
test = values[n_train_hours:, ]
train_X, train_y = train[:, 1:], train[:, 0]
test_X, test_y = test[:, 1:], test[:, 0]
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

tpot = TPOTRegressor(scoring='neg_mean_absolute_error',
                     max_time_mins=100,
                     n_jobs=1,  # Sometimes I run it with -1 and it also crashes.
                     verbosity=2,
                     cv=5,
                     warm_start=True,
                     periodic_checkpoint_folder='C:/mydir')

tpot.fit(train_X, train_y)
tpot.export('2018_08_22-tpot_exported_pipeline.py')

Note: I ran the same process on another PC and it lasted longer, but it still crashed after 12 hours or so. I have also run the program on Google Colab and Kaggle, and it does not seem to crash there.
Note 2: I do not have admin rights on the PCs I am using; maybe that matters.
Thank you, and sorry if this has already been resolved; I did not find the answer.

weixuanfu commented 6 years ago

@lpatruno I still suspect the dataset may need more memory (~2 GB), since some pipelines (especially those using PolynomialFeatures, which can double the number of features in intermediate steps) require more memory, and that may cause a crash due to running out of memory. The top command has a refresh rate (maybe 2-3 seconds), so it may not accurately capture the maximum memory usage during optimization.
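
If it helps, a more reliable check than watching top is to ask the OS for the peak resident set size after the run; here is a minimal sketch, assuming Linux and n_jobs=1 so that TPOT runs entirely in the calling process:

import resource

def peak_memory_mb():
    # ru_maxrss is the peak resident set size of this process so far;
    # it is reported in kilobytes on Linux (bytes on macOS), so it is not
    # subject to top's refresh rate.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# ... run tpot.fit(X_train, y_train) here ...
print(f"Peak memory so far: {peak_memory_mb():.1f} MB")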

Could you please try running the dataset on a machine with more memory, or use the TPOT light configuration via config_dict='TPOT light'?
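
For reference, a minimal sketch of that suggestion, reusing the settings from the original report (X_train/y_train are the reporter's data and are not defined here):

from tpot import TPOTClassifier

tpot = TPOTClassifier(random_state=888, n_jobs=1,
                      generations=20, population_size=20,
                      verbosity=2, scoring='roc_auc',
                      config_dict='TPOT light')  # restrict the search to simpler, lower-memory operators
# tpot.fit(X_train, y_train)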

@g-vega-cl Hmm, it seems the crash only happens on your Windows PCs but not on Linux (Colab and Kaggle) in your case, right?

g-vega-cl commented 6 years ago

@weixuanfu Yes indeed, Windows 10

lpatruno commented 6 years ago

@weixuanfu Thanks for your tips. I will rerun with those parameters and report back.

lpatruno commented 6 years ago

@weixuanfu Running with config_dict='TPOT light' allows the script to complete. I am rerunning now with a larger number of generations and population_size. Thanks for the help.

aamirg commented 6 years ago

I am facing the same issue. I have tried running it on Windows 7 and on Ubuntu 16.04 (an AWS EC2 c5.4xlarge instance with 32 GB RAM). I have always used n_jobs=1. On my Windows machine it runs successfully, but when I use the Spyder IDE on Ubuntu it unexpectedly quits, saying the kernel unexpectedly stopped. I tried running it through the terminal and got a 'Segmentation fault'.

My dataset size is not that huge either. It's around 335 samples × 80 features.

david-hoffman commented 5 years ago

@weixuanfu I'm having this issue too; I'm fairly certain I didn't run out of memory.

mikkelam commented 5 years ago

Also happening for me, with only a sparse dataset (288176×28) and 128 GB of memory.

mcmchammer commented 5 years ago

I'm facing the same problem - running on Alpine 3.9 with 8 GB of memory (which should suffice given the data) and n_jobs=1. Any new insights?

david-hoffman commented 5 years ago

I've found that if I use the dask backend (use_dask=True) then everything runs smoothly.
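
For reference, a minimal sketch of that workaround; it assumes a TPOT version new enough to support use_dask (newer than the 0.9.3 listed above) and that dask and dask-ml are installed:

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=20, population_size=20,
                      verbosity=2, use_dask=True)  # evaluate pipelines through Dask instead of the default joblib backend
# tpot.fit(X_train, y_train)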

ardunn commented 4 years ago

Also happening for me, regardless of whether it is in a Jupyter kernel or not.