Process was killed - Githubissues

pedrogarciafreitas commented 3 years ago

Hi,

I tried to run the following code

tpot = TPOTRegressor(generations=500, population_size=50, verbosity=2, n_jobs=16, use_dask=True)
tpot.fit(X_train, y_train)
tpot.export("tpot_{db}.py".format(db=db))
print(db, tpot.score(X_train, y_train), tpot.fitted_pipeline_)

but after a few generations the script was killed by the system (literally, "Killed" is the only message shown in the console).

Running dmesg, I got:

[169102.053126] Out of memory: Killed process 446321 (python) total-vm:194467732kB, anon-rss:107278388kB, file-rss:0kB, shmem-rss:4kB, UID:1000 pgtables:225528kB oom_score_adj:0
[169105.064216] oom_reaper: reaped process 446321 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
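For readers who want the sizes in that kernel message in friendlier units, here is a minimal sketch that converts every `name:<digits>kB` field to GiB. The helper name `parse_oom_sizes` is made up for illustration, and the regular expression only assumes the field layout visible in the quoted line; other kernel versions may format the message differently.

```python
import re

# The kernel message quoted above, copied verbatim from the dmesg output.
OOM_LINE = (
    "[169102.053126] Out of memory: Killed process 446321 (python) "
    "total-vm:194467732kB, anon-rss:107278388kB, file-rss:0kB, "
    "shmem-rss:4kB, UID:1000 pgtables:225528kB oom_score_adj:0"
)


def parse_oom_sizes(line):
    """Return {field: size in GiB} for every 'name:<digits>kB' pair in the line.

    Hypothetical helper for illustration; not part of TPOT or the kernel tools.
    """
    pairs = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / 1024**2 for name, kb in pairs}


sizes = parse_oom_sizes(OOM_LINE)
for name, gib in sizes.items():
    print(f"{name}: {gib:.1f} GiB")
```

Read this way, the Python process held roughly 102 GiB of anonymous resident memory (anon-rss) when the kernel killed it.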

The dataset is small (232 rows and 108 columns). A similar program in AutoSklearn consumes about 18 GB of RAM. Is it possible that there is a memory leak?

I'm using Python 3.7.9 and TPOT 0.11.7. I also tried n_jobs=1 and use_dask=False, but the error remains the same.

weixuanfu commented 3 years ago

Hmm, it is strange that such a small dataset caused a memory leak, and it is also unusual that AutoSklearn needed 18 GB. Could you please provide more information to reproduce this issue?
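One piece of information that might help with a reproduction is how the process's memory grows over the run. The following is only a sketch using the standard-library `resource` module (Unix only); `peak_rss_gib` is a hypothetical helper name, and the conversion assumes Linux, where `ru_maxrss` is reported in kilobytes. It measures the current process only, so it is most meaningful with n_jobs=1.

```python
import resource


def peak_rss_gib():
    """Peak resident set size of this process in GiB.

    Assumes Linux, where ru_maxrss is reported in kilobytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2


print(f"peak RSS before fit: {peak_rss_gib():.2f} GiB")
# tpot.fit(X_train, y_train)  # the call under investigation, left commented out here
print(f"peak RSS after fit:  {peak_rss_gib():.2f} GiB")
```

Logging this value at intervals, rather than only before and after, would show whether memory climbs steadily across generations or jumps on a particular pipeline.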

pedrogarciafreitas commented 3 years ago

Hi,

Using the following dataset: reference_APSIPA.zip

The following script:

from tpot import TPOTRegressor
import pandas as pd
import numpy as np

df = pd.read_csv('reference_APSIPA.zip')
features = [col for col in df if col.startswith('d_')]
X_train, y_train = df[features].values, df.SCORE.values
tpot = TPOTRegressor(generations=500, population_size=50,
                     verbosity=2, n_jobs=16, use_dask=True)
tpot.fit(X_train, y_train)

Crashes after 2350 iterations:

$ python tpot_from_dataset.py 

Generation 1 - Current best internal CV score: -0.4186788503982938

Generation 2 - Current best internal CV score: -0.4186788503982938

Generation 3 - Current best internal CV score: -0.4186788503982938

Generation 4 - Current best internal CV score: -0.4186788503982938

Generation 5 - Current best internal CV score: -0.4186788503982938

Generation 6 - Current best internal CV score: -0.40041785046702455

Generation 7 - Current best internal CV score: -0.36806708236239455

Generation 8 - Current best internal CV score: -0.360919010590871

Generation 9 - Current best internal CV score: -0.360919010590871

Generation 10 - Current best internal CV score: -0.360919010590871

Generation 11 - Current best internal CV score: -0.360919010590871

Generation 12 - Current best internal CV score: -0.360919010590871

Generation 13 - Current best internal CV score: -0.360919010590871

Generation 14 - Current best internal CV score: -0.360919010590871

Generation 15 - Current best internal CV score: -0.360919010590871

Generation 16 - Current best internal CV score: -0.360919010590871

Generation 17 - Current best internal CV score: -0.360919010590871

Generation 18 - Current best internal CV score: -0.360919010590871

Generation 19 - Current best internal CV score: -0.35142348326471307

Generation 20 - Current best internal CV score: -0.34095032052401014

Generation 21 - Current best internal CV score: -0.34095032052401014

Generation 22 - Current best internal CV score: -0.34095032052401014

Generation 23 - Current best internal CV score: -0.34095032052401014

Generation 24 - Current best internal CV score: -0.34004340266687294

Generation 25 - Current best internal CV score: -0.3299944198542487

Generation 26 - Current best internal CV score: -0.3299944198542487

Generation 27 - Current best internal CV score: -0.3299944198542487

Generation 28 - Current best internal CV score: -0.3299944198542487

Generation 29 - Current best internal CV score: -0.3299944198542487

Generation 30 - Current best internal CV score: -0.3299944198542487

Generation 31 - Current best internal CV score: -0.32862536127972203

Generation 32 - Current best internal CV score: -0.32862536127972203

Generation 33 - Current best internal CV score: -0.32862536127972203

Generation 34 - Current best internal CV score: -0.32862536127972203

Generation 35 - Current best internal CV score: -0.3242168683350396

Generation 36 - Current best internal CV score: -0.3242168683350396

Generation 37 - Current best internal CV score: -0.3242168683350396

Generation 38 - Current best internal CV score: -0.3242168683350396

Generation 39 - Current best internal CV score: -0.3242168683350396

Generation 40 - Current best internal CV score: -0.3242168683350396

Generation 41 - Current best internal CV score: -0.3242168683350396

Generation 42 - Current best internal CV score: -0.3242168683350396

Generation 43 - Current best internal CV score: -0.3242168683350396

Generation 44 - Current best internal CV score: -0.3242168683350396

Generation 45 - Current best internal CV score: -0.3242168683350396

Generation 46 - Current best internal CV score: -0.3242168683350396
Optimization Progress:   9%|█████████▋                                                                                             | 2350/25050 [5:28:59<27:53:07,  4.42s/pipeline]
Killed
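The counts in the progress bar can be checked against the settings in the script. TPOT evaluates population_size + generations × offspring_size pipelines in total, and offspring_size defaults to population_size. The short sketch below (the function name `total_pipelines` is made up for illustration) reproduces both numbers:

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Number of pipelines TPOT evaluates for the given settings.

    offspring_size defaults to population_size, as in TPOT.
    """
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


total = total_pipelines(generations=500, population_size=50)
done = total_pipelines(generations=46, population_size=50)
print(total, done)
```

So 25050 is the planned total for 500 generations, and 2350 corresponds to the initial population plus 46 completed generations, which matches the last generation printed before the process was killed.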

JDRomano2 commented 3 years ago

Thanks @pedrogarciafreitas , I'm going to try to replicate the issue.

Would you mind providing your OS and OS version please?

pedrogarciafreitas commented 3 years ago

Hi @JDRomano2

I'm using Ubuntu 20.04.1 LTS (kernel 5.4.45-050445-generic #202006070831) with conda 4.9.2, NumPy 1.19.2, SciPy 1.5.2, pandas 1.2.0, XGBoost 1.3.0, joblib 1.0.0, TPOT 0.11.7, and scikit-learn 0.23.2.

JDRomano2 commented 3 years ago

I can't replicate the issue on macOS 11, so I suspect it's an OS- or kernel-related memory leak. I'll see if I encounter the same bug on Ubuntu 20.04, which you are using, and report back.

thepr0blem commented 3 years ago

Hi @JDRomano2, any updates here? I am having the same issue on Ubuntu 20.04.

BJOERNTONN commented 3 years ago

I have the same problem, and I am on Ubuntu 20.04.3 LTS too. It happened after about 10 hours of the job running, working as root.

Optimization Progress: 11%|████████████▉ | 336/3000 [9:19:43<94:12:09, 127.30s/pipeline]
Killed

Greetz Bjoern

EpistasisLab / tpot

Process was killed #1162