Whamp opened this issue 5 years ago.
How large is your dataset? I suspect the dask backend may have crashed after using up all the resources. I will double-check it next week.
I don't think size is the problem; the dataset is only about 2 GB. I think the issue may be the time series cross-validation I'm using. I split the DataFrame by date, with 1964-2012 in the training set and one test split for each month from 2012 to 2018, so roughly 72 monthly cross-validation splits. I assume that means each model needs to be retrained 72 times, which is expensive but necessary for my problem.
Here is my code for that:
```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import numpy as np


class TimeSeriesSplitMonthLag():
    def __init__(self, date_col='Date', init_trn_date=None, lag=12):
        """
        date_col: string name of the column in the dataframe containing dates
        init_trn_date: string in format mm/dd/yyyy for the first train split
        lag: number of months to lag the test set forward, to handle
            forward-lagged targets

        Example:
        >>> splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date="05/31/2012", lag=12)
        >>> for trn_idx, tst_idx in splitter.split(df):
        ...     print(f"trn index max: {trn_idx.max()}, month {df.loc[trn_idx, 'Date'].max()}")
        ...     print(f"tst index min: {tst_idx.min()}, month {df.loc[tst_idx, 'Date'].min()}")
        trn index max: 1344164, month 2018-02-28
        tst index min: 1339, month 2019-02-28
        trn index max: 1344262, month 2018-03-31
        tst index min: 1340, month 2019-03-31
        trn index max: 1344483, month 2018-04-30
        tst index min: 1341, month 2019-04-30
        """
        self.init_trn_date = pd.to_datetime(init_trn_date)
        self.lag = lag
        self.Date = date_col

    def split(self, df):
        """
        df: dataframe in single-index format
        """
        # The last training cutoff leaves `lag` months of data for the final test fold.
        max_tst_date = pd.to_datetime(df[self.Date].unique()).max()
        max_trn_date = max_tst_date - MonthEnd(self.lag)
        trn_date_range = pd.date_range(start=self.init_trn_date, end=max_trn_date, freq='M')
        for date in trn_date_range:
            # Train on everything up to the cutoff; test on the single month
            # `lag` months ahead of it.
            trn_idxs = df.loc[df[self.Date] <= date, :].index.values
            tst_idxs = df.loc[df[self.Date] == date + MonthEnd(self.lag), :].index.values
            yield (trn_idxs, tst_idxs)
```
```python
splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)

# passed as a parameter to tpot:
cv = list(splitter.split(X_trn))
```
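As a quick sanity check on the generated folds (a sketch; an empty fold is only a guessed failure mode, but with the 12-month lag a cutoff whose lagged month is absent from the data would produce an empty test fold, which can stall cross-validation silently):

```python
# Sketch: verify no fold is empty before handing cv to TPOT.
# (Assumption: an empty train or test fold is one plausible cause of a silent stall.)
for i, (trn_idx, tst_idx) in enumerate(cv):
    if len(trn_idx) == 0 or len(tst_idx) == 0:
        print(f"split {i} has an empty fold: train={len(trn_idx)}, test={len(tst_idx)}")
```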
I'm going to try running a small sample of the dataset with fewer monthly CV splits, just to test.
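As a sketch of what I mean, assuming slicing off the most recent months is acceptable:

```python
# Sketch: keep only the last 12 monthly splits for a quicker test run.
cv_small = list(splitter.split(X_trn))[-12:]
```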
Just to update: I tried running with only 12 monthly cross-validation splits, 1 generation, and a population of 2, and it had the same behavior. It never shows a completed pipeline, and CPU usage goes from heavy to nothing. I tried running with dask on and off with the same behavior.
After changing a few more parameters and setting n_jobs to 4 (from -2), I finally got an error message:
```python
model_pipe = tpot.TPOTRegressor(generations=1,
                                population_size=2,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=list(splitter.split(X_trn)),
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=24*60,
                                max_eval_time_mins=20,
                                random_state=42,
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=False,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=4,
                                disable_update_check=False)

model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values))
```
Related to issue #876.
It seems that there is a kind of threading deadlock issue (maybe related to this old issue in joblib). Could you please try updating joblib (>0.13.2) and scikit-learn (>=0.21) via conda or pip, and then reinstalling the TPOT development branch via the command below?
```sh
pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development
```
We recently noticed that the internal joblib module in scikit-learn (<0.20), which is based on an older version of joblib, was deprecated (see #867) and may cause the issue here, because it lacks some important updates about limiting the number of threads in joblib (>0.12; see the joblib change log). Let me know if this solution works or not.
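As an aside, one hedged way to rule out thread oversubscription is to cap the native thread pools before the numerical stack is imported. The environment variables below are standard BLAS/OpenMP knobs, not TPOT settings, and which of them applies depends on your BLAS build:

```python
# Sketch: limit BLAS/OpenMP thread pools before numpy/scikit-learn load,
# to rule out thread oversubscription as the source of the hang.
# (Assumption: the relevant variable depends on your BLAS build.)
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # import the numerical stack only after setting the caps
import tpot
```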
Unfortunately, I'm already running joblib 0.13.2, scikit-learn 0.21.1, and TPOT 0.10.1. Just in case dask is involved: dask 1.2.2, dask-core 1.2.2, dask-glm 0.1.0, dask-ml 0.13.0.
I'll create a new conda env with the dev branch and report back.
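For reference, a minimal sketch to confirm which versions the new environment actually picks up:

```python
# Print the versions TPOT will actually see in this environment.
import joblib
import sklearn
import tpot

print("joblib:", joblib.__version__)
print("scikit-learn:", sklearn.__version__)
print("TPOT:", tpot.__version__)
```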
TPOT (v0.10.1) only uses the joblib built into scikit-learn instead of standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module due to a recent merge (#867).
Oh, OK, perfect. I'll let you know once I've tested on the dev branch. And thank you for all your help troubleshooting, by the way.
OK, I re-ran in a new environment with the dev branch of TPOT and the behavior is unchanged. Is there a sample dataset that is known to work for TPOT regression with time series CV splits?
I guess I can always add a dummy column of dates to the Boston dataset just to get things running. Is there a base parameter configuration you would recommend for this?
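For what it's worth, a minimal sketch of that dummy-date idea (the 9-year monthly calendar is an arbitrary assumption, and load_boston is available in the scikit-learn versions discussed here):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

# Build a Boston frame with a synthetic month-end Date column so the
# custom splitter above has something to split on.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='target')

# Cycle 9 years of month-end dates over the rows, then sort by date so the
# frame looks like a panel ordered in time.
dates = pd.date_range('2010-01-31', periods=108, freq='M')
X['Date'] = np.resize(dates.values, len(X))
order = np.argsort(X['Date'].values, kind='stable')
X = X.iloc[order].reset_index(drop=True)
y = y.iloc[order].reset_index(drop=True)

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)
cv = list(splitter.split(X))
print(len(cv), "monthly splits")
```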
Your configuration above looks fine to me. Also, you may try config_dict="TPOT light" and/or n_jobs=4 for a quick test on your dataset or the Boston dataset, to check whether this is a resource problem:
```python
model_pipe = tpot.TPOTRegressor(generations=100,
                                population_size=100,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=list(splitter.split(X_trn)),
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=24*60,
                                max_eval_time_mins=20,
                                random_state=42,
                                config_dict="TPOT light",
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=True,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=3,
                                disable_update_check=False)
```
So I've been able to test the dev branch a little more. I was able to get TPOT to return exported pipelines with the standard TPOT config and these parameters. I think n_jobs=4 was an important change from n_jobs=-2, but it's tough to tell with the silent failures and no error message.
```python
model_pipe = tpot.TPOTRegressor(generations=1,
                                population_size=2,
                                offspring_size=None,
                                mutation_rate=0.9,
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error',
                                cv=trn_val_split_idxs,
                                subsample=1.0,
                                n_jobs=4,
                                max_time_mins=None,
                                max_eval_time_mins=5,
                                random_state=42,
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',
                                use_dask=True,
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                                early_stop=10,
                                verbosity=4,
                                disable_update_check=False)
```
But when I tried to run the TPOT light config, I got the same silent failure: TPOT thought it was running, but no jobs were running on the CPU.
Does the solution (https://github.com/EpistasisLab/tpot/issues/876#issuecomment-499626973) from another issue work for you?
That had no effect for me. It's never made it beyond 0% on the progress bar.
Possible user error since I'm a new TPOT user, but I've tried to run what I expected to be a large job, and I think the job has failed even though the kernel is still busy and I haven't received any errors. I think it failed because it went from occupying all CPU processes and nearly all the memory to none.
I've run the Boston training set with no problems, so I think my configuration is good.
Here is my code for the job I think failed. I'm running in a Jupyter notebook, and it has been running for about 4.5 hours.
I'm running TPOT 0.10.1, installed via conda.
Output from htop on Ubuntu 18.04:
I wanted to highlight this as I searched and couldn't find a similar issue.