EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.58k stars 1.55k forks

TPOTRegressor Thinks It's Running, HTOP Disagrees #875

Open Whamp opened 5 years ago

Whamp commented 5 years ago

This may be user error since I'm a new TPOT user, but I've tried to run what I expected to be a large job and I think it has failed: the kernel is still busy and I haven't received any errors, yet the job went from occupying every CPU core and nearly all the memory to none.

I've run the Boston training set with no problems, so I think my configuration is good.

Here is my code for the job I think failed. I'm running it in a Jupyter notebook, and it's been running for about 4.5 hours.

I'm running TPOT 0.10.1, installed via conda.

model_pipe = tpot.TPOTRegressor(generations=100, 
                   population_size=100,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=-2,                        
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                  
                   config_dict=None,
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                    
                   use_dask=True,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                   early_stop=10,                    
                   verbosity=3,                      
                   disable_update_check=False)

model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values))


Output from htop on Ubuntu 18.04: (screenshot)

I wanted to highlight this as I searched and couldn't find a similar issue.

weixuanfu commented 5 years ago

How large is your dataset? I suspect the dask backend may have crashed somehow after using all the resources. I will double-check it next week.

Whamp commented 5 years ago

I don't think size is the problem; the dataset is only about 2 GB. I think the issue may be the time-series cross-validation I'm using. I'm splitting the DataFrame by date, with 1964 to 2012 in the training set and a separate split for each month from 2012 to 2018, so roughly 72 monthly cross-validation splits. I assume this means each model needs to be retrained 72 times, which is expensive but necessary for my problem.

Here is my code for that:

import pandas as pd
from pandas.tseries.offsets import MonthEnd
import numpy as np

class TimeSeriesSplitMonthLag():
    def __init__(self, date_col='Date', init_trn_date=None, lag=12):
        """
        date_col: string name of the column in the dataframe containing dates
        init_trn_date: string in format mm/dd/yyyy for the first train split
        lag: number of months to lag the test set forward, to handle forward-lagged targets

        Example:
        >>> splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date="05/31/2012", lag=12)
        >>> for trn_idx, tst_idx in splitter.split(df):
        ...    print(f"trn index max: {trn_idx.max()}, month {df.loc[trn_idx,'Date'].max()}")
        ...    print(f"tst index min: {tst_idx.min()}, month {df.loc[tst_idx,'Date'].min()}")
        trn index max: 1344164, month 2018-02-28 
        tst index min: 1339,    month 2019-02-28 
        trn index max: 1344262, month 2018-03-31 
        tst index min: 1340,    month 2019-03-31 
        trn index max: 1344483, month 2018-04-30 
        tst index min: 1341,    month 2019-04-30 

        """
        self.init_trn_date = pd.to_datetime(init_trn_date)
        self.lag = lag
        self.Date = date_col

    def split(self, df):
        """
        df: dataframe in single index format
        """
        max_tst_date = pd.to_datetime(df[self.Date].unique()).max()
        max_trn_date = max_tst_date - MonthEnd(self.lag)
        trn_date_range = pd.date_range(start=self.init_trn_date, end=max_trn_date, freq='M')

        for date in trn_date_range:
            trn_idxs = df.loc[df[self.Date] <= date,:].index.values
            tst_idxs = df.loc[df[self.Date] == date + MonthEnd(self.lag),:].index.values
            yield (trn_idxs, tst_idxs)

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)

# passed as a parameter to tpot:
cv = list(splitter.split(X_trn))

I'm going to try running a small sample of the dataset with fewer cv monthly splits just to test.
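As a sanity check before handing the precomputed splits to TPOT, the splitter can be exercised on a small synthetic frame (this is a standalone sketch: the synthetic date range and column names are illustrative, and a trimmed copy of the class above is repeated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

class TimeSeriesSplitMonthLag:
    """Trimmed copy of the splitter above, repeated so this snippet runs standalone."""
    def __init__(self, date_col='Date', init_trn_date=None, lag=12):
        self.init_trn_date = pd.to_datetime(init_trn_date)
        self.lag = lag
        self.Date = date_col

    def split(self, df):
        max_tst_date = pd.to_datetime(df[self.Date].unique()).max()
        max_trn_date = max_tst_date - MonthEnd(self.lag)
        for date in pd.date_range(start=self.init_trn_date, end=max_trn_date, freq='M'):
            trn_idxs = df.loc[df[self.Date] <= date].index.values
            tst_idxs = df.loc[df[self.Date] == date + MonthEnd(self.lag)].index.values
            yield trn_idxs, tst_idxs

# Synthetic month-end data, one row per month, 2010 through 2019 (illustrative only)
dates = pd.date_range('2010-01-31', '2019-12-31', freq='M')
df = pd.DataFrame({'Date': dates, 'x': np.arange(len(dates))})

splitter = TimeSeriesSplitMonthLag(date_col='Date', init_trn_date='05/31/2012', lag=12)
splits = list(splitter.split(df))

print(f'number of splits: {len(splits)}')  # 80 monthly splits for this range
for trn_idx, tst_idx in splits:
    # every test month should sit exactly `lag` month-ends after the last train month
    assert df.loc[trn_idx, 'Date'].max() + MonthEnd(12) == df.loc[tst_idx, 'Date'].min()
```

Counting `len(splits)` up front also makes the cost explicit: each candidate pipeline is refit once per split.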

Whamp commented 5 years ago

Just to update: I tried running with only 12 monthly cross-validation splits, 1 generation, and a population of 2, and it showed the same behavior. It never shows a completed pipeline, and CPU usage goes from heavy to nothing. I tried running with dask on and off, with the same behavior.

Whamp commented 5 years ago

Changing a few more parameters and setting n_jobs to 4 from -2, I finally got an error message:

model_pipe = tpot.TPOTRegressor(generations=1, 
                   population_size=2,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=4,                       
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                 
                   config_dict=None,
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                   
                   use_dask=False,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp', 
                   early_stop=10,                    
                   verbosity=4,                      
                   disable_update_check=False)

model_pipe.fit(X_trn.drop(columns='Date'), y_trn.drop(columns='Date').values)
print(model_pipe.score(X_tst.drop(columns='Date'), y_tst.drop(columns='Date').values))

(screenshots of the error traceback)

weixuanfu commented 5 years ago

Related to issue #876.

It seems there is some kind of threading deadlock (maybe related to this old issue in joblib). Could you please try updating joblib (>0.13.2) and scikit-learn (>=0.21) via conda or pip, and reinstalling the TPOT development branch via the command below?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

We recently noticed that the internal joblib module (based on an older version of joblib) in scikit-learn (<0.20) was deprecated (see #867) and may cause the issue here, because it did not have some important updates about limiting the number of threads in joblib (>0.12; see the joblib change log). Let me know if this solution works or not.
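A quick way to confirm the versions this fix depends on (a minimal sketch using only the standard library, so it works whether or not the packages are installed):

```python
from importlib import metadata

# Look up the installed versions of the packages mentioned above
# (joblib > 0.13.2, scikit-learn >= 0.21, and TPOT itself);
# fall back gracefully if a package is missing from the environment.
versions = {}
for pkg in ('joblib', 'scikit-learn', 'TPOT'):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = 'not installed'

print(versions)
```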

Whamp commented 5 years ago

Unfortunately, I'm already running:

joblib = 0.13.2
scikit-learn = 0.21.1
TPOT = 0.10.1

Just in case dask is involved:

dask = 1.2.2
dask-core = 1.2.2
dask-glm = 0.1.0
dask-ml = 0.13.0

I'll create a new conda env with the dev branch and report back.

weixuanfu commented 5 years ago

TPOT (v0.10.1) only uses the joblib bundled inside scikit-learn instead of the standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module, due to the recently merged #867.

Whamp commented 5 years ago

> TPOT (v0.10.1) only uses the joblib bundled inside scikit-learn instead of the standalone joblib (0.13). The development branch of TPOT should use the standalone joblib module, due to the recently merged #867.

Oh OK, perfect. I'll let you know once I've tested on the dev branch. And thank you for all your help troubleshooting, by the way.

Whamp commented 5 years ago

OK, I re-ran in a new environment with the dev branch of TPOT and the behavior is unchanged. Is there a sample dataset that is known to work for TPOT regression with time-series CV splits?

I guess I can always add a dummy column of dates to the Boston dataset just to get things running. Is there a base parameter configuration you would recommend for this?

weixuanfu commented 5 years ago

Your configuration above looks fine to me. You may also try config_dict="TPOT light" and/or n_jobs=4 for a quick test on your dataset or the Boston dataset, to check whether this is a resource problem:

model_pipe = tpot.TPOTRegressor(generations=100, 
                   population_size=100,                        
                   offspring_size=None, 
                   mutation_rate=0.9,                       
                   crossover_rate=0.1,
                   scoring='neg_mean_squared_error',
                   cv=list(splitter.split(X_trn)),   
                   subsample=1.0,
                   n_jobs=4,                        
                   max_time_mins=24*60,              
                   max_eval_time_mins=20,           
                   random_state=42,                  
                   config_dict="TPOT light",
                   template="RandomTree",
                   warm_start=False,
                   memory='auto',                    
                   use_dask=True,                    
                   periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',
                   early_stop=10,                    
                   verbosity=3,                      
                   disable_update_check=False)

Whamp commented 5 years ago

So I've been able to test the dev branch a little more. I was able to get TPOT to return exported pipelines with the standard TPOT config and these parameters. I think n_jobs=4 was an important change from n_jobs=-2, but it's tough to tell with the silent failures and no error messages.

model_pipe = tpot.TPOTRegressor(generations=1, 
                                population_size=2,                        
                                offspring_size=None, 
                                mutation_rate=0.9,                       
                                crossover_rate=0.1,
                                scoring='neg_mean_squared_error', 
                                cv=trn_val_split_idxs,                                         
                                subsample=1.0,
                                n_jobs= 4,                                                     
                                max_time_mins=None,                                            
                                max_eval_time_mins=5,                                        
                                random_state=42,                                            
                                config_dict=None,
                                template="RandomTree",
                                warm_start=False,
                                memory='auto',                                                 
                                use_dask=True,                                                
                                periodic_checkpoint_folder='/home/will/Dalton/USA/tpot_tmp',   
                                early_stop=10,                                                 
                                verbosity=4,                                                  
                                disable_update_check=False)

But when I tried to run the TPOT light config, I got the same silent failure where TPOT thought it was running but no jobs were running on the CPU.
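Since n_jobs=4 behaves better here than n_jobs=-2, one possible (unconfirmed) explanation is thread oversubscription between joblib workers and the native BLAS/OpenMP thread pools. A minimal workaround sketch is to cap those pools via environment variables at the very top of the notebook, before numpy or scikit-learn are imported:

```python
import os

# Cap native thread pools BEFORE numpy/scikit-learn are imported, so each
# joblib worker uses a single BLAS/OpenMP thread. This is a hypothetical
# mitigation for suspected oversubscription, not a confirmed fix for this issue.
THREAD_VARS = ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS')
for var in THREAD_VARS:
    os.environ[var] = '1'

print({var: os.environ[var] for var in THREAD_VARS})
```

These variables only take effect if they are set before the numerical libraries initialize their thread pools, which is why the cell has to run first.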

weixuanfu commented 5 years ago

Does the solution (https://github.com/EpistasisLab/tpot/issues/876#issuecomment-499626973) from another issue work for you?

Whamp commented 5 years ago

Does the solution (#876 (comment)) from another issue work for you?

That had no effect for me. It's never made it beyond 0% on the progress bar.