EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

XGBoost supports NaN but tpot enforces imputation #836

Open trhallam opened 5 years ago

trhallam commented 5 years ago

As a general rule, TPOT enforces imputation so that the input and output data contain only finite real values, as scikit-learn requires. XGBoost, as a special case, accepts NaN values in its input.

Context of the issue

I am trying to optimise XGBoost specifically, using a data set with quite a lot of holes in it. I do not want to perform imputation because it affects the results. I looked in base.py and quickly modified the _check_data function to ignore NaN values and skip imputation, but I was wondering whether TPOT could be modified to accommodate this scenario with XGBoost.

A 'no_imputation' keyword could be added to TPOTBase.__init__, for example, to prevent imputation; a rough sketch of the resulting usage follows the example edits below.

Example Edits:

    else:
        if not self._imputed and np.any(np.isnan(features)):
            self._imputed = True
            features = self._impute_values(features)

    try:
        if target is not None:
            # pass force_all_finite='allow-nan' so check_X_y accepts NaN
            # values instead of raising, leaving them for XGBoost to handle
            X, y = check_X_y(features, target, accept_sparse=True,
                             dtype=np.float64, force_all_finite='allow-nan')
            return X, y
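
For reference, the proposed keyword might look something like this from the caller's side. This is only a hypothetical sketch: `no_imputation` does not exist in the current TPOT API, and the name is simply the one suggested above.

```python
from tpot import TPOTRegressor

# Hypothetical usage: `no_imputation` is the proposed keyword and is NOT part
# of the current TPOT API. With it set, _check_data would skip the NaN check
# and imputation, leaving missing values for XGBoost to handle internally.
tpot = TPOTRegressor(generations=5, population_size=20, no_imputation=True)
```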
weixuanfu commented 5 years ago

TPOT enforces imputation on datasets with NaN because most operators in the TPOT configuration do not support NaN. We would likely need another configuration, restricted to operators that handle NaN, if this no_imputation option were added.
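
For what it's worth, restricting the search to NaN-tolerant operators is already possible today through a custom `config_dict`, which would be the natural companion to such an option. A minimal sketch, assuming an XGBoost-only configuration (the hyperparameter ranges and toy data are illustrative, not TPOT's built-in defaults):

```python
import numpy as np
from tpot import TPOTRegressor

# Custom configuration containing only XGBoost, which tolerates NaN inputs.
# The hyperparameter ranges below are illustrative, not TPOT's defaults.
xgb_only_config = {
    'xgboost.XGBRegressor': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.0],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
    }
}

# Toy dataset with holes, as described in the issue.
X_train = np.random.rand(100, 5)
X_train[::7, 2] = np.nan
y_train = np.random.rand(100)

tpot = TPOTRegressor(config_dict=xgb_only_config, generations=5,
                     population_size=20, verbosity=2)
# Note: with the current code path, the NaN features are still imputed by the
# dataset check before reaching XGBoost; pairing a configuration like this
# with the proposed no_imputation flag is what would avoid that.
tpot.fit(X_train, y_train)
```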

trhallam commented 5 years ago

I understand; it is a very specific case that I'm working on. At the moment, though, there is no way to avoid imputation in TPOT without modifying the source. It is not a necessity, more of a nice-to-have.