EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Missing feature transformers and shallow pipelines #1029

Open Genises opened 4 years ago

Genises commented 4 years ago

Hello, I want to use TPOT for feature engineering. To do so, I fixed the model in TPOT (e.g. a linear regression model) and otherwise used the default configuration.

Having some features {x1,x2,…}, there are no feature transformation steps/operators in TPOT that could produce new features such as 5 * (x2 + log(x1))**3 or even just abs(x1 - x2), right?

Testing TPOT on synthetic data (where I know the target function) often yields many more features than needed, and seemingly overly complex ones, e.g. those produced by a single RBFSampler operator.

Also, even if such non-linear feature transformation operators (|x|, exp(x), sin(x), cos(x)) together with combination operators (+, −, ·) were part of TPOT, could a feature like 5 * (x2 + log(x1))**3 even be constructed? All my initial pipelines are very shallow and, due to the multi-objective optimisation and the greedy evolutionary approach, do not get bigger. What about scenarios where multiple operators would need to be introduced at the same time to improve accuracy and to become part of the Pareto front?
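The concern in the last question can be illustrated with a small sketch. It is a simplified two-objective dominance check (fewer operators, higher score), not TPOT's actual selection code, and the candidate pipelines and their scores are invented for illustration only.

```python
# Hypothetical candidates: (description, number_of_operators, cv_score).
# All scores are made up; they assume the intermediate steps alone give no gain.
candidates = [
    ("LinearRegression", 1, 0.60),
    ("log -> LinearRegression", 2, 0.60),
    ("log -> add -> LinearRegression", 3, 0.60),
    ("log -> add -> cube -> LinearRegression", 4, 0.99),
]


def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one."""
    _, size_a, score_a = a
    _, size_b, score_b = b
    no_worse = size_a <= size_b and score_a >= score_b
    strictly_better = size_a < size_b or score_a > score_b
    return no_worse and strictly_better


def pareto_front(pool):
    """Keep only the candidates that no other candidate in the pool dominates."""
    return [c for c in pool if not any(dominates(other, c) for other in pool)]


# Before the full pipeline exists, every intermediate step is dominated
# by the plain model and drops off the front.
front_before = pareto_front(candidates[:3])

# The full pipeline is non-dominated, but only once it has been constructed.
front_after = pareto_front(candidates)

print([name for name, _, _ in front_before])
print([name for name, _, _ in front_after])
```

Under these assumed scores, the intermediate pipelines never survive selection, so the four-operator pipeline can only appear if several operators are added in one variation step.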

weixuanfu commented 4 years ago

Such complex combinations of feature transformations are not supported in TPOT. I think ColumnTransformer would be needed for the idea in this issue. Also, as mentioned above, because the multi-objective optimization penalizes pipelines with a large number of operators and limited improvement in scores, the selection function or the calculation of pipeline complexity would have to be changed for this. But we have no near-term plan to include this feature. Any contributions are welcome.

ruialcn commented 2 years ago

I need complex feature transformations in my current project. I can use ColumnTransformer in sklearn, but it does not seem to be supported in TPOT yet.

I really need it!
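As a possible workaround until something like this exists in TPOT, the features discussed in this thread can be built by hand with scikit-learn's ColumnTransformer and FunctionTransformer, outside of TPOT. The following is only a minimal sketch: the helper functions, the column layout `[x1, x2]`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def cubed_sum_with_log(X):
    # Hypothetical helper: X has columns [x1, x2]; returns (x2 + log(x1))**3.
    return ((X[:, 1] + np.log(X[:, 0])) ** 3).reshape(-1, 1)


def abs_difference(X):
    # Hypothetical helper: returns abs(x1 - x2) as a single column.
    return np.abs(X[:, 0] - X[:, 1]).reshape(-1, 1)


# Two engineered columns plus the two raw columns, stacked side by side.
features = ColumnTransformer(
    transformers=[
        ("cubed", FunctionTransformer(cubed_sum_with_log), [0, 1]),
        ("absdiff", FunctionTransformer(abs_difference), [0, 1]),
        ("raw", "passthrough", [0, 1]),
    ]
)

# Synthetic data with a known target; x1 stays positive so log(x1) is defined.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=(200, 2))
y = 5 * (X[:, 1] + np.log(X[:, 0])) ** 3

# A fixed linear model on top of the engineered features.
model = make_pipeline(features, LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)
X_engineered = features.fit_transform(X)
print(X_engineered.shape, r2)
```

The transformed matrix could in principle also be passed to TPOT as ordinary input features, but the transformations themselves would then not be part of TPOT's search.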