EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Is possible to have access to the features used in generations after selection, preprocessing and construction? #640

Open jonathansantilli opened 6 years ago

jonathansantilli commented 6 years ago

Hello, thanks for TPOT, it is really amazing!

I was looking for this information in other questions, but I have not found it yet.

TPOT performs feature selection, preprocessing, and construction, according to the documentation:

[screenshot (2017-12-17) of the TPOT documentation describing feature selection, preprocessing, and construction]

Is it possible to have access to those finally selected features? Even better, is it possible to have them after each generation?

Generation 1 - Current best internal CV score: 0.88888....
Best pipeline so far: ...
Used features: ... (exported to a file PATH/TO/USED_FEATURES)

Generation 2 - Current best internal CV score: 0.88899....
Best pipeline so far: ...
Used features: ... (exported to a file PATH/TO/USED_FEATURES)
...

After executing the export:

tpot.export('tpot_pipeline.py')

It will produce code like:

...
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
...

This means that we have to provide the data to the selected pipeline separately, instead of using the features selected and generated by TPOT. Is that expected, or am I missing something?

Ideally (in my opinion), it would be extraordinarily helpful for research purposes to know the best pipeline and the features it used after each generation. This could be used to analyze how the pipelines improve over time.

Thanks a lot in advance.

weixuanfu commented 6 years ago

Hi, thank you for the interesting idea here. So far, TPOT does not have an API to export the features selected in the best pipeline. We use feature selectors from scikit-learn and put them into scikit-learn Pipeline objects, so I think you could access the selected features within the best Pipeline object. Please check the example in the scikit-learn Pipeline documentation (see the code after "# getting the selected features chosen by anova_filter").
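To make the pattern concrete, here is a minimal sketch of the scikit-learn example being referred to: a Pipeline with a SelectKBest step whose chosen features can be read back after fitting via get_support(). The dataset and step names here are illustrative, not anything TPOT produces.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# ANOVA F-test selector keeping the 2 best features, then a classifier
anova_filter = SelectKBest(f_classif, k=2)
pipe = Pipeline([("anova", anova_filter), ("svc", LinearSVC())])
pipe.fit(X, y)

# getting the selected features chosen by anova_filter
mask = pipe.named_steps["anova"].get_support()
print(mask)  # boolean mask over the 4 input features
```

The same idea should apply to any fitted scikit-learn Pipeline that contains a selector step, including the one TPOT finds.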

jonathansantilli commented 6 years ago

Hello @weixuanfu, thanks for the reply. @rhiever has opened this issue https://github.com/rhiever/tpot/issues/629, which could maybe be oriented in this direction as well? I mean, some sort of final and intermediate explanation of what is happening.

sunils27 commented 6 years ago

This will be a very useful enhancement.

IamAVB commented 6 years ago

I understand that TPOT searches over hyperparameters and different classification models while creating pipelines using GP. But does it involve features too? That is, are the mutation and crossover operators also applied to feature sets while creating each new pipeline?

dsleo commented 6 years ago

Hello,

Thanks for the very nice library. I would love to work on adding this as a feature to the API. Could you point me to where the selection/construction of features happens?

weixuanfu commented 6 years ago

@dsleo Thank you. The feature selection/construction is performed within the scikit-learn pipelines, which are generated via GP in TPOT.
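In other words, since the GP individuals compile down to ordinary scikit-learn Pipelines, one way to recover the selected features is to walk the fitted pipeline's steps and query any step that implements get_support(). The helper below is hypothetical (not part of TPOT's API), and is demonstrated on a plain scikit-learn pipeline standing in for a TPOT result such as `fitted_pipeline_`:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

def selected_feature_masks(pipeline):
    """Return {step_name: boolean feature mask} for every selector step,
    i.e. every step exposing scikit-learn's get_support()."""
    return {
        name: step.get_support()
        for name, step in pipeline.named_steps.items()
        if hasattr(step, "get_support")
    }

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("vt", VarianceThreshold(threshold=0.2)),   # a selector step
    ("tree", DecisionTreeClassifier(random_state=42)),
]).fit(X, y)

masks = selected_feature_masks(pipe)
print(masks["vt"])  # boolean mask over the 4 input features
```

Note this only recovers masks from selector steps; constructed features (e.g. from PolynomialFeatures) would need step-specific handling.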

dsleo commented 6 years ago

@weixuanfu any specific pointer in the code base to look at?
I've started, but I'm not very familiar with tpot and I haven't yet found where this happens. I was hoping that the _pop attribute of the TPOTClassifier() object would contain useful information about the population, and hence the selected features, as in here... Is it directly in eaMuPlusLambda that I should try to modify the _pop attributes to retain information about the features?

Any help would be greatly appreciated, thanks !

weixuanfu commented 6 years ago

Currently, some statistics for evaluated pipelines are saved into evaluated_individuals_ via this function.
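Building on that, a per-generation summary like the one requested above could be derived from such a dict. The sketch below is an assumption based on this thread, not TPOT's API: it presumes `evaluated_individuals_` maps a pipeline string to a stats dict containing "generation" and "internal_cv_score" keys, and runs on toy data:

```python
def best_per_generation(evaluated_individuals):
    """Map generation -> (pipeline string, best internal CV score).

    Assumes each value dict carries 'generation' and 'internal_cv_score'
    entries (schema assumed, not guaranteed by TPOT)."""
    best = {}
    for pipeline_str, stats in evaluated_individuals.items():
        gen, score = stats["generation"], stats["internal_cv_score"]
        if gen not in best or score > best[gen][1]:
            best[gen] = (pipeline_str, score)
    return best

# Toy data standing in for a real evaluated_individuals_ dict
toy = {
    "PipelineA": {"generation": 1, "internal_cv_score": 0.8888},
    "PipelineB": {"generation": 1, "internal_cv_score": 0.8850},
    "PipelineC": {"generation": 2, "internal_cv_score": 0.8890},
}
print(best_per_generation(toy))
```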