EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Feature selection using TPOT - implement a subset selector? #712

Open joseortiz3 opened 6 years ago

joseortiz3 commented 6 years ago

The previous "feature selection" question reminded me to ask this slightly different question.

I was wondering roughly how one might go about implementing a "SubsetSelector" pipeline element to go in the very first stage of the pipeline. The job of this element would be to pass only certain subsets of the features onwards. If someone would "point the way", I would gladly try to implement this.

As an example, let's say our six features are [f1, f2, f3, g1, g2, g3]. I would define two families, [f1, f2, f3] and [g1, g2, g3]. The possible hyper-parameters ("DNA") of this SubsetSelector would be [1,0] for only the f features, [0,1] for only the g features, and [1,1] for both families. This hyper-parameter seems well-suited for genetic optimization.

This pipeline element is not (?) offered by sklearn (maybe because it is trivial and useless in that context), but I think it makes a lot of sense in the context of TPOT.
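A minimal sketch of the idea as an sklearn-style transformer (the class name and the `families`/`mask` parameters are hypothetical, not an existing TPOT or sklearn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SubsetSelector(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: pass through only the feature families
    whose mask bit is 1."""

    def __init__(self, families=None, mask=None):
        self.families = families  # e.g. [[0, 1, 2], [3, 4, 5]]: column indices per family
        self.mask = mask          # e.g. [1, 0]: keep only the first family

    def fit(self, X, y=None):
        # Collect the column indices of every family whose mask bit is set.
        self.columns_ = [c for fam, keep in zip(self.families, self.mask)
                         if keep for c in fam]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.columns_]
```

With the six features [f1, f2, f3, g1, g2, g3] from the example, `families=[[0, 1, 2], [3, 4, 5]]` and `mask=[1, 0]` would pass only the f columns onwards.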

Thoughts? Suggestions? I'm new to genetic programming, tpot, and pipeline optimization, so your thoughts would be greatly appreciated.

weixuanfu commented 6 years ago

The SubsetSelector in TPOT (we call it DatasetSelector) is still under development/testing for now. The method is similar to the one you described in this issue, but we designed it not as a general feature selector, rather as a selector for GWAS or bioinformatics use cases that involve very large numbers of subsets. It could still be used in the general case, though. I will add more details to the docs in that branch.

joseortiz3 commented 6 years ago

I've been trying to understand how pipeline elements work with TPOT/DEAP (it seems almost magical from the outside). I would like to ask some questions about how your class works:

I see from your class definition that DatasetSelector inherits from two sklearn superclasses: class DatasetSelector(BaseEstimator, TransformerMixin). From reading the TPOT code, I have learned that TPOT uses TransformerMixin to identify transforming nodes (operators) in the pipeline, so I suppose that is why DatasetSelector needs to inherit from it. But I don't know why BaseEstimator is also necessary; maybe it's for sklearn compatibility?

I see that DatasetSelector has both fit() and transform() methods defined: fit() determines which of the currently-available features passed to it will be selected, and transform() selects that subset. What would happen if fit() were not defined? Do all pipeline elements have to have a fit() method?
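For the fit() question: sklearn's Pipeline calls fit (or fit_transform) on every intermediate step, and TransformerMixin builds fit_transform() out of fit() plus transform(), so a transformer needs a fit() even when it learns nothing. A small illustration with a made-up stateless transformer (not a TPOT class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FirstKColumns(BaseEstimator, TransformerMixin):
    """Illustrative stateless transformer: keeps the first k columns."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, X, y=None):
        # Nothing to learn, but fit() must exist: Pipeline calls
        # fit_transform() on every intermediate step, which
        # TransformerMixin implements as self.fit(X, y).transform(X).
        return self

    def transform(self, X):
        return np.asarray(X)[:, :self.k]


pipe = Pipeline([("select", FirstKColumns(k=2)),
                 ("clf", LogisticRegression())])
```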

I am curious what the config-dictionary entry for your class would look like. Would it be something like the following?:

    'tpot.builtins.DatasetSelector': {
        'subset_dir': ['path/to/directory'],
        'sel_subset_fname': ['subsets_1.csv', 'subsets_2.csv', 'subsets_3.csv'],
    },

I also have some questions about how TPOT uses this class:

How does TPOT mutate, cross, mate, etc. using this dictionary? My current understanding is that mutation swaps one node in a pipeline tree for a randomly generated one, crossover swaps subtrees between trees, and mating (?) does something similar.

In contrast, I want a list of n booleans ([1, 0, 0, 1, 0, ...]) to control the inclusion of n subsets. I want crossover/mutation/genetic operators to act on this list of booleans, using it as "DNA", if that makes sense. I suppose this is very different from the genetic operators currently defined in TPOT.

So maybe my idea is difficult to implement with TPOT. I'm currently working on it as an independent DEAP project, in which only the selection is optimized while the rest of the pipeline is fixed.
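The boolean-mask "DNA" idea can be sketched in a few lines of plain Python (without DEAP); here the fitness function is only a stand-in for what would really be a cross-validated pipeline score on the selected families:

```python
import random

random.seed(0)
N_FAMILIES = 4


def fitness(mask):
    # Placeholder objective: pretend family 0 is informative and family 2
    # is noise. In practice this would be a CV score of a pipeline fitted
    # on only the families whose mask bit is 1.
    return mask[0] + (1 - mask[2])


def mutate(mask, rate=0.25):
    # Flip each bit independently with the given probability.
    return [1 - b if random.random() < rate else b for b in mask]


def crossover(a, b):
    # One-point crossover between two parent masks.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


# Elitist truncation GA: keep the 5 best masks, breed 5 children.
pop = [[random.randint(0, 1) for _ in range(N_FAMILIES)] for _ in range(10)]
for _ in range(20):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(5)]
    pop = parents + children

best = max(pop, key=fitness)
```

This is the same loop DEAP would run with `tools.cxOnePoint` and `tools.mutFlipBit`, just written out by hand to show how little machinery the mask representation needs.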

Thanks.