Problem with TextFeatureSelectionGA

StatguyUser / TextFeatureSelection

Python library for feature selection for text features. It has a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models. Helps improve your machine learning models.

MIT License

50 stars, 5 forks

doc_list = ['i had dinner', 'i am on vacation', 'I am happy', 'Wastage of time']
label_list = ['Neutral', 'Neutral', 'Positive', 'Negative']
fsGA = TextFeatureSelectionGA(generations=10, population=5, percentage_of_token=60, runtime_minutes=2)
best_vocabulary = fsGA.getGeneticFeatures(doc_list=doc_list, label_list=label_list)

Thanks for reaching out. Below is my response.

1) The genetic algorithm searches for the set of word tokens that gives the best model performance. To assess model stability, it scores each candidate set with 5-fold cross-validation, and the core module tries hundreds of different combinations. The total run time is therefore roughly the time to train a single model, multiplied by 5, multiplied by a few hundred. In short, it is a time-consuming process.
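
To put a number on that multiplication, here is a minimal back-of-the-envelope sketch. The function name and the figures (a 2-second fit, 300 candidate vocabularies) are made up for illustration and are not taken from the library; only the 5 folds come from the explanation above.

```python
def estimate_ga_runtime_seconds(single_fit_seconds, candidates_evaluated, folds=5):
    """Rough cost model: every candidate vocabulary costs `folds` model fits."""
    return single_fit_seconds * folds * candidates_evaluated


# Hypothetical inputs: one model fit takes 2 s, the search scores 300 candidates.
total = estimate_ga_runtime_seconds(single_fit_seconds=2.0, candidates_evaluated=300)
print(f"{total:.0f} s, about {total / 60:.0f} min")  # 3000 s, about 50 min
```

Timing one fit of your own model on your own data and plugging it in gives a rough idea of whether a given runtime_minutes budget is realistic.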

2) The module is not suited to a toy example of a few records like the one below. It is meant for real-world datasets.

doc_list = ['i had dinner', 'i am on vacation', 'I am happy', 'Wastage of time']
label_list = ['Neutral', 'Neutral', 'Positive', 'Negative']
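
A quick way to see why four records are too few for 5-fold cross-validation is to count documents per label. This is only a sketch: cv_feasibility is a hypothetical helper, not part of the library, and it assumes each class should appear in every fold, which may be stricter than what the module actually requires.

```python
from collections import Counter


def cv_feasibility(label_list, folds=5):
    """Return (enough documents for `folds` folds?, classes with fewer than `folds` docs)."""
    counts = Counter(label_list)
    enough_docs = len(label_list) >= folds
    sparse = {label: n for label, n in counts.items() if n < folds}
    return enough_docs, sparse


label_list = ['Neutral', 'Neutral', 'Positive', 'Negative']
enough_docs, sparse = cv_feasibility(label_list)
print(enough_docs)  # False: 4 documents cannot fill 5 folds
print(sparse)       # {'Neutral': 2, 'Positive': 1, 'Negative': 1}
```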

If you can build your own logistic regression model with a TF-IDF vector, then consider feeding it to the module.
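
One way to read that suggestion is as a sanity check: if a plain TF-IDF plus logistic regression baseline can be cross-validated on your data, the data is probably large enough to be worth the search. Below is a minimal sketch using scikit-learn. The toy corpus is invented purely so that the snippet runs, and the pipeline is an assumption about what such a baseline could look like, not the module's internal code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 3 classes x 10 documents, just enough for 5-fold CV.
templates = {
    "Neutral": ["i had dinner at {}", "i am on vacation in {}"],
    "Positive": ["i am happy about {}", "great time at {}"],
    "Negative": ["wastage of time at {}", "terrible day at {}"],
}
places = ["home", "work", "school", "paris", "rome"]

doc_list, label_list = [], []
for label, patterns in templates.items():
    for pattern in patterns:
        for place in places:
            doc_list.append(pattern.format(place))
            label_list.append(label)

# Baseline: TF-IDF features fed into logistic regression, scored with 5-fold CV.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, doc_list, label_list, cv=5, scoring="f1_macro")
print(len(doc_list), "documents, mean macro-F1:", scores.mean())
```

If this baseline already fails or takes very long on your data, the genetic search, which repeats the same kind of fit many times, will not behave any better.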

As a side remark, feature selection does not yield desirable results when done in isolation. To get the best possible results, try ensembling different models and features. While doing so, perform feature selection. Please check the third module TextFeatureSelectionEnsemble. It does exactly that.

Thanks!

Problem with TextFeatureSelectionGA #15