StatguyUser / TextFeatureSelection

Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
MIT License

Problem with TextFeatureSelectionGA #15

Closed heriistantoo closed 2 years ago

heriistantoo commented 2 years ago

Hello, I just read your paper on feature selection with genetic algorithms and am interested in trying the code. But when I try the code below:

from TextFeatureSelection import TextFeatureSelectionGA

doc_list = ['i had dinner','i am on vacation','I am happy','Wastage of time']
label_list = ['Neutral','Neutral','Positive','Negative']

fsGA = TextFeatureSelectionGA(generations=10, population=5, percentage_of_token=60, runtime_minutes=2)
best_vocabulary = fsGA.getGeneticFeatures(doc_list=doc_list, label_list=label_list)

The code never finishes: I have waited for hours without getting a result. Can you please explain why this happens? Could you also help me choose parameters that produce output quickly? I'm curious what kind of output the algorithm gives.

Thank you.

StatguyUser commented 2 years ago

Thanks for reaching out. Below is my response.

1) The genetic algorithm tries to find the set of word tokens that gives the best performance. While doing so, it runs 5-fold cross-validation to assess model stability. The core genetic algorithm module tries hundreds of different combinations, so the total runtime is roughly the time taken to train a single model, multiplied by 5, multiplied by a few hundred. In short, it is a time-consuming process.

2) The module is not suited to a toy example of a few records like the one below; it is meant for real-world datasets.

doc_list = ['i had dinner','i am on vacation','I am happy','Wastage of time']
label_list = ['Neutral','Neutral','Positive','Negative']

If you can build your own logistic regression model with a TF-IDF vector, then consider feeding it to the module.
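
As a rough illustration of this check and of the cost arithmetic in point 1, here is a minimal sketch using scikit-learn directly, not this library's internals. The corpus is a made-up placeholder, and ASSUMED_CANDIDATES is a hypothetical stand-in for "a few hundred" combinations; the real number depends on the generations, population, and runtime settings you choose.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus (hypothetical): replace with your real documents and labels.
# Stratified 5-fold cross-validation needs at least 5 examples per class.
doc_list = (
    ["i am happy", "what a great day", "loved the dinner",
     "vacation was wonderful", "very pleasant trip"] * 4
    + ["wastage of time", "terrible service", "i am disappointed",
       "awful experience", "very boring trip"] * 4
)
label_list = ["Positive"] * 20 + ["Negative"] * 20

# Baseline: TF-IDF features fed into a logistic regression classifier.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Time one 5-fold cross-validated evaluation of the baseline.
start = time.perf_counter()
scores = cross_val_score(baseline, doc_list, label_list, cv=5)
seconds_per_candidate = time.perf_counter() - start

# Assumption: the search evaluates about this many token combinations.
ASSUMED_CANDIDATES = 300
estimated_minutes = seconds_per_candidate * ASSUMED_CANDIDATES / 60

print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
print(f"One 5-fold evaluation: {seconds_per_candidate:.2f} s")
print(f"Rough estimate for {ASSUMED_CANDIDATES} candidates: {estimated_minutes:.1f} min")
```

If this baseline cannot be cross-validated on your data at all, for example because a class has fewer than 5 documents, the dataset is probably too small for the search to be meaningful.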

As a side remark, feature selection does not yield desirable results when done in isolation. To get the best possible results, try ensembling different models and features. While doing so, perform feature selection. Please check the third module TextFeatureSelectionEnsemble. It does exactly that.

Thanks!