StatguyUser / TextFeatureSelection

Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
MIT License

What is it?

Companion library of machine learning book Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists.

TextFeatureSelection is a Python library that helps improve text classification models through feature selection. It provides three methods: TextFeatureSelection, TextFeatureSelectionGA, and TextFeatureSelectionEnsemble.

First method: TextFeatureSelection

It follows the filter method for feature selection. It provides a score for each word token, and a threshold on that score decides which words to include. There are 4 scoring algorithms in this method: Chi-square, Mutual information, Proportional difference, and Information gain.

It takes the following parameters.
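To give a feel for filter-style scoring, one of the metrics above, proportional difference, can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation; the toy documents and the `proportional_difference` helper are made up for this example.

```python
# Sketch of a filter-method score: proportional difference of a token
# between two classes. Illustration only, not the library's code.

def proportional_difference(pos_docs, neg_docs, token):
    """|a - b| / (a + b), where a and b count docs per class containing the token."""
    a = sum(token in doc.split() for doc in pos_docs)
    b = sum(token in doc.split() for doc in neg_docs)
    if a + b == 0:
        return 0.0
    return abs(a - b) / (a + b)

pos = ['i am very happy', 'i just had an awesome weekend']
neg = ['this is a very difficult terrain to trek']

print(proportional_difference(pos, neg, 'happy'))  # token appears only in positive docs
print(proportional_difference(pos, neg, 'very'))   # token appears equally in both classes
```

A token exclusive to one class scores 1.0 (highly discriminative), while a token spread evenly across classes scores 0.0 and is a candidate for removal.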

How to use it?

from TextFeatureSelection import TextFeatureSelection

#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)

#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
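Applying a threshold to the returned scores is then a simple filter. The sketch below uses made-up scores in a plain dict; in practice you would filter the DataFrame returned by getScore() the same way (its exact column names are not shown here).

```python
# Sketch: keep only tokens whose score clears a chosen threshold.
# The scores below are invented for illustration; real scores come from getScore().
scores = {'happy': 0.92, 'awesome': 0.88, 'lunch': 0.10, 'chips': 0.05}
threshold = 0.5

selected = [token for token, s in scores.items() if s >= threshold]
print(selected)  # only the high-scoring tokens survive
```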

Second method: TextFeatureSelectionGA

It follows the genetic algorithm method, a population-based metaheuristic search. It returns the optimal set of word tokens which gives the best possible model score.
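The core idea, selection, crossover, and mutation over binary token masks, can be sketched in a few lines of plain Python. This toy version scores a mask with a hand-made fitness function; the library instead trains and evaluates a model on each candidate vocabulary. All names here (`tokens`, `useful`, `fitness`, `evolve`) are invented for the illustration.

```python
import random

# Toy genetic algorithm over binary token masks -- a sketch of the idea,
# not the library's implementation. Fitness rewards "useful" tokens and
# penalizes the rest; the real library scores masks with a trained model.
random.seed(0)
tokens = ['happy', 'awesome', 'difficult', 'lunch', 'chips', 'the']
useful = {'happy', 'awesome', 'difficult'}

def fitness(mask):
    chosen = {t for t, m in zip(tokens, mask) if m}
    return len(chosen & useful) - len(chosen - useful)

def evolve(pop_size=20, generations=30):
    pop = [[random.randint(0, 1) for _ in tokens] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                 # selection (top half kept)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(tokens))     # single-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(len(tokens))          # point mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
best_tokens = {t for t, m in zip(tokens, best) if m}
print(best_tokens)
```

Because the top half of each generation is carried over unchanged, the best mask found so far is never lost, and the search converges toward the most discriminative vocabulary.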

Its parameters are divided into 2 groups.

a) Genetic algorithm parameters: These are provided during object initialization.

b) Machine learning model and tfidf parameters: These are provided during function call.

Data Parameters

How to use it?

from TextFeatureSelection import TextFeatureSelectionGA

#Example input documents and labels
doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
label_list=['Positive','Positive','Negative','Neutral','Neutral']

getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)

Third method: TextFeatureSelectionEnsemble

TextFeatureSelectionEnsemble combines multiple base models to find the model combination with the highest performance.

It uses grid search and document frequency to reduce the vector size of individual models, which makes them less complex and computationally faster. At the ensemble learning layer, a metaheuristic algorithm identifies the smallest possible combination of individual models with the highest impact on ensemble model performance.
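The ensemble-layer search can be pictured with a small sketch: find the smallest subset of base models whose majority vote scores best on held-out labels. The predictions and labels below are invented, and exhaustive search stands in for the library's metaheuristic, which it does not use; this only illustrates the objective being optimized.

```python
from itertools import combinations

# Sketch of the ensemble-layer idea: search model subsets for the smallest
# combination whose majority vote scores best. Toy data; exhaustive search
# stands in for the metaheuristic used by the library.
y_true = [1, 1, 0, 0, 1, 0]
model_preds = {
    'LogisticRegression':     [1, 1, 0, 0, 1, 1],
    'RandomForestClassifier': [1, 0, 0, 0, 1, 0],
    'KNeighborsClassifier':   [0, 1, 0, 1, 1, 0],
}

def majority_vote(names):
    votes = zip(*(model_preds[n] for n in names))
    return [1 if sum(v) * 2 > len(names) else 0 for v in votes]

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, y_true)) / len(y_true)

best_subset, best_acc = None, -1.0
for r in range(1, len(model_preds) + 1):
    for subset in combinations(model_preds, r):
        acc = accuracy(majority_vote(subset))
        if acc > best_acc:          # strict '>' prefers smaller subsets at ties
            best_subset, best_acc = subset, acc

print(best_subset, best_acc)
```

Scanning subsets in increasing size with a strict improvement test keeps the subset small unless adding a model genuinely raises the combined score.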

Base Model Parameters

Metaheuristic algorithm feature selection parameters for ensemble model

How to use it?


import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble

# Read data and encode the sentiment labels as integers
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)

# convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()

# Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()

Where to get it?

pip install TextFeatureSelection

How to cite

Md Azimul Haque (2022). Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists. Lulu Press, Inc.

Dependencies

References