Companion library for the machine learning book Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists.
TextFeatureSelection is a Python library that helps improve text classification models through feature selection. It provides three methods: TextFeatureSelection, TextFeatureSelectionGA, and TextFeatureSelectionEnsemble.
TextFeatureSelection follows the filter method for feature selection. It provides a score for each word token, and a threshold on that score decides which words to include. There are 4 algorithms in this method, as follows.
It has the following parameters.
from TextFeatureSelection import TextFeatureSelection

# Multiclass classification problem
input_doc_list = ['i am very happy', 'i just had an awesome weekend', 'this is a very difficult terrain to trek. i wish i stayed back at home.', 'i just had lunch', 'Do you want chips?']
target = ['Positive', 'Positive', 'Negative', 'Neutral', 'Neutral']
fsOBJ = TextFeatureSelection(target=target, input_doc_list=input_doc_list)
result_df = fsOBJ.getScore()  # scores for each word token
print(result_df)

# Binary classification
input_doc_list = ['i am content with this location', 'i am having the time of my life', 'you cannot learn machine learning without linear algebra', 'i want to go to mars']
target = [1, 1, 0, 1]
fsOBJ = TextFeatureSelection(target=target, input_doc_list=input_doc_list)
result_df = fsOBJ.getScore()
print(result_df)
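The filter idea described above (score every token, then keep the tokens whose score clears a threshold) can be sketched without the library. The snippet below is only an illustration, assuming a simplified proportional-difference score in the spirit of the Simeon and Hilderman reference. The function name, the toy documents, and the 0.5 cut-off are hypothetical and are not taken from TextFeatureSelection itself.

```python
import re
from collections import Counter


def proportional_difference(docs, labels, positive_label):
    """Score each token as (A - B) / (A + B), where A and B are the numbers of
    positive and non-positive documents that contain the token."""
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        # Count each token once per document (document frequency).
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        (pos_df if label == positive_label else neg_df).update(tokens)
    vocab = set(pos_df) | set(neg_df)
    return {t: (pos_df[t] - neg_df[t]) / (pos_df[t] + neg_df[t]) for t in vocab}


# Hypothetical toy corpus: 1 = positive review, 0 = negative review.
docs = ["good movie", "good plot good cast", "bad movie", "bad plot", "good movie bad ending"]
labels = [1, 1, 0, 0, 1]

scores = proportional_difference(docs, labels, positive_label=1)

# Keep tokens that lean strongly towards one class (hypothetical cut-off).
threshold = 0.5
selected = sorted(t for t, s in scores.items() if abs(s) >= threshold)
print(selected)  # ['cast', 'ending', 'good']
```

Tokens such as "cast" that occur in a single document receive an extreme score, which is one reason a score threshold is often combined with a minimum document frequency.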
TextFeatureSelectionGA follows the genetic algorithm method, a population-based metaheuristic search algorithm. It returns the set of word tokens that gave the best model score found during the search.
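As a rough illustration of how a genetic algorithm searches over token subsets, the sketch below evolves bit masks over a small vocabulary. It is not the library's implementation: the function genetic_select, the toy fitness function, and the "useful" weights are hypothetical, and the parameter names simply mirror common genetic algorithm settings. In practice the fitness would be a model score obtained by training on the selected tokens.

```python
import random


def genetic_select(vocab, fitness, generations=20, population=12,
                   prob_crossover=0.9, prob_mutation=0.1, seed=1):
    """Evolve 0/1 masks over vocab and return (selected tokens, best-score history)."""
    rng = random.Random(seed)
    n = len(vocab)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(population)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_pop = [ranked[0][:]]                  # elitism: keep the best mask unchanged
        parents = ranked[: max(2, population // 2)]  # truncation selection
        while len(next_pop) < population:
            a, b = rng.sample(parents, 2)
            child = a[:]
            if n > 1 and rng.random() < prob_crossover:
                cut = rng.randrange(1, n)          # single-point crossover
                child = a[:cut] + b[cut:]
            # Flip each bit with probability prob_mutation.
            child = [1 - g if rng.random() < prob_mutation else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return [t for t, g in zip(vocab, best) if g], history


# Hypothetical setup: three informative tokens, three uninformative ones.
vocab = ["awesome", "happy", "difficult", "the", "is", "just"]
useful = {"awesome": 1.0, "happy": 1.0, "difficult": 1.0}


def toy_fitness(mask):
    # Reward informative tokens, penalise every other selected token.
    return sum(useful.get(t, -0.5) for t, g in zip(vocab, mask) if g)


best_tokens, history = genetic_select(vocab, toy_fitness)
print(best_tokens, history[-1])
```

Because the best mask of each generation is carried over unchanged, the best score never decreases from one generation to the next, although the search is not guaranteed to reach the global optimum.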
Its parameters are divided into 2 groups.
a) Genetic algorithm parameters: These are provided during object initialization.
b) Machine learning model and tfidf parameters: These are provided during function call.
Data Parameters
doc_list: Text documents in a Python list. Example: ['i had dinner','i am on vacation','I am happy','Wastage of time']
label_list: Labels in a Python list. Example: ['Neutral','Neutral','Positive','Negative']
Modelling Parameters
model: A model that provides a .fit function for training and a .predict function for predicting on test data. The model must also be able to train a classifier on TfidfVectorizer features. Default is logistic regression from sklearn.
model_metric: Classifier cost function. Select one from: ['f1','precision','recall']. Default is 'f1'.
avrg: Averaging used in model_metric. Select one from ['micro', 'macro', 'samples','weighted', 'binary']. For binary classification the default is 'binary', and for multi-class classification the default is 'micro'.
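To make the effect of the averaging choice concrete, the sketch below computes micro- and macro-averaged F1 by hand for a small multi-class example. The function f1_scores and the toy labels are hypothetical and are shown only to illustrate the two definitions; the library itself relies on its own metric handling.

```python
def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro: pool the counts over all classes, then compute one F1.
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(per_class.values()) / len(labels)
    return micro, macro, per_class


y_true = ['Positive', 'Positive', 'Negative', 'Neutral', 'Neutral', 'Neutral']
y_pred = ['Positive', 'Neutral', 'Negative', 'Neutral', 'Neutral', 'Positive']

micro, macro, per_class = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.667 0.722
```

Micro averaging weights every document equally, so frequent classes dominate, while macro averaging weights every class equally, so a small class such as 'Negative' here has as much influence as a large one.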
TfidfVectorizer Parameters
analyzer: {'word', 'char', 'char_wb'} or callable, default='word'. Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
min_df: float or int, default=2. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.
max_df: float or int, default=1.0. When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.
stop_words: {'english'}, list, default=None. If a string, it is passed to _check_stop_list and the appropriate stop list is returned; 'english' is currently the only supported string value. There are several known issues with 'english' and you should consider an alternative (see the scikit-learn documentation on stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
tokenizer: callable, default=None. Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
token_pattern: str, default=r"(?u)\b\w\w+\b". Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
lowercase: bool, default=True. Convert all characters to lowercase before tokenizing.
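The interaction of token_pattern, lowercase, min_df, and max_df described above can be sketched in plain Python. The helper build_vocabulary below is a hypothetical stand-in that only mimics the documented pruning rules; it is not how TfidfVectorizer is implemented, and it ignores n-grams and stop words.

```python
import re
from collections import Counter

# Default token pattern from the parameter list: tokens of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(docs, min_df=2, max_df=1.0, lowercase=True):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        text = doc.lower() if lowercase else doc
        df.update(set(TOKEN_PATTERN.findall(text)))  # count once per document
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, count in df.items() if low <= count <= high)


docs = ['i had dinner', 'i am on vacation', 'I am happy', 'Wastage of time']

print(build_vocabulary(docs))             # ['am']
print(build_vocabulary(docs, min_df=1))   # every token of 2+ characters
print(build_vocabulary(docs, min_df=1, max_df=0.25))  # drops 'am' (in 2 of 4 docs)
```

Note that the single-character token "i" never enters the vocabulary because of the default token pattern, and that with min_df=2 only "am" survives in such a small corpus.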
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
TextFeatureSelectionEnsemble ensembles multiple models to find the model combination with the highest performance.
It uses grid search and document frequency to reduce the vector size for individual models, which makes them less complex and computationally faster. At the ensemble learning layer, a metaheuristic algorithm identifies the smallest possible combination of individual models that has the highest impact on ensemble model performance.
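The goal of the ensemble layer, finding the smallest set of base models that still gives the best score, can be illustrated on a toy scale. The sketch below swaps in two simplifications that are assumptions and not the library's approach: an exhaustive search over subsets instead of a metaheuristic, and a plain majority vote instead of a trained ensemble model. The model names and predictions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(pred_lists):
    # For each document, take the most frequent label across models.
    # Ties go to the label encountered first, i.e. the earlier model's vote.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*pred_lists)]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_model_subset(predictions, y_true):
    """Return (score, model names) for the smallest subset with the top score."""
    best = None
    for size in range(1, len(predictions) + 1):
        for names in combinations(sorted(predictions), size):
            voted = majority_vote([predictions[n] for n in names])
            score = accuracy(y_true, voted)
            # Strict '>' means a larger subset only wins if it is truly better.
            if best is None or score > best[0]:
                best = (score, names)
    return best


# Hypothetical held-out predictions from four base models.
y_true = [1, 1, 0, 1, 0]
predictions = {
    "lr":  [1, 1, 0, 0, 0],
    "rf":  [1, 0, 0, 1, 0],
    "knn": [0, 1, 1, 1, 0],
    "ada": [0, 0, 1, 0, 1],
}

score, names = best_model_subset(predictions, y_true)
print(score, names)  # 1.0 ('knn', 'lr', 'rf')
```

Exhaustive search is only feasible for a handful of models, since the number of subsets doubles with every model added; this is the kind of search space where a metaheuristic becomes attractive.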
Base Model Parameters
doc_list: Python list of text documents for training base models.
label_list: Python list of Y labels.
use_class_weight: Boolean indicating whether to apply class weights before training classifiers. Default is False.
pickle_path: Path where base models, text feature vectors, and ensemble models will be saved on the local machine.
save_data: Boolean, True or False. Whether the datasets used for training the base models and ensemble models will be saved on the local machine. Default is False.
n_crossvalidation: Number of cross-validation samples to create. A higher value increases model training time; a lower value results in a less reliable model. Default is 5.
seed_num: Seed number for training base models as well as for creating cross-validation data. Default is 1.
stop_words: Stop words for count and tfidf vectors. Default is None.
lowercase: Whether to lowercase text for count and tfidf vectors. Default is True.
n_jobs: Number of jobs to run in parallel when training sklearn and xgboost models. Default is -1.
cost_function: Cost function used to optimize base models. During feature selection using grid search for base models, this cost function is used to identify which words to remove, based on a combination of lower and higher document frequency for words. Available options are 'f1', 'precision', 'recall'. Default is 'f1'.
average: Averaging to be used for cost_function. Useful for multi-class classification. Available options are 'micro','macro','samples','weighted' and 'binary'. Default is 'binary'.
basemodel_nestimators: Number of n_estimators. Used as a parameter for tree-based models such as 'XGBClassifier','AdaBoostClassifier','RandomForestClassifier','ExtraTreesClassifier'. Default is 500.
feature_list: Type of features to be used for ensembling. Available options are 'Unigram','Bigram','Trigram'. Default is ['Unigram','Bigram','Trigram'].
vector_list: Type of text vectors from sklearn to be used. Available options are 'CountVectorizer','TfidfVectorizer'. Default is ['CountVectorizer','TfidfVectorizer'].
base_model_list: List of machine learning algorithms to be trained as base models for ensemble layer training. Available options are 'LogisticRegression','XGBClassifier','AdaBoostClassifier','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'. Default is ['LogisticRegression','XGBClassifier','AdaBoostClassifier','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'].
Metaheuristic algorithm feature selection parameters for ensemble model
method: The metaheuristic feature selection method to use. The available methods are 'ga', 'sa', 'aco', and 'pso', which stand for genetic algorithm, simulated annealing, ant colony optimization, and particle swarm optimization respectively. Select one of the 4. Default is 'ga'.
MetaHeuristicsParameters: Parameters for the metaheuristic feature selection method for ensemble learning. This is used for identifying the best combination of base models for ensemble learning. It helps remove models that make no contribution to ensemble learning and keep only the important models. The FeatureSelection module from the MetaHeuristicsFS Python library is used. Refer to the documentation for MetaHeuristicsFS at: https://pypi.org/project/MetaHeuristicsFS/ and to the example usage of MetaHeuristicsFS for feature selection: https://github.com/StatguyUser/feature_engineering_and_selection_for_explanable_models/blob/37ba0d2921fbabbb83df44c6eb7a1242b19a637f/Chapter%208%20-%20Hotel%20Cancelation%20.ipynb
Parameters used are:
{"model_object": LogisticRegression(n_jobs=-1,random_state=1),
"cost_function":f1_score,
"average":'micro',
"cost_function_improvement":'increase',
"ga_parameters":{"generations":50,
"population":50,
"prob_crossover":0.9,
"prob_mutation":0.1,
"run_time":120},
"sa_parameters":{"temperature":1500,
"iterations":50,
"n_perturb":1,
"n_features_percent_perturb":1,
"alpha":0.9,
"run_time":120},
"aco_parameters":{"iterations":50,
"N_ants":50,
"evaporation_rate":0.9,
"Q":0.2,
"run_time":120},
"pso_parameters":{"iterations":50,
"swarmSize":50,
"run_time":120}
}
Outputs are saved in 5 folders:
model: Contains the base models.
vector: Contains the count and tfidf vectors for each model.
ensemble_model: Contains the ensemble model.
deleted: Contains the base models and vectors for models that were discarded by the genetic algorithm.
data_files: Contains the list of data files used for training the base models and the ensemble model.
Apart from the above 5, it also saves and returns the list of columns used in the ensemble layer, named best_ensemble_columns. These columns are used in exactly the same order for the feature matrix in the ensemble layer.
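Because the column order matters, a small helper can make the ordering explicit when assembling the ensemble-layer feature matrix at prediction time. The sketch below is an assumption about how one might do this by hand: the function build_ensemble_matrix and the column names are hypothetical, and the actual names stored in best_ensemble_columns should be taken from the library's output.

```python
def build_ensemble_matrix(base_predictions, best_ensemble_columns):
    """Arrange base-model outputs as rows, with columns in the saved order."""
    missing = [c for c in best_ensemble_columns if c not in base_predictions]
    if missing:
        raise KeyError(f"missing base-model outputs: {missing}")
    # Pick the columns in exactly the saved order, then transpose to rows.
    columns = [base_predictions[c] for c in best_ensemble_columns]
    return [list(row) for row in zip(*columns)]


# Hypothetical base-model predictions for three documents.
base_predictions = {
    "model_a": [1, 0, 1],
    "model_b": [0, 0, 1],  # discarded by the search, so not in the column list
    "model_c": [1, 1, 1],
}
best_ensemble_columns = ["model_c", "model_a"]  # hypothetical saved order

matrix = build_ensemble_matrix(base_predictions, best_ensemble_columns)
print(matrix)  # [[1, 1], [1, 0], [1, 1]]
```

Selecting columns by name in the saved order, and not by dictionary order, guards against silently feeding the ensemble model a permuted feature matrix.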
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble

imdb_data = pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
# convert raw text and labels to python list
doc_list = imdb_data['review'].tolist()
label_list = imdb_data['labels'].tolist()
# Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj = TextFeatureSelectionEnsemble(doc_list, label_list, n_crossvalidation=2, pickle_path='/home/user/folder/', average='micro', base_model_list=['LogisticRegression', 'RandomForestClassifier', 'ExtraTreesClassifier', 'KNeighborsClassifier'])
best_columns = gaObj.doTFSE()
pip install TextFeatureSelection
Md Azimul Haque (2022). Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists. Lulu Press, Inc.
A Comparative Study on Feature Selection in Text Categorization by Yiming Yang and Jan O. Pedersen
Entropy based feature selection for text categorization by Christine Largeron, Christophe Moulin, Mathias Géry
Categorical Proportional Difference: A Feature Selection Method for Text Categorization by Mondelle Simeon, Robert J. Hilderman
Feature Selection and Weighting Methods in Sentiment Analysis by Tim O'Keefe and Irena Koprinska
Feature Selection For Text Classification Using Genetic Algorithms by Noria Bidi and Zakaria Elberrichi