StatguyUser / TextFeatureSelection

A Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
MIT License

Support for N-grams or user input vocabulary #8

Closed · murali-munna closed this issue 3 years ago

murali-munna commented 3 years ago

@StatguyUser

I have tried it on my corpus and it returned feature scores for all unigrams. Are there any plans to include bi-/tri-grams, or an option to supply a vocabulary of n-grams for which we want the feature scores?

StatguyUser commented 3 years ago

Hi, thanks for using my work; I appreciate the request. I will add this among a few other fixes and enhancements in the next version.

There is a workaround: create bigrams and trigrams joined with an underscore _ character and concatenate them with the original text, separated by spaces. For example, if your original corpus is i am content, modify your text to i am content i_am am_content. Here I created bigrams with the underscore character, and all bigrams are added back into the original text with a space character separating them from the original tokens; a sketch of this is shown below.
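A minimal sketch of this augmentation, assuming plain whitespace tokenization; the helper name add_bigrams is hypothetical and not part of TextFeatureSelection:

def add_bigrams(doc):
    # Split on whitespace and join each adjacent token pair with an underscore.
    tokens = doc.split()
    bigrams = ['_'.join(pair) for pair in zip(tokens, tokens[1:])]
    # Append the underscore bigrams after the original tokens.
    return ' '.join(tokens + bigrams)

add_bigrams('i am content')
# -> 'i am content i_am am_content'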

Binary classification

from TextFeatureSelection import TextFeatureSelection

input_doc_list=['i am content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df

word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0

Binary classification with bigrams added to the first document

input_doc_list=['i am content i_am am_content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df

word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
am_content | 1 | 1.0 | -inf | 0.444444 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
i_am | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0

You can do the same for trigrams, as in the sketch below.
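The same pattern generalizes to any n; again the helper name add_ngrams is hypothetical, not part of the library:

def add_ngrams(doc, n=3):
    # Build underscore-joined n-grams from a whitespace tokenization.
    tokens = doc.split()
    ngrams = ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Append the underscore n-grams after the original tokens.
    return ' '.join(tokens + ngrams)

add_ngrams('i want to go', n=3)
# -> 'i want to go i_want_to want_to_go'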

murali-munna commented 3 years ago

Thanks for suggesting the workaround. I know MI and Chi Square work on just the presence of a token; I am not sure whether Proportional Difference and Information Gain depend on frequency as well. If they do, I would need to add those n-grams with matching frequency as well.
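One quick way to check this empirically, using only the constructor and getScore call shown above: repeat a token inside a single document and compare its scores before and after. If the rows match, the metric depends only on per-document presence, not frequency. (The 'word list' column name is taken from the output above.)

from TextFeatureSelection import TextFeatureSelection

docs_once = ['i am content', 'you cannot learn']
docs_twice = ['i am content content', 'you cannot learn']  # 'content' repeated
target = [1, 0]

before = TextFeatureSelection(target=target, input_doc_list=docs_once).getScore()
after = TextFeatureSelection(target=target, input_doc_list=docs_twice).getScore()

# Identical 'content' rows would indicate presence-only behaviour.
print(before[before['word list'] == 'content'])
print(after[after['word list'] == 'content'])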

Anyway, I have yet to explore these two methods, and also to understand how the utilities at the class-term level are aggregated to the term level in your final result.

Once I understand the above, I will also try to contribute to the enhancements/documentation.

StatguyUser commented 3 years ago

In case you haven't explored them, I suggest you check out SN-grams (syntactic n-grams), which are rich in metalinguistic properties and less arbitrary than plain n-grams. I created this package a while ago:

https://pypi.org/project/SNgramExtractor/