StatguyUser / TextFeatureSelection

A Python library for feature selection on text features. It provides a filter method, a genetic algorithm, and TextFeatureSelectionEnsemble for improving text classification models.
MIT License

Support for N-grams or user input vocabulary #8

Closed · murali-munna closed this issue 3 years ago

murali-munna commented 3 years ago

@StatguyUser

I have tried it on my corpus and it returned feature scores for all unigrams. Are there any plans to include bi-/tri-grams, or an option to supply a vocabulary of n-grams for which we want the feature scores?

StatguyUser commented 3 years ago

Hi, thanks for using my work; I appreciate the request. I will add this among a few other fixes and enhancements in the next version.

There is a workaround: create bigrams and trigrams joined with an underscore _ character and concatenate them with the original text, separated by spaces. For example, if your original corpus is i am content, modify your text to i am content i_am am_content. Here I created bigrams with the underscore character, and all bigrams are added back into the original text with a space character separating them from the original tokens; a sketch of this is shown below.
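A minimal sketch of this augmentation, assuming plain whitespace tokenization; the helper name add_bigrams is hypothetical and not part of TextFeatureSelection:

def add_bigrams(doc):
    # Split on whitespace and join each adjacent token pair with an underscore.
    tokens = doc.split()
    bigrams = ['_'.join(pair) for pair in zip(tokens, tokens[1:])]
    # Append the underscore bigrams after the original tokens.
    return ' '.join(tokens + bigrams)

add_bigrams('i am content')
# -> 'i am content i_am am_content'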

Binary classification

from TextFeatureSelection import TextFeatureSelection

input_doc_list=['i am content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df

word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0

Binary classification with bigrams added to the first document

input_doc_list=['i am content i_am am_content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df

word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
am_content | 1 | 1.0 | -inf | 0.444444 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
i_am | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0

You can do the same for trigrams, as in the sketch below.
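The same pattern generalizes to any n; again the helper name add_ngrams is hypothetical, not part of the library:

def add_ngrams(doc, n=3):
    # Build underscore-joined n-grams from a whitespace tokenization.
    tokens = doc.split()
    ngrams = ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Append the underscore n-grams after the original tokens.
    return ' '.join(tokens + ngrams)

add_ngrams('i want to go', n=3)
# -> 'i want to go i_want_to want_to_go'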

murali-munna commented 3 years ago

Thanks for suggesting the workaround. I know MI and Chi Square work on just the presence of a token; I am not sure whether Proportional Difference and Information Gain depend on frequency as well. If they do, I would need to add those n-grams with matching frequency as well.
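One quick way to check this empirically, using only the constructor and getScore call shown above: repeat a token inside a single document and compare its scores before and after. If the rows match, the metric depends only on per-document presence, not frequency. (The 'word list' column name is taken from the output above.)

from TextFeatureSelection import TextFeatureSelection

docs_once = ['i am content', 'you cannot learn']
docs_twice = ['i am content content', 'you cannot learn']  # 'content' repeated
target = [1, 0]

before = TextFeatureSelection(target=target, input_doc_list=docs_once).getScore()
after = TextFeatureSelection(target=target, input_doc_list=docs_twice).getScore()

# Identical 'content' rows would indicate presence-only behaviour.
print(before[before['word list'] == 'content'])
print(after[after['word list'] == 'content'])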

Anyway, I have yet to explore these two methods, and also to understand how the utilities at the class-term level are aggregated to the term level in your final result.

Once I understand the above, I will also try to contribute to the enhancements/documentation.

StatguyUser commented 3 years ago

In case you haven't explored them, I suggest you check out SN-grams (syntactic n-grams), which are rich in metalinguistic properties and less arbitrary than plain n-grams. I created this package a while ago:

https://pypi.org/project/SNgramExtractor/