Closed · murali-munna closed this issue 3 years ago
Hi, thanks for using my work. I appreciate the request. I will add this along with a few other fixes and enhancements in the next version.
There is a workaround: create bigrams and trigrams joined with an underscore (`_`) character and concatenate them onto the original text, separated by spaces. For example, if your original corpus is `i am content`,
then modify your text to `i am content i_am am_content`.
Here I created bigrams with the underscore character, and all bigrams are appended to the original text with a space character separating them from the original tokens.
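The expansion above can be sketched with a small helper that appends underscore-joined n-grams to each document before it is passed to the library. Note that `add_ngrams` is my own illustrative name, not part of TextFeatureSelection:

```python
def add_ngrams(doc, n=2):
    """Append underscore-joined n-grams of `doc` to the text itself,
    space-separated, so feature scorers treat each n-gram as one token."""
    tokens = doc.split()
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return " ".join([doc] + ngrams)

# Bigrams, as in the example above
print(add_ngrams('i am content'))       # i am content i_am am_content
# Trigrams work the same way with n=3
print(add_ngrams('i am content', n=3))  # i am content i_am_content
```

The expanded documents can then be passed to `TextFeatureSelection(...).getScore()` unchanged.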
```python
input_doc_list=['i am content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df
```
word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
```python
input_doc_list=['i am content i_am am_content','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
result_df=TextFeatureSelection(target=target,input_doc_list=input_doc_list).getScore()
result_df
```
word list | word occurrence count | Proportional Difference | Mutual Information | Chi Square | Information Gain
-- | -- | -- | -- | -- | --
algebra | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
am | 2 | 1.0 | -inf | 1.333333 | 0.0
am_content | 1 | 1.0 | -inf | 0.444444 | 0.0
cannot | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
content | 1 | 1.0 | -inf | 0.444444 | 0.0
go | 1 | 1.0 | -inf | 0.444444 | 0.0
having | 1 | 1.0 | -inf | 0.444444 | 0.0
i_am | 1 | 1.0 | -inf | 0.444444 | 0.0
learn | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
learning | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
life | 1 | 1.0 | -inf | 0.444444 | 0.0
linear | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
machine | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
mars | 1 | 1.0 | -inf | 0.444444 | 0.0
my | 1 | 1.0 | -inf | 0.444444 | 0.0
of | 1 | 1.0 | -inf | 0.444444 | 0.0
the | 1 | 1.0 | -inf | 0.444444 | 0.0
time | 1 | 1.0 | -inf | 0.444444 | 0.0
to | 1 | 1.0 | -inf | 0.444444 | 0.0
want | 1 | 1.0 | -inf | 0.444444 | 0.0
without | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
you | 1 | -1.0 | 1.386294 | 4.000000 | 0.0
You can do the same for trigrams.
Thanks for suggesting the workaround. I know MI and Chi Square work on just the presence of a token, but I am not sure whether PD and IG depend on frequency as well. If they do, I would need to add those n-grams with matching frequency too.
Anyway, I have yet to explore those two methods, and I also want to understand how the utilities at the class-term level are aggregated to the term level in your final result.
Once I understand the above, I will also try to contribute to the enhancements/documentation.
In case you haven't explored it, I suggest you check SN-grams, which are rich in meta-linguistic properties and less arbitrary than n-grams. I created this a while ago.
@StatguyUser
I have tried it on my corpus and it returned feature scores for all unigrams. Any plans to include bi-/tri-grams, or an input for supplying a vocabulary of n-grams for which we want the feature scores?