Open tanmaykm opened 5 years ago
Is work still needed on this issue? @aviks
@aviks is this issue fixed, or is help still needed?
I intended to finish this, but at the moment I am a bit busy with my internship. If you can resolve this issue, feel free to proceed.
@aviks Hi! I think I figured out what's going on here. It comes down to the `stem` function in line 38 of `stemmer.jl` below, which stems the n-gram (`token`), resulting in its stemmed version (`new_token`):
The problem arises from the fact that `token` (the n-gram) is actually just stored as a string. The name "token" is maybe a bit of a misnomer: each n-gram is really a string of tokens that we want stemmed, so we either want to think about it as a `StringDocument` and stem each word in the string, or we'd want to think about it as a `TokenDocument` and stem each token of the n-gram individually. Right now, the n-gram is stemmed as just a `String`, which means the n-gram is interpreted as one single entity which has its end stemmed, rather than a list of n entities to be stemmed individually.
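To make the difference concrete, here is a minimal sketch. It does not use the package's real stemmer: `toy_stem` is a hypothetical stand-in that only rewrites a trailing "y" to "i", and `stem_as_string` / `stem_per_token` are illustrative names, not functions from TextAnalysis.jl.

```julia
# Hypothetical stand-in for a real stemmer: it only rewrites a trailing "y" to "i".
toy_stem(word::AbstractString) = endswith(word, "y") ? chop(word) * "i" : String(word)

# Treat the n-gram as one string: only the end of the whole string is stemmed.
stem_as_string(ngram::AbstractString) = toy_stem(ngram)

# Treat the n-gram as a sequence of tokens: split on whitespace, stem each word, rejoin.
stem_per_token(ngram::AbstractString) = join(map(toy_stem, split(ngram)), " ")

ngram = "repository history"
println(stem_as_string(ngram))  # prints "repository histori"
println(stem_per_token(ngram))  # prints "repositori histori"
```

With the whole-string approach, "repository" survives unstemmed whenever it is not the last word, which matches the behaviour described in the issue.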
This might mean fundamentally altering the nature of `NGramDocument`s to be made up of either `StringDocument`s or vectors of strings like `TokenDocument`s are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!
(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change `new_token = stem(stemmer, token)` to `new_token = stem_all(stemmer, token)` and be done with it, which is also an option...)
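For illustration, here is roughly what stemming every word of each n-gram key would do to a table of n-gram counts. This is only a sketch under assumptions: the table is modelled as a plain `Dict{String,Int}`, and `toy_stem`, `stem_words`, and `stem_ngram_counts` are hypothetical names, not the package's API.

```julia
# Hypothetical stand-in for a real stemmer: it only rewrites a trailing "y" to "i".
toy_stem(word::AbstractString) = endswith(word, "y") ? chop(word) * "i" : String(word)

# Stem every word of an n-gram key, then rejoin the words with single spaces.
stem_words(ngram::AbstractString) = join(map(toy_stem, split(ngram)), " ")

# Build a new count table with stemmed keys, summing counts when keys collide.
function stem_ngram_counts(counts::Dict{String,Int})
    stemmed = Dict{String,Int}()
    for (ngram, n) in counts
        key = stem_words(ngram)
        stemmed[key] = get(stemmed, key, 0) + n
    end
    return stemmed
end

counts = Dict("repository history" => 2, "repositori histori" => 1, "the repository" => 1)
result = stem_ngram_counts(counts)
# result maps "repositori histori" to 3 and "the repositori" to 1
```

Building a fresh dictionary instead of editing the old one in place avoids modifying a dictionary while iterating over it.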
Stemming a NGramDocument stems only the last word of each ngram. Notice below how `repository` is stemmed to `repositori` in one place but left intact in another.
While stemming a StringDocument stems each word: