JuliaText / TextAnalysis.jl

Julia package for text analysis

improper stemming of NGram documents #149

Open tanmaykm opened 5 years ago

tanmaykm commented 5 years ago

Stemming an NGramDocument stems only the last word of each n-gram. Notice below how "repository" is stemmed to "repositori" in one place ("this repositori") but left intact in another ("this repository of").

julia> td = NGramDocument("this repository of julia language", 3)
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("language"=>1,"repository"=>1,"this"=>1,"this repository of"=>1,"of julia language"=>1,"julia language"=>1,"of"=>1,"julia"=>1,"this repository"=>1,"repository of"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(td); td
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("languag"=>1,"this"=>1,"this repository of"=>1,"of julia languag"=>1,"this repositori"=>1,"of"=>1,"julia"=>1,"repositori"=>1,"repository of"=>1,"of julia"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

While stemming a StringDocument stems each word:

julia> sd = StringDocument("this repository of julia language")
StringDocument{String}("this repository of julia language", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(sd); sd
StringDocument{String}("this repositori of julia languag", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

sean-gauss commented 4 years ago

Is work still needed on this issue? @aviks

bnriiitb commented 4 years ago

@aviks is this issue fixed, or is help still needed?

sean-gauss commented 4 years ago

I intended to finish this; however, I am a bit busy with my internship at the moment. If you can resolve this issue, feel free to proceed.

mostol commented 2 years ago

@aviks Hi! I think I figured out what's going on here. It comes down to the stem function in line 38 of stemmer.jl below, which stems the n-gram (token), resulting in its stemmed version (new_token):

https://github.com/JuliaText/TextAnalysis.jl/blob/a38d8d70e9588c77b889c52b8f1f623920e34630/src/stemmer.jl#L36-L48

The problem arises from the fact that token (the n-gram) is actually just stored as a string. The name "token" is maybe a bit of a misnomer: each n-gram is really a string of tokens that we want stemmed, so we either want to treat it as a StringDocument and stem each word in the string, or treat it as a TokenDocument and stem each token of the n-gram individually. Right now, the n-gram is stemmed as just a String, which means it is interpreted as one single entity whose end gets stemmed, rather than as a list of n entities to be stemmed individually.
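To make the difference concrete, here is a minimal sketch in plain Julia. It does not use TextAnalysis.jl at all: `toy_stem` is a hypothetical stand-in for the real stemmer (it only handles the two suffixes seen in this issue), and `stem_whole` / `stem_each` are illustrative names, not functions from the package.

```julia
# Hypothetical toy stemmer, NOT the real one: it only rewrites a
# trailing "y" to "i" and drops a trailing "e".
function toy_stem(word::AbstractString)
    if endswith(word, "y")
        return chop(word) * "i"
    elseif endswith(word, "e")
        return String(chop(word))
    end
    return String(word)
end

# The behaviour described in this issue: the whole n-gram is passed
# to the stemmer as if it were a single word.
stem_whole(ngram::AbstractString) = toy_stem(ngram)

# The desired behaviour: split on whitespace, stem each word, rejoin.
stem_each(ngram::AbstractString) = join(map(toy_stem, split(ngram)), " ")

stem_whole("this repository of")  # "this repository of"
stem_each("this repository of")   # "this repositori of"
```

With the whole-string approach only the final word can ever change, which matches the output at the top of the issue.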

This might mean fundamentally altering the nature of NGramDocuments to be made up of either StringDocuments or vectors of strings like TokenDocuments are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!

(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change

new_token = stem(stemmer, token)

to

new_token = stem_all(stemmer, token)

and be done with it, which is also an option...)
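Whichever stemming call ends up being used, stemming every word makes it more likely that two different n-grams collapse to the same key, so their counts should be summed. Below is a rough sketch of that idea in plain Julia, under stated assumptions: `toy_stem` and `stem_ngram_counts` are hypothetical names, the toy stemmer stands in for the real one, and the input is assumed to be a dictionary from n-gram string to count like the one printed above. It builds a fresh dictionary rather than modifying the one being iterated.

```julia
# Hypothetical toy stemmer, standing in for the real one.
function toy_stem(word::AbstractString)
    if endswith(word, "y")
        return chop(word) * "i"
    elseif endswith(word, "e")
        return String(chop(word))
    end
    return String(word)
end

# Stem every word of every n-gram and sum the counts of keys that collide.
function stem_ngram_counts(ngrams::AbstractDict{<:AbstractString,<:Integer})
    stemmed = Dict{String,Int}()
    for (ngram, count) in ngrams
        key = join(map(toy_stem, split(ngram)), " ")
        stemmed[key] = get(stemmed, key, 0) + count
    end
    return stemmed
end

counts = Dict("repository of" => 1, "repositori of" => 2, "julia language" => 1)
stemmed = stem_ngram_counts(counts)
# "repository of" and "repositori of" both map to "repositori of" (count 3)
```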