JuliaText / TextAnalysis.jl

Julia package for text analysis
Other
374 stars 96 forks source link

add handle_unknown function to sentiment.jl #126

Closed aquatiko closed 5 years ago

aquatiko commented 5 years ago

Added a default function to handle unknown words while using SentimentAnalyser. Default function: Skips the word

Closes: https://github.com/JuliaText/TextAnalysis.jl/issues/83 Closes: https://github.com/JuliaText/TextAnalysis.jl/issues/122

aquatiko commented 5 years ago

This got me thinking, here we have just added a default function to skip the unknown words from the dictionary. But if the user is giving his own function, then the result should be some word_id from the dictionary. This, is because the variable res holds the ids of all words, which is the inputted to the model to operate and later find an embedding for it. Maybe, we should add a step to first process the unknown word by the user given function and again check for it's presence in dictionary and if still not present, then we can safely ignore it. Suggestions @aviks?

ksteimel commented 5 years ago

@aquatiko What do you get if you run the following under your additions?

using TextAnalysis
model = SentimentAnalyzer()
doc = Document("some sense duh")
model(doc)

I am getting a BoundsError for any document that includes the word "duh" after implementing your proposed changes in my local copy of TextAnalysis. I suspect it's something I didn't do correctly while emulating what you did in this pr.

aquatiko commented 5 years ago

@ksteimel This was one interesting bug to figure out!! It seems like the model was trained on 8k+ words but the no. of unique words which actually have embedding is 5k. Thus the word like duh or rohit were actually present in the dictionary but didn't have an embedding. Thus the implementation became a bit buggy. I added some condition, It seems to be working just fine now.

aviks commented 5 years ago

But if the user is giving his own function, then the result should be some word_id from the dictionary.

This is a good point. Maybe the dictionary should be passed to the handle_unknown function?

aquatiko commented 5 years ago

@aviks Yeah, we can pass the dictionary to the handle_unknown function and the user can use the dictionary to modify his words, maybe apply a similarity match with his own corpus or some other similar method of choice. But, one thing is for sure, we have to drop the words which couldn't be brought into our dictionary's domain in regards to current model. I was having some trouble while working with the sentiment lexicons with the same issue. I highly suggest there should be a similarity matching method inbuilt.

aquatiko commented 5 years ago

@aviks Yeah, we can pass the dictionary to the handle_unknown function and the user can use the dictionary to modify his words, maybe apply a similarity match with his own corpus or some other similar method of choice. But, one thing is for sure, we have to drop the words which couldn't be brought into our dictionary's domain in regards to current model.

@oxinabox I have added the suggested changes but for adding more tests we have to decide if this should be the desired behaviour

aquatiko commented 5 years ago

Ping @aviks. Could we have some conclusions regarding this desired behaviour?

aviks commented 5 years ago

Yeah, this looks like a sensible strategy to me.

aquatiko commented 5 years ago

I have added the required changes but can't seem to find why some of Platform: x86 tests fails :disappointed:

oxinabox commented 5 years ago

You are probably on top of this but

Still need to add:

aquatiko commented 5 years ago

These respective tests covers this:

* Tests where `handle_unknown` throws an error
* Tests where `handle_unknown` returns `[]`

https://github.com/JuliaText/TextAnalysis.jl/blob/d612724860ac51e30f9c8e626dacd520660c3251/test/sentiment.jl#L17 and https://github.com/JuliaText/TextAnalysis.jl/blob/d612724860ac51e30f9c8e626dacd520660c3251/test/sentiment.jl#L12

* Tests where `handle_unknown` returns a word that is still unknown

https://github.com/JuliaText/TextAnalysis.jl/blob/d612724860ac51e30f9c8e626dacd520660c3251/test/sentiment.jl#L27

aquatiko commented 5 years ago

@aviks Would this need more work or is this good enough to be merged now?