Closed aquatiko closed 5 years ago
This got me thinking, here we have just added a default function to skip the unknown words from the dictionary. But if the user is giving his own function, then the result should be some word_id
from the dictionary. This, is because the variable res
holds the ids of all words, which is the inputted to the model to operate and later find an embedding for it.
Maybe, we should add a step to first process the unknown word by the user given function and again check for it's presence in dictionary and if still not present, then we can safely ignore it.
Suggestions @aviks?
@aquatiko What do you get if you run the following under your additions?
using TextAnalysis
model = SentimentAnalyzer()
doc = Document("some sense duh")
model(doc)
I am getting a BoundsError
for any document that includes the word "duh" after implementing your proposed changes in my local copy of TextAnalysis. I suspect it's something I didn't do correctly while emulating what you did in this pr.
@ksteimel This was one interesting bug to figure out!!
It seems like the model was trained on 8k+ words but the no. of unique words which actually have embedding is 5k.
Thus the word like duh
or rohit
were actually present in the dictionary but didn't have an embedding.
Thus the implementation became a bit buggy.
I added some condition, It seems to be working just fine now.
But if the user is giving his own function, then the result should be some word_id from the dictionary.
This is a good point. Maybe the dictionary should be passed to the handle_unknown
function?
@aviks Yeah, we can pass the dictionary to the handle_unknown
function and the user can use the dictionary to modify his words, maybe apply a similarity match with his own corpus or some other similar method of choice. But, one thing is for sure, we have to drop the words which couldn't be brought into our dictionary's domain in regards to current model.
I was having some trouble while working with the sentiment lexicons with the same issue.
I highly suggest there should be a similarity matching method inbuilt.
@aviks Yeah, we can pass the dictionary to the
handle_unknown
function and the user can use the dictionary to modify his words, maybe apply a similarity match with his own corpus or some other similar method of choice. But, one thing is for sure, we have to drop the words which couldn't be brought into our dictionary's domain in regards to current model.
@oxinabox I have added the suggested changes but for adding more tests we have to decide if this should be the desired behaviour
Ping @aviks. Could we have some conclusions regarding this desired behaviour?
Yeah, this looks like a sensible strategy to me.
I have added the required changes but can't seem to find why some of Platform: x86
tests fails :disappointed:
You are probably on top of this but
Still need to add:
handle_unknown
throws an errorhandle_unknown
returns []
handle_unknown
returns a word that is still unknownThese respective tests covers this:
* Tests where `handle_unknown` throws an error * Tests where `handle_unknown` returns `[]`
https://github.com/JuliaText/TextAnalysis.jl/blob/d612724860ac51e30f9c8e626dacd520660c3251/test/sentiment.jl#L17 and https://github.com/JuliaText/TextAnalysis.jl/blob/d612724860ac51e30f9c8e626dacd520660c3251/test/sentiment.jl#L12
* Tests where `handle_unknown` returns a word that is still unknown
@aviks Would this need more work or is this good enough to be merged now?
Added a default function to handle unknown words while using
SentimentAnalyser
. Default function: Skips the wordCloses: https://github.com/JuliaText/TextAnalysis.jl/issues/83 Closes: https://github.com/JuliaText/TextAnalysis.jl/issues/122