JuliaText / TextAnalysis.jl

Julia package for text analysis

bug in function embedding from sentiment.jl #129

Open mboedigh opened 5 years ago

mboedigh commented 5 years ago

The following code seems to have a bug: it reshapes a matrix in an apparent attempt to transpose it, but reshape only reinterprets the same column-major data with swapped dimensions — it does not move elements the way a transpose does:

function embedding(embedding_matrix, x)
    # inefficient code:
    temp = embedding_matrix[:, Int64(x[1])+1]
    for i=2:length(x)
        temp = hcat(temp, embedding_matrix[:, Int64(x[i])+1])
    end
    # bug: reshape swaps the dimensions but does not transpose the data
    return reshape(temp, reverse(size(temp)))
end
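To see why the reshape is wrong, a quick illustration with a small throwaway matrix (not from the package):

```julia
# reshape reinterprets the same column-major data; it does not transpose.
A = [1 2 3; 4 5 6]            # 2×3 matrix

reshape(A, reverse(size(A)))  # 3×2: columns refilled in column-major order
# 3×2 Matrix{Int64}:
#  1  5
#  4  3
#  2  6

permutedims(A)                # true transpose
# 3×2 Matrix{Int64}:
#  1  4
#  2  5
#  3  6
```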

I propose the following:

function embedding(embedding_matrix, x)
    embedding_matrix[:, Int64.(x).+1]'
end
After this change the results of the following are much improved:
d_good = StringDocument("A very nice thing that everyone likes")
prepare!(d_good, strip_case | strip_punctuation )
d_bad = StringDocument("A boring long dull movie that no one likes")
prepare!(d_bad, strip_case | strip_punctuation)
s = SentimentAnalyzer()
s(d_good)
s(d_bad)
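A quick sanity check that the vectorized one-liner matches the matrix the original loop builds, transposed (toy matrix of my own, not the package's weights):

```julia
# Toy stand-in for the embedding matrix: 4 dimensions × 6 "words".
M = reshape(collect(1.0:24.0), 4, 6)
x = [0, 3, 1]                    # zero-based word indices, as in the issue

# column-by-column gather, as the original loop builds it
loop = hcat((M[:, Int64(xi)+1] for xi in x)...)

fixed = M[:, Int64.(x) .+ 1]'    # proposed one-liner
@assert fixed == loop'           # same data, genuinely transposed
```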
aviks commented 5 years ago

Good catch, thanks.

aviks commented 5 years ago

On further consideration, this change causes the existing test to fail: https://github.com/JuliaText/TextAnalysis.jl/blob/59e0a70a628dff1dea9e1e94cf69978e7ecc6ef2/test/sentiment.jl#L8

Investigating...

mboedigh commented 5 years ago

I believe there is a deeper problem in the word embedding. Bad words such as "hate" get good scores.

using TextAnalysis
d_bad = StringDocument("a horrible thing that everyone hates")
prepare!(d_bad, strip_case | strip_punctuation)
s = SentimentAnalyzer()
[s(StringDocument(w)) for w in tokens(d_bad)]   # score each word individually

I note that there are 88587 words in the dictionary (length(s.model.words)), but the lookup table has only 5000 entries (size(s.model.weight[:embedding_1]["embedding_1"]["embeddings:0"], 2)).

Invoking s(d_illegal), where d_illegal is any StringDocument containing a word whose index is higher than 5000, causes an error. I don't know where the weights come from exactly, so I can't track it down further.
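A possible stopgap until the weight mismatch is understood — this helper is hypothetical: the name, the vocab_limit parameter, and the choice to map out-of-range words to index 0 are my assumptions, not the package's behavior:

```julia
# Hypothetical guard: keep word indices inside the lookup table before the
# embedding lookup. Indices at or beyond `vocab_limit` are mapped to 0,
# treating them as an unknown-word slot (an assumption, not package behavior).
clip_indices(x, vocab_limit) = [xi < vocab_limit ? xi : 0 for xi in x]

clip_indices([3, 4999, 5000, 88586], 5000)
# 4-element Vector{Int64}:
#    3
# 4999
#    0
#    0
```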