Open mboedigh opened 5 years ago
Good catch, thanks.
On further consideration, this change causes the existing test to fail: https://github.com/JuliaText/TextAnalysis.jl/blob/59e0a70a628dff1dea9e1e94cf69978e7ecc6ef2/test/sentiment.jl#L8
Investigating...
I believe there is a deeper problem in the word embedding. Negative words such as "hate" get high (positive) sentiment scores.
using TextAnalysis
d_bad = StringDocument("a horrible thing that everyone hates")
prepare!(d_bad, strip_case | strip_punctuation)
s = SentimentAnalyzer()
words = tokens(d_bad)
[s(StringDocument(w)) for w in words]  # score each word as its own one-word document
I note that there are 88587 words in the dictionary
length(s.model.words)
but the lookup table has only 5000 entries
size(s.model.weight[:embedding_1]["embedding_1"]["embeddings:0"],2)
Invoking s(d_illegal), with d_illegal being any StringDocument containing a word that maps to an index higher than 5000, will cause an error. I don't know exactly where the weights come from, so I can't track it down.
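To make the size mismatch concrete, here is a minimal sketch using made-up stand-ins rather than the actual TextAnalysis.jl model: `vocab`, `embeddings`, and `out_of_range_words` are hypothetical names, and the sizes are shrunk from 88587 and 5000 to a handful. It assumes, as the `size(..., 2)` call above suggests, that each word index selects a column of the embedding matrix.

```julia
# Hypothetical stand-ins, not the TextAnalysis.jl internals.
vocab = Dict("good" => 1, "bad" => 2, "hate" => 7)  # word => index; 7 has no column below
embeddings = zeros(Float32, 4, 5)                    # 4-dimensional vectors for only 5 words

# Return the words whose index has no matching column in the embedding matrix.
function out_of_range_words(vocab, embeddings)
    n = size(embeddings, 2)
    return sort([w for (w, i) in vocab if i > n])
end

missing_words = out_of_range_words(vocab, embeddings)
println(missing_words)

# Looking up such a word directly throws a BoundsError.
err = try
    embeddings[:, vocab["hate"]]
    nothing
catch e
    e
end
println(err isa BoundsError)
```

A check like this, run once against the real dictionary and weight matrix, would list every word that can trigger the error.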
The following code seems to have a bug in that it reshapes a matrix in an apparent attempt to transpose:
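As a generic illustration of this class of bug (this is not the TextAnalysis.jl code in question, and the matrix `A` is made up), `reshape` and a true transpose give different results in Julia because arrays are stored in column-major order:

```julia
A = [1 2 3;
     4 5 6]               # 2×3 matrix

R = reshape(A, 3, 2)      # reinterprets the same column-major data as 3×2
T = permutedims(A)        # actually swaps rows and columns, also 3×2

println(R)                # same elements as A, but in a scrambled layout
println(T)                # the transpose of A
println(R == T)           # false
```

Both results have the same shape, so a shape check alone will not catch the mistake.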
I propose the following: