[ELMo] Reproducing 6 ELMo downstream tasks

szha commented 5 years ago

The results on the following tasks are reported in the ELMo paper (https://allennlp.org/elmo)

[ ] SQuAD
[ ] SNLI
[ ] SRL
[ ] Coref
[x] NER (baseline in progress #466)
[ ] Sentiment

ThomasDelteil commented 5 years ago

tldr: I think this is quite critical to do in order to simply make sure the pre-trained embeddings we provide are actually reproducing the paper's result. Currently the ELMo embeddings seem to not be usable out-of-the-box in a satisfying manner.

I've add very mixed result in using directly the pre-trained ELMo embeddings for sentence embeddings. I used the tutorial http://gluon-nlp.mxnet.io/examples/sentence_embedding/elmo_sentence_representation.html to extract the contextualized word embeddings and used several methods to get the the overall sentence embeddings:

using the average of the embeddings of only the last layer (capturing the most semantic information according to the paper)
using an average of the average embeddings of all layers.

I used the cosine similarity for comparing the resulting sentence embeddings and the results are not very good:

for example:

'The name hippopotamus comes from the ancient Greek word hippopotamos, which means river horse', and President Donald Trump last week intended to reverse sanctions imposed on two Chinese shipping companies accused of violating North Korea trade prohibitions

are closer than

'The name hippopotamus comes from the ancient Greek word hippopotamos, which means river horse', and 'Hippos rank as one of the largest animals in Africa and are not known for their sunny dispositions, causing more human deaths in Africa annually than lions, leopards, crocodiles, or any other of the major predators',

0.95187436 vs 0.92387373, which are also fairly higher values than I would expect.

The poor quality of these results has been corroborated by @la-cruche https://discuss.mxnet.io/t/understand-gluonnlp-elmo-output-shape/3969/9

When doing the same technique with TF embeddings he has claimed to have a lot better results.

PCA'ed embeddings from TF-Hub ELMo vs Gluon-NLP ELMo on sentences of 2 different articles:

leezu commented 5 years ago

@cgraywang do you know what could be the problem?

dmlc / gluon-nlp

[ELMo] Reproducing 6 ELMo downstream tasks #523