allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Reproduce Performance of SNLI, SST-5 and Coref #1583

Closed: apeterswu closed this issue 6 years ago

apeterswu commented 6 years ago

Hi all,

I have tried to reproduce the baseline models and the ELMo-augmented models by running the configurations provided in training_config. I ran three experiments: Textual Entailment (SNLI), Sentiment Analysis (SST-5) and Coreference Resolution (CoNLL-2012), and the results are somewhat strange:

1. SNLI (baseline model without ELMo embeddings, configuration decomposable_attention.json): the accuracy is only 84.88, nearly 4 points short of the reported 88.0.
2. SST-5 (ELMo model, configuration biattention_classification_network_elmo.json): the accuracy is 53.22, about 1.5 points short of the reported 54.7.
3. Coref (baseline model without ELMo embeddings, configuration coref.jsonnet, with the dataset processed by the script you provided): this is the strangest one. During training, training_coref_precision is about 0.06074188036270887, training_coref_recall is about 0.172069153477043, and training_coref_f1 is only 0.08978787441136321, so the test performance is also bad: "test_coref_f1": 0.0839441860224286.

I have no idea what is going wrong with the above results. Could you please provide any help with reproducing them? Thanks a lot.
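For reference, each config was launched roughly like this (a minimal sketch; the serialization directories are just placeholders I picked for this report):

```bash
# Same pattern for all three experiments; -s sets the serialization directory.
allennlp train training_config/decomposable_attention.json -s /tmp/snli_da
allennlp train training_config/biattention_classification_network_elmo.json -s /tmp/sst5_bcn_elmo
allennlp train training_config/coref.jsonnet -s /tmp/coref_baseline

# Final metrics are written to metrics.json in each serialization directory.
cat /tmp/coref_baseline/metrics.json
```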

nelson-liu commented 6 years ago

Going in order:

  1. SNLI (baseline model without ELMo embeddings, configuration decomposable_attention.json): the accuracy is only 84.88, nearly 4 points short of the reported 88.0.

The paper runs the ESIM model, not the Decomposable Attention model. Try using esim.json instead.
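In case it's useful, a minimal sketch of the swap (the serialization directory is just a placeholder):

```bash
# Train ESIM instead of Decomposable Attention; -s sets the serialization directory.
allennlp train training_config/esim.json -s /tmp/snli_esim

# The final accuracy is recorded in metrics.json in the serialization directory.
cat /tmp/snli_esim/metrics.json
```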

  2. SST-5 (ELMo model, configuration biattention_classification_network_elmo.json): the accuracy is 53.22, about 1.5 points short of the reported 54.7.

Speaking as the person who wrote the model: I didn't really tune it to try to match the 54.7 number reported in the paper (the code in allennlp is a reimplementation of the code used to produce the 54.7 number in the figure). See, e.g., https://github.com/allenai/allennlp/pull/1253#issuecomment-391170901.

  3. Coref (baseline model without ELMo embeddings, configuration coref.jsonnet, with the dataset processed by the script you provided):

We're looking into the coref issues; see https://github.com/allenai/allennlp/issues/1545

I'm closing this for now, but feel free to reopen if you have further questions.