allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Textual Entailment - RoBERTa model trained on SNLI - Different results using the demo and the library #4517

Status: Closed (silvia596 closed this issue 4 years ago)

silvia596 commented 4 years ago

I've just copied the usage snippet from your demo, and I get different results from the demo and from the code: not only are the probabilities different, the predicted labels are completely different. Is the model different? How can I use the demo model locally? I'm attaching a pair of images with a simple example.
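
For reference, the demo usage snippet has roughly this shape (a sketch only: the archive URL, import path, and predictor name below are illustrative; the demo page shows the exact values):

from allennlp.predictors.predictor import Predictor
import allennlp_models.pair_classification  # noqa: F401 -- illustrative import path; registers readers/predictors

# Both the archive URL and the predictor name are illustrative here; the
# demo page shows the exact values for the deployed model.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/snli-roberta-large.tar.gz",
    predictor_name="textual_entailment",
)
result = predictor.predict(
    premise="Two women are wandering along the shore drinking iced tea.",
    hypothesis="Two women are at the beach.",
)
print(result["label"], result["probs"])  # typical basic-classifier output keys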

(Screenshots: CaptureDemo, CaptureCode)

Versions: allennlp==1.0.0 allennlp-models==1.0.0

Thanks.

epwalsh commented 4 years ago

Hi @silvia596, the demo is using an older version of allennlp (1.0.0rc3). This could explain the difference. Also FWIW we're in the process of updating this model in the demo, see https://github.com/allenai/allennlp/issues/4457.

silvia596 commented 4 years ago

Thank you very much @epwalsh. Would there be any chance of downloading the demo model? I ran several tests and it fits my purpose better: with my dataset, the old model is better at capturing the generalization relationship between pairs of phrases.

matt-gardner commented 4 years ago

This makes me nervous that there might be a deeper problem, introduced with the tokenization changes in rc4, that we never fully fixed. See also this: https://github.com/allenai/allennlp/issues/4360#issuecomment-644222774. You're using the same model that's used in the demo, but you're using newer code. It's the code that's broken, not the model. Just check out rc3, and you should get the same results as in the demo.
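
(For anyone landing here later: pinning the release candidate should reproduce the demo's behavior, e.g. pip install allennlp==1.0.0rc3 allennlp-models==1.0.0rc3, assuming a matching allennlp-models release candidate was published.)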

epwalsh commented 4 years ago

> You're using the same model that's used in the demo, but you're using newer code. It's the code that's broken, not the model.

There were breaking changes to the RoBERTa tokenizer in recent Hugging Face transformers releases. I hope that's what's causing the discrepancy.

I just trained RoBERTa SNLI again on master and saw good results: https://beaker.org/ex/ex_3xaw1xzw6689/tasks/tk_zxz18n8idbn0 (validation accuracy of 0.9256).
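
For anyone who wants to reproduce a run like this, a minimal sketch (the config path and registering import are illustrative; allennlp-models ships the actual jsonnet training configs):

from allennlp.commands.train import train_model_from_file
import allennlp_models.pair_classification  # noqa: F401 -- illustrative; registers the SNLI reader

# Illustrative paths: substitute the real SNLI RoBERTa jsonnet config from
# the allennlp-models training_config directory and your own output dir.
train_model_from_file(
    "training_config/snli_roberta.jsonnet",
    "output/snli_roberta",
)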

But to add more confusion to the mix, my recent experiments with RoBERTa MNLI have had terrible performance. I'm in the process of git bisecting to figure out when the regression happened.
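
(Context for the bisect: git bisect binary-searches the commit range between a known-good commit, e.g. the rc3 tag, and a known-bad one, re-running the check at each step until it isolates the commit that introduced the regression.)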

silvia596 commented 4 years ago

Thank you so much. I used version rc3 and obtained the same results as the demo. However, I found a small bug: when you try to predict whether one word implies another (i.e., for premises and hypotheses consisting of a single word), I get this error:

>>> predictor.predict(hypothesis="laugh", premise="rats")
Traceback (most recent call last):

  File "<ipython-input-21-a0e91d98c2f7>", line 2, in <module>
    premise="rats")

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\predictors\decomposable_attention.py", line 39, in predict
    return self.predict_json({"premise": premise, "hypothesis": hypothesis})

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\predictors\predictor.py", line 66, in predict_json
    return self.predict_instance(instance)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\predictors\predictor.py", line 189, in predict_instance
    outputs = self._model.forward_on_instance(instance)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\models\model.py", line 141, in forward_on_instance
    return self.forward_on_instances([instance])[0]

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\models\model.py", line 167, in forward_on_instances
    outputs = self.make_output_human_readable(self(**model_input))

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\models\basic_classifier.py", line 121, in forward
    embedded_text = self._text_field_embedder(tokens)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\modules\text_field_embedders\basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(**tensors, **forward_params_values)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\allennlp\modules\token_embedders\pretrained_transformer_embedder.py", line 124, in forward
    embeddings = self.transformer_model(**parameters)[0]

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\transformers\modeling_bert.py", line 793, in forward
    pooled_output = self.pooler(sequence_output)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\silvia.duque.moro\AppData\Local\Continuum\anaconda3\envs\rc3\lib\site-packages\transformers\modeling_bert.py", line 435, in forward
    first_token_tensor = hidden_states[:, 0]

IndexError: index 0 is out of bounds for dimension 1 with size 0

I've verified that I also get an error using the demo.
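
A hedged diagnostic for the trace above: the IndexError on a size-0 dimension suggests the tokenizer produced an empty sequence for the single-word inputs. This sketch, which relies on the predictor's internal _json_to_instance helper (an implementation detail, not a supported API), would make that visible:

# Diagnostic sketch (rc3-era environment assumed): inspect the Instance the
# predictor builds for the single-word inputs.
instance = predictor._json_to_instance({"premise": "rats", "hypothesis": "laugh"})
print(instance)  # an empty token sequence here would explain the size-0 dimension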

epwalsh commented 4 years ago

I'm not sure what the cause of that bug is, but it might be fixed by the new model here: https://github.com/allenai/allennlp-models/pull/102.

After that PR is merged, you could use the predictor like this:

from allennlp_models.pretrained import load_predictor

# Download and load the retrained model by its model id, then predict.
predictor = load_predictor("pair-classification-roberta-snli")
predictor.predict(hypothesis="laugh", premise="rats")
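
The returned dict should include the predicted label and the class probabilities (assuming the usual pair-classification output keys, label and probs).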
matt-gardner commented 4 years ago

It looks like this is now fixed (@epwalsh says that he tested those inputs after the PR got merged, and it's good); if you still see issues with current code and models, please open a new issue.