The model is fine. This line adds the special tokens we need.
The pooler has `cls_is_last_token` set to `True`, which is a questionable choice for RoBERTa, but it just means we get the embedding of the `"</s>"` token instead of `"<s>"`. That's not the end of the world.
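To make the effect of `cls_is_last_token` concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not AllenNLP's pooler code: the function name `cls_pool` and the toy one-dimensional vectors are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def cls_pool(embeddings: List[Vector], mask: List[bool],
             cls_is_last_token: bool = False) -> Vector:
    """Pick one vector per sequence: the first position, or the last unmasked one."""
    if not cls_is_last_token:
        return embeddings[0]
    # Index of the last position that is not padding.
    last_index = max(i for i, keep in enumerate(mask) if keep)
    return embeddings[last_index]


# Toy sequence: position 0 stands for "<s>", position 3 for "</s>".
sequence = [[0.0], [1.0], [2.0], [3.0]]
mask = [True, True, True, True]

first = cls_pool(sequence, mask, cls_is_last_token=False)
last = cls_pool(sequence, mask, cls_is_last_token=True)

# With one trailing padding position, the last unmasked index is 2.
padded_last = cls_pool(sequence, [True, True, True, False], cls_is_last_token=True)
```

Either setting only selects a different position of whatever sequence the pooler receives, so the result depends on which vectors actually sit at the first and last positions.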
Thanks @dirkgr for the clarification, and sorry to bother you again! Maybe I do not get the full picture, but I would be super grateful if you could point me further in the right direction.
Is it not set to `False`? Maybe I am reading the jsonnet the wrong way ...
Sorry again to bother you, and thank you for your time!
You are right about these things, but I see the correct tokens in the debugger. The sequences that end up in the `forward()` method all start with `0`, as they should. I'm investigating ...
This is the line where the special tokens are added:
When I run your example, `print(tensor_dict)` also prints a tensor that starts with `0` (`"<s>"`) and ends with `2` (`"</s>"`). Looks like everything is alright?
In [6]: print(tensor_dict)
{'text': {'tokens': {'token_ids': tensor([[ 0, 26615, 9226, 2279, 2160, 154, 20951, 328, 2]]), 'mask': tensor([[True, True, True, True, True]]), 'type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'wordpiece_mask': tensor([[True, True, True, True, True, True, True, True, True]]), 'offsets': tensor([[[1, 1],
[2, 2],
[3, 5],
[6, 6],
[7, 7]]])}}}
Thank you @dirkgr for investigating further, and sorry for my delayed answer!
You are right, the output of the indexer does contain the wordpiece indices, but I think the key to the described behavior is the `offsets` and how the `PretrainedTransformerMismatchedEmbedder` uses them. If you set a breakpoint at the end of the `forward` method in a debugger, you can see that the first returned embedding vector does not correspond to the index 0 (`<s>`) token (compare `embeddings` with the returned `orig_embeddings`). This is because the first entry in `offsets` is `[1, 1]`. In the `basic_classifier` model we then pass these returned embeddings into the `cls_pooler`.
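A small sketch of what I mean, in plain Python rather than the actual AllenNLP code. It assumes the embedder averages the wordpiece vectors inside each inclusive `[start, end]` span; the helper name `pool_wordpieces` and the toy one-dimensional vectors are made up for illustration. The offsets are the ones printed in the `tensor_dict` above.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pool_wordpieces(wordpiece_embeddings: Sequence[Vector],
                    offsets: Sequence[Tuple[int, int]]) -> List[Vector]:
    """Average the wordpiece vectors inside each inclusive [start, end] span."""
    pooled = []
    for start, end in offsets:
        span = wordpiece_embeddings[start:end + 1]
        dim = len(span[0])
        pooled.append([sum(vec[i] for vec in span) / len(span) for i in range(dim)])
    return pooled


# Toy embeddings: the single value is just the wordpiece position,
# so position 0 is "<s>" and position 8 is "</s>".
wordpiece_embeddings = [[float(i)] for i in range(9)]
offsets = [(1, 1), (2, 2), (3, 5), (6, 6), (7, 7)]

token_embeddings = pool_wordpieces(wordpiece_embeddings, offsets)

# The first pooled vector comes from wordpiece 1, not from "<s>" at index 0.
first_token_vector = token_embeddings[0]
```

Because no span covers index 0 or index 8, neither special token survives the pooling, and anything that reads the first position downstream gets the first "real text" token.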
Thanks again for your time, and please let me know if I should provide other examples!
I get it now. You are right! I put a fix at
branch of AllenNLP.
I think the usage of the `cls_pooler` as
is not appropriate in this model. If I am not mistaken, the `PretrainedTransformerMismatchedIndexer`/`Embedder` get rid of the special tokens via the `offsets`, so the `cls_pooler` just takes the embedding of the first "real text" token.
Python traceback:
``` ```
OS: Ubuntu 20.04
Python version: 3.7.7
Output of `pip freeze`:
Steps to reproduce
Example source:
```
# Module paths below assume the AllenNLP 1.x package layout.
from allennlp.data.tokenizers import SpacyTokenizer
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.data.fields import TextField
from allennlp.data.vocabulary import Vocabulary
from allennlp.data.instance import Instance
from allennlp.data.batch import Batch
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

input_str = "Check this annoying string!"

# Tokenize into words; the mismatched indexer maps them to wordpieces.
tokenizer = SpacyTokenizer()
token_indexer = {
    "tokens": PretrainedTransformerMismatchedIndexer(
        model_name="distilroberta-base"
    )
}

tf = TextField(tokenizer.tokenize(input_str), token_indexer)
instance = Instance({"text": tf})
vocab = Vocabulary.from_instances([instance])

# Build a one-instance batch and index it.
batch = Batch([instance])
batch.index_instances(vocab)
padding_length = batch.get_padding_lengths()

embedder = PretrainedTransformerMismatchedEmbedder(
    model_name="distilroberta-base"
)
tf_embedder = BasicTextFieldEmbedder({"tokens": embedder})

# Convert to tensors and embed: one vector per original (spaCy) token.
tensor_dict = batch.as_tensor_dict(padding_length)
embeddings = tf_embedder(tensor_dict["text"])

print(tf)
print(tensor_dict)
print(embeddings)
```