allenai / sequential_sentence_classification

https://arxiv.org/pdf/1909.04054
Apache License 2.0

Incorrect [SEP] token ID #8

Closed KacperKubara closed 4 years ago

KacperKubara commented 4 years ago

Hey! I have noticed that in model.py you obtain the [SEP] mask with these lines of code: https://github.com/allenai/sequential_sentence_classification/blob/8ddc0210d08075fdcbd8e6cb2a8f88d2c8912e2e/sequential_sentence_classification/model.py#L101-L107

Is 103 the correct value? I looked at the vocab.txt from the original BERT model and it seems that the [SEP] token has an ID of 102 (since the vocab file is 0-indexed). Is it a mistake, or do you use a different vocab file than the original one?

Best, Kacper

armancohan commented 4 years ago

We are using SciBERT with SciVocab. In the SciBERT vocabulary the SEP token id is 103, but in the original BERT vocabulary it is 102. As you noticed, we currently hardcode this in the code; a better solution would be to get the SEP token id from the vocabulary.
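For reference, here is a minimal sketch of reading the id from the vocabulary using the HuggingFace `transformers` tokenizers (the repo itself wires the tokenizer through AllenNLP, so this is only illustrative; the model names are the usual hub checkpoints):

```python
from transformers import AutoTokenizer

# Sketch: look the separator id up from the vocabulary instead of hardcoding it.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
scibert_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

print(bert_tok.sep_token_id)     # 102 in the original BERT vocab
print(scibert_tok.sep_token_id)  # 103 in SciBERT's SciVocab
```

The comparison in model.py could then use that id rather than the literal 103.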

KacperKubara commented 4 years ago

Okay, great, thanks for letting me know. I will close this issue since it was really just a question; hopefully it will help others too :)

UrszulaCzerwinska commented 3 years ago

Hi, does this mean that if I want to use RoBERTa-type models, your code is not directly usable?

In RoBERTa the vocabulary comes from the SentencePiece package and is adapted to the training data so that there are no unknown tokens, which means there is no fixed vocab.

I wanted to use your code with the French BERT, but it is based on RoBERTa.

armancohan commented 3 years ago

This codebase currently only supports BERT/SciBERT transformers, and we haven't really tested it with other transformers.

If you wanted to use other transformers, some changes need to be made to the config, in addition to changing the SEP token id mentioned above:

Tokenizer: https://github.com/allenai/sequential_sentence_classification/blob/8ddc0210d08075fdcbd8e6cb2a8f88d2c8912e2e/sequential_sentence_classification/config.jsonnet#L21

Pretrained transformer: https://github.com/allenai/sequential_sentence_classification/blob/8ddc0210d08075fdcbd8e6cb2a8f88d2c8912e2e/sequential_sentence_classification/config.jsonnet#L48

I think the AllenNLP version also needs to be upgraded to a later version that supports RoBERTa. That may break some functionality in the current codebase and may need some debugging to fix. A PR would be appreciated if you end up going that route.
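As a small, hedged sketch (again using the HuggingFace `transformers` tokenizers; the checkpoint names are only examples), you can check what the separator token and its id are for a candidate model before touching the config and the hardcoded id:

```python
from transformers import AutoTokenizer

# Example checkpoints only; substitute whichever RoBERTa-style / French model you plan to use.
for name in ["roberta-base", "camembert-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, repr(tok.sep_token), tok.sep_token_id)
```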

UrszulaCzerwinska commented 3 years ago

Thank you for your prompt answer. I will do more digging, and I will inform you for sure if I get to something!

UrszulaCzerwinska commented 3 years ago

Hi @armancohan , I adapted your code to AllenNLP 2.0. Quite a lot of changes were required. I cannot make it reproduce your results exactly, but I managed to get close by playing with the params a bit. I will send you a PR with more details soon.

In theory it would be applicable to RoBERTa or other transformers. I was wondering if you have any suggestions about which token to use when there is more than one intermediate token.

RoBERTa, for embedding multiple sentences, uses two intermediate tokens:

<s> token token token </s></s> token token </s></s> token token token </s>

I tested with the first of the intermediate tokens plus the closing token, and it seems promising. It poses a problem for batch size > 1, as I didn't find a way to construct a mask that marks only one of the two </s></s> tokens across batches. If I take both </s>, they will not correspond to the number of sentences.
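As an untested sketch of what I mean, one idea is to keep only the first `</s>` of each adjacent pair when building the mask (assuming the RoBERTa-style separator id, 2 for `roberta-base`; `first_sep_mask` is just an illustrative helper name):

```python
import torch

def first_sep_mask(token_ids: torch.Tensor, sep_id: int = 2) -> torch.Tensor:
    """Mark each separator except one that directly follows another separator,
    so a </s></s> pair contributes a single True per sentence boundary."""
    is_sep = token_ids == sep_id                 # (batch, seq_len) booleans
    prev_is_sep = torch.zeros_like(is_sep)
    prev_is_sep[:, 1:] = is_sep[:, :-1]          # same mask shifted right by one
    return is_sep & ~prev_is_sep

# <s> t t t </s></s> t t </s></s> t t t </s>  ->  three True positions, one per sentence
ids = torch.tensor([[0, 11, 12, 13, 2, 2, 14, 15, 2, 2, 16, 17, 18, 2]])
print(first_sep_mask(ids))
```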

Any input on this is welcome.