allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Default `add_special_tokens` to `False` in the pretrained transformer tokenizer #4572

Closed: dirkgr closed this issue 3 years ago

dirkgr commented 4 years ago

It should be False by default, with special tokens added using the `add_special_tokens` call. All the training configs will have to be updated accordingly, or, better, the code should be updated to add special tokens itself, at least when we know that we'll only use transformer tokenizers. We didn't do this in 1.0 because of backwards compatibility.
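
A minimal sketch of what that usage could look like, assuming the `PretrainedTransformerTokenizer` API from AllenNLP 1.x (the constructor argument and the `add_special_tokens` method already exist; only the default would change):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# Under the proposal, tokenization would not insert special tokens by default...
tokenizer = PretrainedTransformerTokenizer(
    "bert-base-uncased",
    add_special_tokens=False,  # proposed default
)
tokens = tokenizer.tokenize("AllenNLP is built on PyTorch.")

# ...and they would instead be added explicitly, e.g. by the dataset reader.
tokens_with_specials = tokenizer.add_special_tokens(tokens)
```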

matt-gardner commented 4 years ago

I'm not sure I agree with this, as I don't see a way of doing it that lets you easily swap between tokenization schemes that don't use special tokens. Unless you're saying this should be a method on our base `Tokenizer` class that is a no-op by default?

dirkgr commented 4 years ago

I see that concern. I'm also not wild about making special tokens an aspect of tokenizers. It doesn't feel like a tokenization concern. It feels like it's an implementation detail of the embedder, but we don't want to do it there because we are worried about performance (though we have never measured it).

We could have the indexer add the special tokens instead?

What do we do about pairs of text sequences then?

matt-gardner commented 4 years ago

> We could have the indexer add the special tokens instead?

The contract is that each token in the TextField gets an embedding in the model. So if you want to be able to use, e.g., the [CLS] token, you need to have it be a token, not something that's added in the indexer.
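
To make that contract concrete, here is a small illustration (a sketch using the AllenNLP 1.x names): the [CLS] token has to appear as an actual Token in the TextField for the model to be able to get at its embedding.

```python
from allennlp.data import Token
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer

# [CLS] and [SEP] are explicit Tokens, so each one gets its own embedding in
# the model; if the indexer added them silently, the model could not refer to
# them (e.g. to pool [CLS] for classification).
indexer = PretrainedTransformerIndexer("bert-base-uncased")
tokens = [Token("[CLS]"), Token("allen"), Token("##nl"), Token("##p"), Token("[SEP]")]
field = TextField(tokens, token_indexers={"tokens": indexer})
```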

> What do we do about pairs of text sequences then?

I've been thinking that the best way to handle this, which would solve several problems, is to have something like a TextPairField, or a TextListField, or something. If you're using a transformer, this would produce a single tensor with everything concatenated. If you're not, you can specify names for the individual tensors that you get out, so you can, e.g., get question and passage output for BiDAF. But this also messes with how the tokenizer works, so it's not a fully-thought-out solution to anything at this point.
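
Purely as an illustration of that idea (no such field exists in AllenNLP; the name and interface here are hypothetical), a TextListField might look roughly like this:

```python
from typing import Dict, List
from allennlp.data import Token

class TextListField:
    """Hypothetical field holding named pieces of text (e.g. question/passage).

    With a transformer tokenizer, the pieces would be concatenated into a single
    sequence (with the appropriate special tokens); otherwise each piece would
    be exposed as its own named tensor, e.g. "question" and "passage" for BiDAF.
    """

    def __init__(self, pieces: Dict[str, List[Token]], concatenate: bool = True) -> None:
        self.pieces = pieces
        self.concatenate = concatenate

    def flattened_tokens(self) -> List[Token]:
        # Transformer case: one concatenated token sequence.
        return [token for piece in self.pieces.values() for token in piece]
```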

dirkgr commented 3 years ago

Closing as won't fix. We're worried about the silent backwards-compatibility issues this would cause.