Closed: dirkgr closed this issue 3 years ago.
I'm not sure I agree with this, as I don't see a way of doing this that lets you easily swap between tokenization schemes that don't use special tokens. Unless you're saying this should be a method on our base `Tokenizer` class that's a no-op by default?
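(For concreteness, one way to read "a no-op by default" is a hook like the sketch below. The class and method names here are illustrative only, not AllenNLP's actual API: subclasses whose scheme has special tokens override the hook, everything else inherits the no-op.)

```python
from typing import List, Optional


class Tokenizer:
    """Illustrative base class, not AllenNLP's actual code."""

    def add_special_tokens(
        self, tokens1: List[str], tokens2: Optional[List[str]] = None
    ) -> List[str]:
        # No-op by default: tokenizers without special tokens just
        # concatenate the sequences they are given.
        return tokens1 + (tokens2 or [])


class BertStyleTokenizer(Tokenizer):
    """Illustrative override for a tokenizer that does use special tokens."""

    def add_special_tokens(
        self, tokens1: List[str], tokens2: Optional[List[str]] = None
    ) -> List[str]:
        if tokens2 is None:
            return ["[CLS]"] + tokens1 + ["[SEP]"]
        return ["[CLS]"] + tokens1 + ["[SEP]"] + tokens2 + ["[SEP]"]
```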
I see that concern. I'm also not wild about making special tokens an aspect of tokenizers. It doesn't feel like a tokenization concern. It feels like it's an implementation detail of the embedder, but we don't want to do it there because we are worried about performance (though we have never measured it).
We could have the indexer add the special tokens instead?
What do we do about pairs of text sequences then?
> We could have the indexer add the special tokens instead?
The contract is that each token in the `TextField` gets an embedding in the model. So if you want to be able to use, e.g., the `[CLS]` token, you need to have it be a token, not something that's added in the indexer.
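In AllenNLP 1.x terms that contract looks roughly like this (the model name is just an example; with `add_special_tokens` left at its default the tokenizer inserts the special tokens itself):

```python
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")  # adds [CLS]/[SEP] itself
indexer = PretrainedTransformerIndexer("bert-base-uncased")

tokens = tokenizer.tokenize("a single sentence")
# [CLS] and [SEP] are ordinary Token objects in this list, so the embedder
# produces a vector for each of them and a model can, e.g., pool over [CLS].
field = TextField(tokens, {"tokens": indexer})
```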
> What do we do about pairs of text sequences then?
I've been thinking that the best way to handle this, which would solve several problems, is to have something like a `TextPairField`, or a `TextListField`, or something. If you're using a transformer, this would produce a single tensor with everything concatenated. If you're not, you can specify names for the individual tensors that you get out, so you can, e.g., get `question` and `passage` output for BiDAF. But this also messes with how the tokenizer works, so it's not a fully-thought-out solution to anything at this point.
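(Nothing like this exists in AllenNLP; purely as a sketch of the idea, with plain strings standing in for tokens, the hypothetical field might expose something like this.)

```python
from typing import Dict, List, Union


class TextListField:
    """Hypothetical field from the comment above; not an existing AllenNLP class."""

    def __init__(self, named_sequences: Dict[str, List[str]], for_transformer: bool):
        # e.g. {"question": [...], "passage": [...]}
        self.named_sequences = named_sequences
        self.for_transformer = for_transformer

    def token_lists(self) -> Union[List[str], Dict[str, List[str]]]:
        if self.for_transformer:
            # A single concatenated sequence for one transformer forward pass
            # (a real version would also insert the separator tokens here).
            concatenated: List[str] = []
            for tokens in self.named_sequences.values():
                concatenated.extend(tokens)
            return concatenated
        # Otherwise keep the pieces separate, so a model like BiDAF can ask
        # for its "question" and "passage" tensors by name.
        return self.named_sequences
```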
Closing as won't fix. We are scared of the silent backwards-compatibility issues.
It should be `False` by default, with special tokens added using the `add_special_tokens` call. All the training configs will have to be updated accordingly, or, better, the code should be updated to add special tokens, at least when we know that we'll only use transformer tokenizers. We didn't do this in 1.0 because of backwards compatibility.
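For illustration, the proposed default would look roughly like this in a dataset reader (assuming the existing `PretrainedTransformerTokenizer` and its `add_special_tokens` method; the model name and texts are just examples):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# Proposed default: no automatic special tokens at tokenization time...
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", add_special_tokens=False)

question = tokenizer.tokenize("What is AllenNLP?")
passage = tokenizer.tokenize("AllenNLP is an NLP research library.")

# ...and one explicit call where the reader wants them; for a BERT-style model
# this yields [CLS] question [SEP] passage [SEP] as real tokens.
tokens = tokenizer.add_special_tokens(question, passage)
```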