I guess this is really a huggingface transformers issue, but is there a way to control how text gets truncated for simple text classification (i.e., a single text field that gets tokenized)?
Sometimes the relevant part of a longer text is at the beginning and the end of that text, so ideally the tokenizer would create the word pieces, then keep k1 tokens from the beginning of the sequence and k2 from the end (possibly separated by a separator token) such that k1 + k2 <= maxlen.
Is this possible somehow, or does the text always have to be truncated so that only the beginning is used?
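For context, here is a minimal sketch of the head+tail truncation described above, applied to already-tokenized ids rather than through the tokenizer itself. The function name `head_tail_truncate` and the `head_frac` / `sep_id` parameters are made up for illustration; this is not part of the transformers API:

```python
def head_tail_truncate(tokens, max_len, head_frac=0.5, sep_id=None):
    """Keep tokens from both the head and the tail of an over-long sequence.

    head_frac controls what share of the length budget goes to the head;
    sep_id, if given, is inserted between the two kept spans.
    """
    if len(tokens) <= max_len:
        return tokens
    # Reserve one slot for the separator when one is requested.
    budget = max_len - (1 if sep_id is not None else 0)
    k1 = int(budget * head_frac)   # tokens kept from the beginning
    k2 = budget - k1               # tokens kept from the end
    middle = [sep_id] if sep_id is not None else []
    return tokens[:k1] + middle + tokens[-k2:]
```

For example, truncating `list(range(10))` to 6 tokens with an even split keeps `[0, 1, 2]` from the head and `[7, 8, 9]` from the tail.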