I guess this is really a huggingface transformers issue, but is there a way to control how text gets truncated for simple text classification (i.e., a single text field that gets tokenized)?
Sometimes the relevant part of a longer text is at the beginning and the end of that text, so ideally the tokenizer would create the word pieces, then keep k1 tokens from the beginning of the sequence and k2 from the end (possibly separated by a separator token) such that k1 + k2 <= maxlen.
Is this possible somehow, or does the text always have to be truncated so that only the beginning is used?
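For context, here is a minimal sketch of the head+tail truncation described above, applied to already-tokenized ids rather than through the tokenizer itself. The function name `head_tail_truncate` and the `head_frac` / `sep_id` parameters are made up for illustration; this is not part of the transformers API:

```python
def head_tail_truncate(tokens, max_len, head_frac=0.5, sep_id=None):
    """Keep tokens from both the head and the tail of an over-long sequence.

    head_frac controls what share of the length budget goes to the head;
    sep_id, if given, is inserted between the two kept spans.
    """
    if len(tokens) <= max_len:
        return tokens
    # Reserve one slot for the separator when one is requested.
    budget = max_len - (1 if sep_id is not None else 0)
    k1 = int(budget * head_frac)   # tokens kept from the beginning
    k2 = budget - k1               # tokens kept from the end
    middle = [sep_id] if sep_id is not None else []
    return tokens[:k1] + middle + tokens[-k2:]
```

For example, truncating `list(range(10))` to 6 tokens with an even split keeps `[0, 1, 2]` from the head and `[7, 8, 9]` from the tail.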