deepset-ai / FARM

Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Control over how text gets truncated? #796

Closed (johann-petrak closed this issue 2 years ago)

johann-petrak commented 3 years ago

I guess this is really a Hugging Face transformers issue, but is there a way to control how text gets truncated for simple text classification (that is, only one text field gets tokenized)?

Sometimes the relevant parts of a longer text are at its beginning and its end, so ideally the tokenizer would create the word pieces, then keep k1 from the beginning of the sequence and k2 from the end (possibly separated by a separator token) such that k1+k2 <= maxlen.

Is this possible somehow, or does text always have to be truncated so that only the beginning is used?
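To make the idea concrete, here is a minimal sketch of the head-plus-tail truncation I have in mind. It is not FARM or transformers code: the function name `truncate_head_tail` and its parameters are hypothetical, and it works on a plain list of word pieces (for example, what a tokenizer's `tokenize` method returns). I am assuming here that the optional separator counts toward the length limit, and that any special tokens the model adds later have already been subtracted from `max_len`.

```python
from typing import List, Optional, Sequence


def truncate_head_tail(
    tokens: Sequence[str],
    max_len: int,
    k1: int,
    sep_token: Optional[str] = None,
) -> List[str]:
    """Keep up to k1 tokens from the start and fill the rest from the end.

    Hypothetical helper, not part of FARM or transformers.
    Assumption: the separator (if given) counts toward max_len.
    """
    if max_len < 0 or k1 < 0:
        raise ValueError("max_len and k1 must be non-negative")

    tokens = list(tokens)
    if len(tokens) <= max_len:
        return tokens  # short enough, nothing to truncate

    sep = [sep_token] if sep_token is not None else []
    budget = max_len - len(sep)  # room left for real tokens
    if budget <= 0:
        # No room for head + separator: fall back to plain head truncation.
        return tokens[:max_len]

    head_len = min(k1, budget)
    tail_len = budget - head_len  # this is k2
    head = tokens[:head_len]
    # Guard against tokens[-0:], which would return the whole list.
    tail = tokens[len(tokens) - tail_len:] if tail_len > 0 else []
    return head + sep + tail


if __name__ == "__main__":
    pieces = list("abcdefghij")  # stand-in for 10 word pieces
    print(truncate_head_tail(pieces, max_len=6, k1=2, sep_token="[SEP]"))
```

With 10 pieces, `max_len=6`, `k1=2` and a separator, this keeps 2 pieces from the start, the separator, and 3 pieces from the end. The result could then be converted to ids (for example with `convert_tokens_to_ids`) before the model's special tokens are added, but where to hook this into the processing pipeline is exactly what I am unsure about.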

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.