dbmdz / berts

DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models

max sequence length #33

Open denizerden opened 3 years ago

denizerden commented 3 years ago

How can I use the Turkish sentiment cased BERT model to calculate sentiment scores for sentences longer than the 512-token maximum sequence length?

stefan-it commented 3 years ago

Hi @denizerden,

normally, longer sentences will simply be truncated.
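Here is a minimal sketch of that truncation behavior with the `transformers` tokenizer; `dbmdz/bert-base-turkish-cased` is used as a stand-in, so substitute the sentiment checkpoint you are actually loading:

```python
from transformers import AutoTokenizer

# Stand-in model name; replace with your sentiment checkpoint.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

long_text = "çok uzun bir cümle " * 400  # deliberately longer than 512 tokens

# Without truncation the encoded input exceeds the model's 512-token limit
# and would fail at inference time.
untruncated = tokenizer(long_text)
print(len(untruncated["input_ids"]))  # > 512

# With truncation=True everything after the first 512 subtokens is discarded.
truncated = tokenizer(long_text, truncation=True, max_length=512)
print(len(truncated["input_ids"]))  # 512
```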

Alternatively, you could use the BERTurk models with a larger vocab (128k). With these models you should be able to handle "longer" sentences, because the tokenizer in theory produces fewer subtokens per token (compared to the "normal" BERTurk models that have a 32k vocab).
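As a rough illustration (assuming the 128k checkpoint name `dbmdz/bert-base-turkish-128k-cased` as listed on the Hugging Face model hub), you can compare how many subtokens each tokenizer produces for the same sentence:

```python
from transformers import AutoTokenizer

tok_32k = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
tok_128k = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-cased")

sentence = "Kullanıcı yorumlarının duygu analizi oldukça zorlayıcı olabilir."

# The larger vocab covers more whole words, so fewer words get split into
# subword pieces and the same text consumes fewer of the 512 slots.
print(len(tok_32k.tokenize(sentence)))
print(len(tok_128k.tokenize(sentence)))
```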