Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

Do you need to tokenize your data when using a BERT/ROBERTA model? #111

Open zolastro opened 2 years ago

zolastro commented 2 years ago

Considering that these models ship with their own tokenizers and BPE/WordPiece vocabularies, what format should the input files be in when training a QE model with one of these LMs? Should any prior tokenization or casing/truecasing model be applied first?
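For illustration, here is a toy sketch (not OpenKiwi code; `VOCAB` and `wordpiece` are hypothetical) of the greedy longest-match subword splitting that BERT-style tokenizers apply internally, which is the reason they typically expect raw, untokenized text as input:

```python
# Toy WordPiece-style greedy longest-match tokenizer. Pretrained LMs ship
# their own subword vocabularies and split raw words into known pieces,
# so pre-tokenizing the input yourself can conflict with that segmentation.
VOCAB = {"un", "##tok", "##en", "##ized", "sent", "##ence", "a", "."}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece marker
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece covers this span
    return pieces

print(wordpiece("untokenized"))  # → ['un', '##tok', '##en', '##ized']
```

The point of the sketch: segmentation is driven entirely by the model's own vocabulary, so whether the training files should be plain raw text or pre-tokenized depends on which tokenizer the QE pipeline applies.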

Thanks in advance for your help!