[fix] changing tokenizer interface preprocessing script

fe1ixxu / BiBERT

This is the repository of the EMNLP 2021 paper "BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation".

MIT License


[fix] changing tokenizer interface preprocessing script #3

Open alisafaya opened 1 year ago

alisafaya commented 1 year ago

The current implementation in this file lowercases the input automatically, because it uses the BertTokenizer interface from huggingface/transformers with its default settings. This PR fixes that.
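To make the effect concrete, here is a minimal stdlib-only sketch of the normalization that BERT's basic tokenizer applies when `do_lower_case` is left at its default of `True`: the text is lowercased and, unless `strip_accents` is set explicitly, accents are stripped as well. The helper name `bert_style_normalize` and the sample sentence are hypothetical, and this is an imitation of that step for illustration, not the code changed by this PR.

```python
import unicodedata


def bert_style_normalize(text: str, do_lower_case: bool = True) -> str:
    """Hypothetical helper imitating BERT's lowercasing step (not the library code)."""
    if not do_lower_case:
        # Cased behaviour: the text is passed through untouched.
        return text
    lowered = text.lower()
    # Decompose characters so accents become separate combining marks ...
    decomposed = unicodedata.normalize("NFD", lowered)
    # ... then drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sentence = "Die Bäckerei in München öffnet früh."
print(bert_style_normalize(sentence))                       # default (uncased) behaviour
print(bert_style_normalize(sentence, do_lower_case=False))  # cased behaviour
```

With the real library, the equivalent switch is the `do_lower_case=False` argument when constructing `BertTokenizer`. For German text the default is lossy in two ways: noun capitalization disappears and umlauts are folded to plain vowels.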