olastor closed this issue 4 years ago
Hi @olastor,
for the cased model, no normalization is performed. For example, consider this snippet:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
tokenizer.tokenize("ÖÄÜ öäüß")
This will output:
['Ö', '##Ä', '##Ü', 'ö', '##ä', '##üß']
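You can also check this at the vocabulary level; here's a quick sketch using get_vocab() (continuing the snippet above):
vocab = tokenizer.get_vocab()
# Sanity check (sketch): the umlaut subwords are part of the cased
# vocabulary, so they are kept instead of being mapped to [UNK].
print("ö" in vocab, "##ä" in vocab)  # expected: True True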
So German umlauts are not normalized. If you're using the uncased model, however, the "strip accents" step from the original BERT tokenizer is applied:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
tokenizer.tokenize("ÖÄÜ öäüß")
The output is lowercased and the accents are stripped from the umlauts:
['o', '##au', 'o', '##auß']
Note that there's no ß to ss conversion.
So just keep this in mind when using the uncased model :)
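If you need to keep the umlauts with the uncased model, recent transformers versions expose a strip_accents flag on BertTokenizer. A sketch, assuming your version supports it (note that the uncased vocabulary was built from accent-stripped text, so keeping umlauts may produce odd subwords):
from transformers import BertTokenizer
# Sketch: disable accent stripping while keeping lowercasing.
# Caveat: the uncased vocab was built from accent-stripped text,
# so umlaut characters may fall back to unusual subwords or [UNK].
tokenizer = BertTokenizer.from_pretrained(
    "dbmdz/bert-base-german-uncased",
    strip_accents=False,
)
print(tokenizer.tokenize("ÖÄÜ öäüß"))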
@stefan-it Thank you very much for the quick and elaborate answer; that's exactly what I was looking for!
Hi, I am wondering how the tokenizer or the German model treats input words with special characters like "ß", "ö", "ä", "ü".
I have some input sentences in Latin-1 where the special characters are normalized, e.g. "ß" -> "ss" or "ö" -> "oe". Will training with this data be effective, or do I have to convert the special characters back to "ß", "ö", "ä", "ü"?
Thanks
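For reference, a naive back-conversion could look like the sketch below. This is purely illustrative: "ae", "oe", "ue" and "ss" also occur natively in German words, so a blind replace will over-correct, and a dictionary-based check is safer.
# Hypothetical sketch: restore umlauts in ASCII-normalized text.
# Beware of over-correction: "Messe" would wrongly become "Meße".
REVERSE_MAP = {"ae": "ä", "oe": "ö", "ue": "ü", "ss": "ß",
               "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}

def denormalize(text):
    for ascii_form, umlaut in REVERSE_MAP.items():
        text = text.replace(ascii_form, umlaut)
    return text

print(denormalize("Strasse"))  # -> "Straße"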