dbmdz / berts

DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models
MIT License

Handling of German special characters #13

Closed olastor closed 4 years ago

olastor commented 4 years ago

Hi, I am wondering how the tokenizer or the German model will treat input words with special characters like "ß", "ö", "ä", "ü".

I have some input sentences in Latin-1 where the special characters are normalized, e.g. "ß" -> "ss" or "ö" -> "oe". Will training with this data be effective, or do I have to convert the special characters back to "ß", "ö", "ä", "ü" first?

Thanks

stefan-it commented 4 years ago

Hi @olastor ,

for the cased model, no normalization is performed. Consider this example:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
tokenizer.tokenize("ÖÄÜ öäüß")
```

This will output:

```python
['Ö', '##Ä', '##Ü', 'ö', '##ä', '##üß']
```

So German umlauts are not normalized. If you're using the uncased model, there's additionally the "strip accents" step used by the original BERT tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
tokenizer.tokenize("ÖÄÜ öäüß")
```

The output is lowercased and the umlauts are "stripped":

```python
['o', '##au', 'o', '##auß']
```

Note that there is no "ß" -> "ss" conversion.

So just keep this in mind when using the uncased model :)
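For intuition, the "strip accents" step can be sketched in plain Python with the standard-library `unicodedata` module. This is a minimal sketch of the idea (Unicode NFD decomposition followed by dropping combining marks), not the actual `transformers` implementation, which lives in its `BasicTokenizer`:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD decomposition splits e.g. "ö" into "o" + a combining diaeresis;
    # dropping characters in the "Mn" (nonspacing mark) category then
    # removes the accent while keeping the base letter.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Lowercasing plus accent stripping, mirroring the uncased pipeline:
print(strip_accents("ÖÄÜ öäüß".lower()))  # oau oauß
```

Since "ß" has no NFD decomposition into a base letter plus combining mark, it passes through unchanged, which is why the uncased output above still contains "ß".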

olastor commented 4 years ago

@stefan-it Thank you very much for the quick and elaborate answer, that's what I was looking for!