It's clearly an unwanted model behaviour caused by the rather clean corpus used for vocab creation. We (and also others) learned from it :) However, this is not directly related to FARM, so I am closing this issue now.
Recommendation for others working with GermanBERT and data with many @ / ! (standalone or at the start of a word):
Simply add those tokens to your tokenizer via
tokenizer.add_tokens(["!", "@", ...])
The model still won't have pretrained embeddings for those characters, but it can then actually learn them during LM finetuning / downstream training.
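For example, a minimal sketch using the plain transformers API (model names and the token list are just placeholders, adapt them to your setup):

```python
from transformers import BertForMaskedLM, BertTokenizer

# Sketch: add missing standalone symbols and grow the embedding matrix so the
# new tokens can be learned during LM finetuning / downstream training.
tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
model = BertForMaskedLM.from_pretrained("bert-base-german-cased")

num_added = tokenizer.add_tokens(["!", "@"])   # extend as needed
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
```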
I am reopening this issue, since the conversion of normal symbols to [UNK] seems to be related to a change in the tokenization. Now symbols that were previously handled correctly, like the "?" in "Ist das eine Frage?", also get converted to [UNK].
This might be due to the BasicTokenizer. It used to be a whitespace tokenizer but is now splitting "Frage?" into "Frage" and "?" tokens. A potential fix would be to change the occurrences of ##? or ##. etc. in vocab.txt to standalone symbols and re-upload the files.
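Roughly, such a vocab.txt rewrite could look like this (only a sketch; the file paths and the symbol list are placeholders, not the actual change):

```python
# Sketch: replace "##"-prefixed single-character punctuation entries with
# standalone entries, keeping the line order (and thus the token ids) intact.
symbols_to_fix = {"?", "!", ".", ",", ";"}  # placeholder list

with open("vocab.txt", encoding="utf-8") as f:
    entries = [line.rstrip("\n") for line in f]

existing = set(entries)
fixed = []
for entry in entries:
    if entry.startswith("##") and entry[2:] in symbols_to_fix and entry[2:] not in existing:
        fixed.append(entry[2:])  # e.g. "##?" -> "?"
    else:
        fixed.append(entry)

with open("vocab_fixed.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(fixed) + "\n")
```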
Also see my little gist here which shows the problem: https://gist.github.com/PhilipMay/9ece696dc11d7d57fee3f2f67b591eb4
Same problem happens when you use FARM to tokenize btw.
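For reference, the behaviour can be reproduced with plain transformers as well (a small sketch along the lines of the gist, assuming the model name bert-base-german-cased):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
print(tokenizer.tokenize("Ist das eine Frage?"))
# The BasicTokenizer splits off "?", and since the vocab only contains "##?"
# (not a standalone "?"), the symbol ends up as [UNK].
```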
Another alternative fix would be to provide a tokenizer that works the same way as the tokenizer that was used to create this language model. IMO the tokenizer (and the whole preprocessing pipeline) should never change between language model creation, finetuning, training and production.
IMO this would be a better solution than "hacking" additional tokens into the vocab. The problem with ? and ! is just a symptom of an inconsistent tokenizer. No one knows the other side effects.
The tokenizer could be provided by:
For me, fixing the HF tokenizer by adding a backwards compatibility switch would be the best solution. This should be easy because the "old" code should still be available in Git.
What do you think?
Maybe someone (tm) could spot the "breaking" change on the HF side. Here is the link: https://github.com/huggingface/transformers/commits/b90745c5901809faef3136ed09a689e7d733526c/src/transformers/tokenization_bert.py
The more I think about changing the vocab.txt file by removing "##" from punctuation, the better this solution becomes, because:
Possible downsides I see:
We also looked into the tokenization issue on the HF side. This is the problematic line where the punctuation is split off from the text. A code-level fix is to disable basic tokenization: you can set do_basic_tokenize=False when loading the tokenizer. This fixes the 'Ist das eine Frage?' tokenization, but will have side effects on other strings.
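For example (a sketch; as said, keep the side effects on other strings in mind):

```python
from transformers import BertTokenizer

# Code-level workaround: skip the BasicTokenizer so punctuation is no longer
# split off before WordPiece runs.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-german-cased", do_basic_tokenize=False
)
print(tokenizer.tokenize("Ist das eine Frage?"))
# "Frage?" is now handled by WordPiece alone (yielding something like
# "Frage", "##?"), so no [UNK] here, but other inputs may tokenize
# differently than before.
```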
We discussed internally and believe a model side change of vocab.txt is actually a pretty clean solution:
@PhilipMay what are your thoughts on model side changes? I guess you already spent some time thinking about it, since you also discovered the bug.
When changing the vocab we should be careful and consider all situations affected by the new _run_split_on_punc function. What about "," and ";"? They are also punctuation marks.
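To get an overview of what would be affected, one could list every single-character symbol that only exists as a "##"-prefixed entry (a sketch; the unicodedata check is only similar in spirit to HF's punctuation handling, not identical to _run_split_on_punc):

```python
import unicodedata

# Sketch: find single-character punctuation/symbol entries that exist only as
# "##"-prefixed subwords in vocab.txt (path is a placeholder).
with open("vocab.txt", encoding="utf-8") as f:
    vocab = set(line.rstrip("\n") for line in f)

only_subword = sorted(
    entry for entry in vocab
    if entry.startswith("##")
    and len(entry) == 3
    and unicodedata.category(entry[2]).startswith(("P", "S"))
    and entry[2:] not in vocab
)
print(only_subword)
```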
A second thought: what about releasing a second model, maybe called bert-base-german-cased-punctation-split, that is basically the same but with a different vocab? This way you do not break anything in the old model and can use both model cards to document the differences and changes.
Hey @PhilipMay, totally agreed. Changing the vocab is not trivial. There are actually many symbols that need to change.
I uploaded a new vocab file here. You can test the new tokenization by renaming this file to "vocab.txt", placing it in a folder "bert-german-cased-test-punct", and calling
tokenizer = Tokenizer.load("path-to-folder/bert-german-cased-test-punct")
I briefly tested this vocab file and it produces much better tokenization. I will run some performance tests tomorrow.
I uploaded a gist showing how I transformed the vocab and used HF's _run_split_on_punc to find the problematic punctuation. The gist is quite hacky : )
I did performance checks with the new vocab on GermEval, since I expect the most tokenization issues there. The baseline numbers come from our blog article on German BERT:
GermEval 2018 Coarse: new vocab 0.750 (MLflow) vs. old vocab 0.747
GermEval 2018 Fine: new vocab 0.474 (MLflow) vs. old vocab 0.488
So the new vocab seems to perform similarly to the old one; there is always some variation between BERT runs, so it is hard to tell exactly.
It would be nice if you could give some feedback on the new vocab, @PhilipMay. I would then propose to change the vocab file on our server and document the change here and in HF.
@Timoeller that sounds reasonable. At the moment I cannot help with testing the vocab since I have other stuff to do. Sorry.
We changed the vocab file on S3. This change affects HF transformers immediately when people load the vocab remotely. Here is the link to the deprecated vocab file.
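Note for anyone who already has the old files in their local cache: you may need to force a re-download to pick up the new vocab, e.g. (force_download is a standard from_pretrained argument):

```python
from transformers import BertTokenizer

# Bypass a locally cached copy of the old vocab and fetch the updated file.
tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased", force_download=True)
```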
@danieldk provided a diff of both vocabs here
Changes in documentation: I updated the model card with huggingface/transformers/pull/3618.
English Bert tokenization does not seem to produce [UNK] tokens.
When comparing the vocab.txt files for German and English, I find that the German one does not have standalone "!" or "@" symbols, just "##!" and "##@", whereas the English vocab does contain them.
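A quick way to check this (a sketch; assumes both vocab.txt files have been downloaded locally, the paths are placeholders):

```python
# Sketch: compare standalone vs. "##"-prefixed presence of a few symbols
# in two locally downloaded vocab.txt files.
def check_vocab(path, symbols=("!", "@", "?", ".")):
    with open(path, encoding="utf-8") as f:
        vocab = set(line.rstrip("\n") for line in f)
    for s in symbols:
        print(f"{path}: '{s}' standalone={s in vocab}, subword={'##' + s in vocab}")

check_vocab("german/vocab.txt")   # e.g. the German BERT vocab
check_vocab("english/vocab.txt")  # e.g. the English BERT vocab
```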