It's clearly an unwanted model behaviour caused by the rather clean corpus used for vocab creation. We (and also others) learned from it :) However, this is not directly related to FARM, so I am closing this issue now.
Recommendation for others working with GermanBERT and data with many @ / ! (standalone or at the start of a word):
Simply add those tokens to your tokenizer via
tokenizer.add_tokens(["!", "@", ...])
The model still won't have pretrained embeddings for those characters, but it can then actually learn them during LM finetuning / downstream training.
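For example, a minimal sketch using the plain transformers API (model names and the token list are just placeholders, adapt them to your setup):

```python
from transformers import BertForMaskedLM, BertTokenizer

# Sketch: add missing standalone symbols and grow the embedding matrix so the
# new tokens can be learned during LM finetuning / downstream training.
tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
model = BertForMaskedLM.from_pretrained("bert-base-german-cased")

num_added = tokenizer.add_tokens(["!", "@"])   # extend as needed
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
```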
I am reopening this issue, since the conversion of normal symbols to [UNK] seems to be related to a change in the tokenization. Now symbols that were previously handled correctly, like the "?" in "Ist das eine Frage?", also get converted to [UNK].
This might be due to the BasicTokenizer. It used to be a whitespace tokenizer but is now splitting "Frage?" into "Frage" and "?" tokens. A potential fix would be to change the occurrences of ##? or ##. etc. in vocab.txt to standalone symbols and re-upload the files.
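Roughly, such a vocab.txt rewrite could look like this (only a sketch; the file paths and the symbol list are placeholders, not the actual change):

```python
# Sketch: replace "##"-prefixed single-character punctuation entries with
# standalone entries, keeping the line order (and thus the token ids) intact.
symbols_to_fix = {"?", "!", ".", ",", ";"}  # placeholder list

with open("vocab.txt", encoding="utf-8") as f:
    entries = [line.rstrip("\n") for line in f]

existing = set(entries)
fixed = []
for entry in entries:
    if entry.startswith("##") and entry[2:] in symbols_to_fix and entry[2:] not in existing:
        fixed.append(entry[2:])  # e.g. "##?" -> "?"
    else:
        fixed.append(entry)

with open("vocab_fixed.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(fixed) + "\n")
```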
Also see my little gist here which shows the problem: https://gist.github.com/PhilipMay/9ece696dc11d7d57fee3f2f67b591eb4
Same problem happens when you use FARM to tokenize btw.
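For reference, the behaviour can be reproduced with plain transformers as well (a small sketch along the lines of the gist, assuming the model name bert-base-german-cased):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
print(tokenizer.tokenize("Ist das eine Frage?"))
# The BasicTokenizer splits off "?", and since the vocab only contains "##?"
# (not a standalone "?"), the symbol ends up as [UNK].
```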
Another alternative fix would be to provide a tokenizer that works the same way as the tokenizer that was used to create this language model. IMO the tokenizer (and the whole preprocessing pipeline) should never change between language model creation, finetuning, training and production.
IMO this would be a better solution than "hacking" additional tokens into the vocab. The problem with ? and ! is just a symptom of an inconsistent tokenizer. No one knows the other side effects.
The tokenizer could be provided by:
For me, fixing the HF tokenizer by adding a backwards compatibility switch would be the best solution. This should be easy because the "old" code should still be available in Git.
What do you think?
Maybe someone (tm) could spot the "breaking" change on the HF side. Here is the link: https://github.com/huggingface/transformers/commits/b90745c5901809faef3136ed09a689e7d733526c/src/transformers/tokenization_bert.py
The more I think about changing the vocab.txt file by removing "##" from punctuation, the better this solution becomes, because:
Possible downsides I see:
We also looked into the tokenization issue on the HF side. This is the problematic line where the punctuation is split off from the text. A code-level fix is to disable basic tokenization: you can set do_basic_tokenize=False when loading the tokenizer. This fixes the 'Ist das eine Frage?' tokenization, but will have side effects on other strings.
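For example (a sketch; as said, keep the side effects on other strings in mind):

```python
from transformers import BertTokenizer

# Code-level workaround: skip the BasicTokenizer so punctuation is no longer
# split off before WordPiece runs.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-german-cased", do_basic_tokenize=False
)
print(tokenizer.tokenize("Ist das eine Frage?"))
# "Frage?" is now handled by WordPiece alone (yielding something like
# "Frage", "##?"), so no [UNK] here, but other inputs may tokenize
# differently than before.
```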
We discussed internally and believe a model side change of vocab.txt is actually a pretty clean solution:
@PhilipMay what are your thoughts on model side changes? I guess you already spent some time thinking about it, since you also discovered the bug.
When changing the vocab we should be careful and consider all situations affected by the new _run_split_on_punc function. What about "," and ";"? They are also punctuation marks.
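To get an overview of what would be affected, one could list every single-character symbol that only exists as a "##"-prefixed entry (a sketch; the unicodedata check is only similar in spirit to HF's punctuation handling, not identical to _run_split_on_punc):

```python
import unicodedata

# Sketch: find single-character punctuation/symbol entries that exist only as
# "##"-prefixed subwords in vocab.txt (path is a placeholder).
with open("vocab.txt", encoding="utf-8") as f:
    vocab = set(line.rstrip("\n") for line in f)

only_subword = sorted(
    entry for entry in vocab
    if entry.startswith("##")
    and len(entry) == 3
    and unicodedata.category(entry[2]).startswith(("P", "S"))
    and entry[2:] not in vocab
)
print(only_subword)
```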
A second thought: what about releasing a second model, maybe called bert-base-german-cased-punctation-split, that is basically the same but with a different vocab? This way you do not break anything in the old model and can use both model cards to document the differences and changes.
Hey @PhilipMay, totally agreed. Changing the vocab is not trivial. There are actually many symbols that need to change.
I uploaded a new vocab file here. You can test the new tokenization by renaming this file to "vocab.txt", placing it in a folder "bert-german-cased-test-punct", and calling
tokenizer = Tokenizer.load("path-to-folder/bert-german-cased-test-punct")
I briefly tested this vocab file and it produces much better tokenization. I will run some performance tests tomorrow.
I uploaded a gist showing how I transformed the vocab and used HF's _run_split_on_punc to find the problematic punctuation. The gist is quite hacky : )
I did performance checks with the new vocab on GermEval, since I expect the most tokenization issues there. The baseline numbers come from our blog article on German BERT:
GermEval 2018 Coarse: new vocab 0.750 (MLflow) vs. old vocab 0.747
GermEval 2018 Fine: new vocab 0.474 (MLflow) vs. old vocab 0.488
So the new vocab seems to perform similarly to the old one; there is always some variation between BERT runs, so it is hard to tell exactly.
It would be nice if you could give some feedback on the new vocab, @PhilipMay. I would then propose to change the vocab file on our server and document the change here and in HF.
@Timoeller that sounds reasonable. At the moment I cannot help with testing the vocab since I have other stuff to do. Sorry.
We changed the vocab file on S3. This change affects HF transformers immediately when people load the vocab remotely. Here is the link to the deprecated vocab file.
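Note for anyone who already has the old files in their local cache: you may need to force a re-download to pick up the new vocab, e.g. (force_download is a standard from_pretrained argument):

```python
from transformers import BertTokenizer

# Bypass a locally cached copy of the old vocab and fetch the updated file.
tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased", force_download=True)
```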
@danieldk provided a diff of both vocabs here
Changes in documentation: I updated the model card with huggingface/transformers/pull/3618.
English Bert tokenization does not seem to produce [UNK] tokens.
When comparing the vocab.txt files for German and English, I find that the German one does not have standalone "!" or "@" symbols, just "##!" and "##@", whereas the English vocab does contain them.
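A quick way to check this (a sketch; assumes both vocab.txt files have been downloaded locally, the paths are placeholders):

```python
# Sketch: compare standalone vs. "##"-prefixed presence of a few symbols
# in two locally downloaded vocab.txt files.
def check_vocab(path, symbols=("!", "@", "?", ".")):
    with open(path, encoding="utf-8") as f:
        vocab = set(line.rstrip("\n") for line in f)
    for s in symbols:
        print(f"{path}: '{s}' standalone={s in vocab}, subword={'##' + s in vocab}")

check_vocab("german/vocab.txt")   # e.g. the German BERT vocab
check_vocab("english/vocab.txt")  # e.g. the English BERT vocab
```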