Closed: felixvor closed this issue 3 years ago
This is because the [MASK] token is on line 6 (index 5) of vocab.txt for bert-base-german-cased, and not at index 103 as in bert-base-cased. Maybe I can look into creating a PR for this.
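For context, a quick way to see the mismatch (just a sketch, assuming the Hugging Face transformers package is installed; FARM itself may load tokenizers through its own wrapper):

from transformers import BertTokenizer

for name in ["bert-base-cased", "bert-base-german-cased"]:
    tokenizer = BertTokenizer.from_pretrained(name)
    # mask_token_id is the row of [MASK] in the model's vocab.txt
    print(name, tokenizer.mask_token_id)
# bert-base-cased prints 103; bert-base-german-cased should print 5, as described above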
Quick question, line 1710+ of processor.py:
# 1. Combine tokens to one group (e.g. all subtokens of a word)
cand_indices = []
for (i, token) in enumerate(tokens):
    if token == 101 or token == 102 or token == 0:
        continue
Would 101 and 102 be [SEP] and [CLS]?
Hi @DieseKartoffel yes, those are the special tokens, just the other way around: 101 is the [CLS] token and 102 is the [SEP] token (and 0 is [PAD]). I quickly checked that here.
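For anyone reading along, one way to double-check this mapping yourself (assuming the transformers package is available):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.convert_ids_to_tokens([0, 101, 102, 103]))
# ['[PAD]', '[CLS]', '[SEP]', '[MASK]']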
"Maybe I can look into creating a PR for this." That would be great! Please let me know how it goes and if you would like to discuss any of the steps. You are correct that the different vocabularies and the indices of the special tokens 101, 102 and 103 are causing the problem.
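One possible direction for such a PR, shown only as a minimal sketch and not as the actual FARM implementation: derive the special token ids from the tokenizer's own vocab instead of hard-coding 0, 101, 102 and 103. This assumes the processing code has access to the loaded tokenizer and that it exposes a token-to-id mapping, as the Hugging Face BertTokenizer does via tokenizer.vocab:

# sketch: `tokenizer` and `tokens` come from the surrounding processor code
special_ids = {tokenizer.vocab[t] for t in ("[CLS]", "[SEP]", "[PAD]")}
mask_id = tokenizer.vocab["[MASK]"]

# replaces the hard-coded `assert 103 not in tokens`
assert mask_id not in tokens  # mask token

# 1. Combine tokens to one group (e.g. all subtokens of a word)
cand_indices = []
for (i, token) in enumerate(tokens):
    if token in special_ids:
        continue

That way the same check would work for bert-base-german-cased and any other vocabulary, but where exactly the lookup should live depends on where the tokenizer is available in processor.py.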
@DieseKartoffel Thank you for your contribution to FARM! 👍 Your changes are merged now.
Describe the bug
I am trying to finetune a German BERT model with my own text corpus. When I attempt to load the data, the process crashes at
assert 103 not in tokens  # mask token
I was able to work around the issue by using "bert-base-cased" as the base model. However, I need to finetune "bert-base-german-cased". I tried changing other parameters but found that the model name is the cause of the problem, as I can reproduce the same error when using lm_finetuning.py from the FARM examples and only changing the model name there.
Error message
Expected behavior
It should be possible to use Processors and DataSilos with any supported BERT model. As long as the data is in the correct format, it should load and training should start.
To Reproduce
Simply use lm_finetuning.py from the examples folder. Replace line 36
lang_model = "bert-base-cased"
with
lang_model = "bert-base-german-cased"
and run. Maybe I am missing additional configuration that is needed to finetune German BERT? I hope you can help :)
System: