Closed · felixvor closed this 3 years ago
Oh yes, thanks, I overlooked that! For now I only started the training process as a test and saw that the processor was returning valid data and not crashing. I wasn't able to finish a training run and check the results yet. Any other ideas for testing this that could make sense?
That's good! I did some local testing with FARM/test/test_lm_finetuning.py and changed the language model there to bert-base-german-cased. Works. I found out that the token "I" has index 103 in the bert-base-german-cased vocabulary, so I also tested with this token by adding it to FARM/test/samples/lm_finetuning/train-sample.txt. It works with your changes but does not without them. Great! 👍 Ready to merge. Let us know how the finetuned model performs!
Token IDs were previously hardcoded, so only the default config of bert-base was compatible. If a model's vocab.txt had the PAD, SEP, and CLS tokens at different positions, the processor would ignore the wrong tokens during whole-word masking. If MASK was at the wrong position, the processor would crash.
Token IDs are now looked up in the vocabulary directly and are no longer hardcoded. This should make any model compatible that has SEP, CLS, PAD, and MASK anywhere in its vocab.txt.
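A minimal sketch of the idea (not the actual FARM code, and the vocab below is hypothetical): instead of assuming the bert-base special-token positions, resolve each special token's ID from the model's own vocabulary.

```python
# Hypothetical vocab where the special tokens are NOT at bert-base's
# default positions. Note "I" sits at index 103, which is where
# bert-base-uncased keeps "[MASK]".
vocab = {"[PAD]": 1, "[UNK]": 2, "[CLS]": 3, "[SEP]": 4, "[MASK]": 5, "I": 103}

# The old behavior: a hardcoded ID taken from the bert-base default config.
HARDCODED_MASK_ID = 103  # wrong for this vocab; it points at the token "I"

# The fix: grab the IDs from the vocabulary directly.
special_ids = {tok: vocab[tok] for tok in ("[PAD]", "[CLS]", "[SEP]", "[MASK]")}

assert special_ids["[MASK]"] == 5            # correct ID for this model
assert HARDCODED_MASK_ID == vocab["I"]       # the hardcoded value would hit "I"
```

This is also why adding "I" to the test sample exposed the bug: with the hardcoded ID, the processor treated that ordinary token as a special token.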
Related Issue: https://github.com/deepset-ai/FARM/issues/800