deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Dynamic token IDs for _mask_random_words function in BertStyleLMProcessor #804

Closed felixvor closed 3 years ago

felixvor commented 3 years ago

Token IDs were previously hardcoded, so only the default bert-base vocabulary was compatible. If a model's vocab.txt placed the PAD, SEP, or CLS tokens at different positions, the processor excluded the wrong tokens from whole-word masking; if MASK was at a different position, the processor crashed.

Token IDs are now looked up in the vocabulary directly instead of being hardcoded. This should make any model compatible, regardless of where SEP, CLS, PAD, and MASK appear in its vocab.txt.
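
To illustrate the idea, a minimal sketch (not FARM's actual `_mask_random_words` in `BertStyleLMProcessor`, which also handles whole-word masking; function and variable names here are illustrative only): the special-token IDs are resolved from the tokenizer at runtime rather than assumed to sit at the default bert-base positions.

```python
import random

from transformers import BertTokenizer


def mask_random_words(token_ids, tokenizer, mask_prob=0.15):
    """Mask random tokens for MLM; special tokens are never masked."""
    # Resolved from the tokenizer, so any vocab.txt layout works:
    special_ids = {
        tokenizer.pad_token_id,
        tokenizer.cls_token_id,
        tokenizer.sep_token_id,
    }
    mask_id = tokenizer.mask_token_id  # looked up, not hardcoded

    masked_ids, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and random.random() < mask_prob:
            masked_ids.append(mask_id)
            labels.append(tid)    # predict the original token here
        else:
            masked_ids.append(tid)
            labels.append(-100)   # ignored by the MLM loss
    return masked_ids, labels


tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
ids = tokenizer.encode("Berlin ist eine schöne Stadt", add_special_tokens=True)
print(mask_random_words(ids, tokenizer))
```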

Related Issue: https://github.com/deepset-ai/FARM/issues/800

felixvor commented 3 years ago

Oh yes, thanks, I overlooked that! For now I only started the training process for testing and saw that the processor was returning valid data and not crashing. I wasn't able to finish a training run and check the results yet. Any other ideas for testing this that could make sense?

julian-risch commented 3 years ago

That's good! I did some local testing with FARM/test/test_lm_finetuning.py and changed the language model there to bert-base-german-cased. Works. I found out that the token "I" has index 103 in the bert-base-german-cased vocabulary (103 is the [MASK] ID in the default bert-base vocabulary). So I also tested with this token by adding it to FARM/test/samples/lm_finetuning/train-sample.txt. It works with your changes but not without them. Great! 👍 Ready to merge. Let us know how the finetuned model performs!
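
For reference, that check can be reproduced with a couple of lines (a hypothetical snippet using the Hugging Face transformers API, not part of the test suite):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-german-cased")
print(tok.convert_ids_to_tokens(103))       # -> "I" (per the comment above)
print(tok.convert_tokens_to_ids("[MASK]"))  # the model's actual [MASK] ID, not 103
```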