For a setup as described in MAD-X, you would train a Bottleneck Adapter with an invertible adapter on a language modeling task like MLM on unlabeled text. To ensure that the model can embed the tokens well, it makes sense to use a model whose vocabulary is intended for multilingual use cases, such as bert-base-multilingual-cased. You can check out the run_mlm.py example script to see how MLM training is done with adapters.
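A minimal sketch of that setup, assuming the adapter-transformers API (the import path and config name may differ in newer releases of the `adapters` package); the adapter name `"my_lang"` is just a placeholder:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers.adapters import PfeifferInvConfig  # bottleneck adapter + invertible adapter

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Bottleneck adapter with an invertible adapter at the embeddings,
# as used for MAD-X language adapters.
model.add_adapter("my_lang", config=PfeifferInvConfig())

# Freeze the pretrained weights and train only the adapter parameters.
model.train_adapter("my_lang")
model.set_active_adapters("my_lang")

# ...then pass `model` into the run_mlm.py training loop
# (MLM on unlabeled text in the target language).
```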
I understand that it makes sense to use mBERT, but what about languages whose tokens are not covered by the mBERT tokenizer?
This issue has been automatically marked as stale because it has been without activity for 90 days. This issue will be closed in 14 days unless you comment or remove the stale label.
This issue was closed because it was stale for 14 days without any activity.
For unseen tokens of a seen script (e.g. Latin), the tokenizer has a character-level fallback option. For unseen scripts, you can learn an entirely new tokenizer. See https://aclanthology.org/2021.emnlp-main.800/
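One possible way to realize the new-tokenizer route with standard Hugging Face APIs is sketched below; the corpus file and vocabulary size are placeholders, and the linked paper explores more careful embedding initializations than the plain resize shown here:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

base = "bert-base-multilingual-cased"
old_tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Train a new subword vocabulary on raw target-language text
# (any iterator over strings works here).
corpus_iterator = (line for line in open("target_lang.txt", encoding="utf-8"))
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator, vocab_size=30000)

# Resize the embedding matrix to the new vocabulary size; the embeddings
# then need to be (re)trained during MLM alongside the language adapter,
# since the new token ids no longer line up with the old vocabulary.
model.resize_token_embeddings(len(new_tokenizer))
```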
I have a general question about how to train language adapters for languages not seen by the model, which the MAD-X paper says can be done through AdapterHub. I was curious about this because there is no place that documents how to do it. Specifically, how does the model handle the fact that the tokens of the new language won't be part of the existing model's vocabulary, and hence their token embeddings won't exist by default?