adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml
Apache License 2.0

Question regarding training language adapters for unseen languages #420

Closed sumit-agrwl closed 1 year ago

sumit-agrwl commented 2 years ago

I have a general question about how you train language adapters for languages not seen by the model, which can apparently be done through AdapterHub as mentioned in the MAD-X paper. I was curious because I could not find any documentation describing how to do this. Specifically, how does the model handle the fact that tokens of the new language won't be part of the existing model's vocabulary, so their token embeddings won't exist by default?

hSterz commented 2 years ago

For a setup as described in MAD-X, you would train a bottleneck adapter together with an invertible adapter on a language modeling objective such as MLM on unlabeled text. To ensure the model can embed the tokens well, it makes sense to start from a model with a vocabulary intended for multilingual use, such as bert-base-multilingual-cased. You can check out the run_mlm.py example script to see how MLM training with adapters is done.
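In code, the setup looks roughly like the sketch below (a minimal, non-authoritative example; the config class name `SeqBnInvConfig` and the adapter name `"my_lang"` are assumptions standing in for the MAD-X-style bottleneck + invertible configuration):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import adapters
from adapters import SeqBnInvConfig  # bottleneck + invertible adapter (MAD-X style)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Enable adapter support on the plain transformers model
adapters.init(model)

# Add a new language adapter with invertible layers and activate it for training;
# train_adapter() freezes the base model so only the adapter weights are updated
model.add_adapter("my_lang", config=SeqBnInvConfig())
model.train_adapter("my_lang")

# ... then run standard MLM training (e.g. via the run_mlm.py example script or a
# Trainer) on unlabeled text in the target language.
```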

sumit-agrwl commented 2 years ago

I understand that it makes sense to use mBERT, but what about languages with tokens that are not covered by the mBERT tokenizer?

adapter-hub-bert commented 1 year ago

This issue has been automatically marked as stale because it has been without activity for 90 days. This issue will be closed in 14 days unless you comment or remove the stale label.

adapter-hub-bert commented 1 year ago

This issue was closed because it was stale for 14 days without any activity.

JoPfeiff commented 1 year ago

For unseen tokens of a seen script (e.g. Latin), the tokenizer has a character-level fallback. For unseen scripts, you can learn an entirely new tokenizer. See https://aclanthology.org/2021.emnlp-main.800/
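For the unseen-script case, one possible recipe is sketched below (a rough illustration, not the exact procedure from the paper; the corpus file `target_lang.txt` and the vocabulary size are placeholders): learn a new tokenizer on target-language text, resize and re-initialize the embedding matrix, and then train the embeddings together with the language adapter via MLM.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

base = "bert-base-multilingual-cased"
old_tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder: an iterator over raw text in the target language
def target_corpus():
    with open("target_lang.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# 1) Learn a new subword vocabulary using the same algorithm as the base tokenizer
new_tokenizer = old_tokenizer.train_new_from_iterator(target_corpus(), vocab_size=30_000)

# 2) Resize the (tied) embedding matrix and LM head to the new vocabulary size and
#    re-initialize the input embeddings, since the new token ids no longer line up
#    with the original mBERT vocabulary
model.resize_token_embeddings(len(new_tokenizer))
embeddings = model.get_input_embeddings()
torch.nn.init.normal_(embeddings.weight, std=model.config.initializer_range)
model.tie_weights()

# 3) Train the new embeddings together with a language adapter via MLM on
#    target-language text, keeping the rest of the transformer frozen.
#    Note: if train_adapter() is used, the embedding parameters may need to be
#    un-frozen explicitly (requires_grad = True) so they are updated as well.
```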