Bachstelze closed this issue 2 years ago.
🚀 Feature Request for universal subwords with mBART
Thanks for releasing the BBPE code! Recent results with mBART point toward a universal language space, but its vocabulary is limited to the language families seen during training. Do you think that BART is compatible with the work of @kahne, @kyunghyuncho and @MultiPath on Neural Machine Translation with Byte-Level Subwords? The transformer architecture is mostly the same, but the embeddings would need to be contextualized and recovered into valid character sequences during decoding.
A byte-level subword vocabulary is more compact and improves the quality of multilingual translation. With BBPE, fine-tuning on new languages can build on a complete vocabulary. Or do you think it is feasible to obtain a universal vocabulary with plain BPE by including all language families in the preprocessing?
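To make the "complete vocabulary" point concrete, here is a minimal sketch (not from the BBPE code itself) of why a byte-level base vocabulary can never run out of symbols: the 256 possible byte values cover every script, so even text in an unseen language maps to known base symbols. The helper names are my own for illustration.

```python
# Illustrative sketch: a byte-level base vocabulary covers any script.
# With BBPE the 256 possible byte values form the initial symbols, so no
# input can fall outside the vocabulary (unlike a subword vocabulary
# learned only on the language families seen during training).

def to_byte_symbols(text: str) -> list:
    """Map text to its UTF-8 byte sequence, the BBPE base symbols."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols: list) -> str:
    """Recover a character string from byte symbols (decoding step)."""
    # errors="replace" guards against partial multi-byte sequences,
    # mirroring the need to recover valid character sequences at decode time.
    return bytes(symbols).decode("utf-8", errors="replace")

# Abkhazian text (Cyrillic script) maps to bytes without any unknowns:
abkhaz = "Аԥсуа бызшәа"  # "Abkhaz language"
symbols = to_byte_symbols(abkhaz)
assert all(0 <= s < 256 for s in symbols)
assert from_byte_symbols(symbols) == abkhaz
```

The decode step is where the "recovered during decoding to valid character sequences" concern above shows up: a generated byte sequence is only meaningful once it reassembles into well-formed UTF-8.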
Using mBART for unseen languages (or language families) such as Abkhazian currently produces a large number of unknown tokens:
Preprocessing with XLM-R (trained on 100 languages) gives much the same result:
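A toy illustration of the unknown-token problem described above, with a hypothetical vocabulary and token strings of my own (a real mBART/XLM-R vocabulary is far larger, but the failure mode for an uncovered script is the same):

```python
# Hypothetical sketch: a subword vocabulary learned on a fixed set of
# languages maps segments of an unseen script to the unknown token,
# losing the input entirely.

def lookup(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with the unknown symbol."""
    return [t if t in vocab else unk for t in tokens]

# Toy vocabulary with no Cyrillic coverage (stand-in for a subword
# vocabulary that never saw Abkhazian's script during training):
vocab = {"▁the", "▁language", "▁Ab", "kh", "az"}

print(lookup(["▁Аԥсуа", "▁бызшәа"], vocab))  # ['<unk>', '<unk>']
print(lookup(["▁Ab", "kh", "az"], vocab))    # ['▁Ab', 'kh', 'az']
```

With a byte-level vocabulary the first lookup could not fail, since every byte of the Cyrillic input is already a base symbol.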