facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Byte-Level Subwords for mBART #1916

Closed: Bachstelze closed this issue 2 years ago

Bachstelze commented 4 years ago

🚀 Feature Request for universal subwords with mBART

Thanks for releasing the BBPE code! Recent mBART results point toward a universal language space, but its vocabulary is limited to the language families seen during training. Do you think BART is compatible with the work of @kahne, @kyunghyuncho and @MultiPath on Neural Machine Translation with Byte-Level Subwords? The transformer architecture is mostly the same, but the byte embeddings would need to be contextualized, and the decoded byte sequences recovered into valid character sequences, as sketched below.
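
A minimal sketch of that recovery step (not fairseq's implementation; the helper name and the lenient error handling are assumptions, and the BBPE paper instead recovers the longest valid character subsequence via dynamic programming):

```python
# Minimal sketch of recovering a character sequence from decoder output.
# Assumption: the decoder emits raw byte values (ints in 0..255); invalid
# or partial UTF-8 fragments are simply dropped here, whereas the BBPE
# paper keeps the maximal valid subsequence.

def bytes_to_chars(byte_ids):
    return bytes(byte_ids).decode("utf-8", errors="ignore")

# Two well-formed Cyrillic characters, as they appear in Abkhaz text:
print(bytes_to_chars([0xD0, 0x90, 0xD2, 0xA7]))  # -> "Аҧ"
# A dangling lead byte at the end is dropped instead of raising:
print(bytes_to_chars([0xD0, 0x90, 0xD2]))        # -> "А"
```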

The byte-level subword vocabulary is more compact and improves the quality of multilingual translation. With BBPE, fine-tuning on new languages can build on a vocabulary that already covers every script, as illustrated below. Or do you think it is feasible to obtain a universal vocabulary with plain BPE by including all language families in the preprocessing?
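
To illustrate the coverage point: the byte-level base vocabulary is just the 256 byte values, so any UTF-8 text decomposes into known symbols (a minimal sketch; the "<XX>" token format is invented for illustration):

```python
# Minimal sketch: the byte-level base vocabulary consists of the 256 byte
# values, so text in any script (here Abkhaz, unseen during mBART training)
# decomposes into known symbols and can never be replaced by <unk>.
# The "<XX>" token format below is invented for illustration.

text = "Аҧсуа бызшәа"  # "Abkhaz language" in Abkhaz
byte_tokens = [f"<{b:02X}>" for b in text.encode("utf-8")]
print(byte_tokens[:6])  # ['<D0>', '<90>', '<D2>', '<A7>', '<D1>', '<81>']
```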

Using mBART on unseen languages (or language families) such as Abkhazian currently yields a substantial share of unknown tokens:

2020-03-21 23:03:53 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] train.bpe.ab-ru.ab: 21845 sents, 1513056 tokens, 9.97% replaced by <unk>
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] valid.bpe.ab-ru.ab: 186 sents, 13200 tokens, 10.0% replaced by <unk>
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:02 | INFO | fairseq_cli.preprocess | [ab] test.bpe.ab-ru.ab: 31 sents, 2266 tokens, 8.16% replaced by <unk>

Preprocessing with the XLM-R vocabulary (which covers 100 languages) gives roughly the same result:

fairseq_cli.preprocess | [ab] Dictionary: 250000 types
fairseq_cli.preprocess | [ab] train.bpe.ab-ru.ab: 22255 sents, 1648058 tokens, 9.97% replaced by <unk>
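
For reference, the "replaced by <unk>" percentage can be reproduced outside fairseq by counting tokens that fall outside the fixed 250k-type dictionary (a minimal sketch; the file paths and helper name are hypothetical):

```python
# Minimal sketch: count the share of tokens missing from a fairseq
# dictionary file (lines of "token count"), mirroring the
# "replaced by <unk>" statistic printed by fairseq_cli.preprocess.
# File paths below are hypothetical.

def unk_rate(corpus_path: str, dict_path: str) -> float:
    with open(dict_path, encoding="utf-8") as f:
        vocab = {line.split()[0] for line in f if line.strip()}
    total = unk = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for tok in line.split():
                total += 1
                unk += tok not in vocab
    return 100.0 * unk / max(total, 1)

print(f"{unk_rate('train.bpe.ab-ru.ab', 'dict.txt'):.2f}% replaced by <unk>")
```
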
stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!