bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License
1.18k stars 101 forks

multilingual text #19

Closed: rohitsaluja22 closed this issue 5 years ago

rohitsaluja22 commented 5 years ago

Hey, thanks for sharing the code. I am working on multilingual text. Can I pass more than one language when segmenting words/sentences?

bheinzerling commented 5 years ago

All sentencepiece models were trained separately on monolingual corpora, so applying them to multilingual text probably won't give you good results unless the languages are very similar.
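For illustration, a minimal sketch of what this looks like in practice, using a monolingual English model on mixed-language input (the vocabulary size and dimension below are one of the published combinations, chosen here as an example):

```python
from bpemb import BPEmb

# English BPE model: 25k-merge vocabulary, 100-dim embeddings.
bpemb_en = BPEmb(lang="en", vs=25000, dim=100)

# English input segments into sensible subwords ...
print(bpemb_en.encode("Stratford"))
# ... while a German word is likely over-segmented into many small pieces,
# since the merge operations were learned from English text only.
print(bpemb_en.encode("Überraschung"))
```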

If you have access to GPUs, you could give multilingual BERT (or its pytorch version) a try.
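For reference, a minimal sketch of segmenting mixed-language text with multilingual BERT's tokenizer; the transformers package used here is an assumption on my part (this thread predates it and pointed to the pytorch-pretrained-bert repo instead):

```python
# Assumes the Hugging Face transformers package, the successor to
# the pytorch-pretrained-bert repo mentioned above.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# A single shared WordPiece vocabulary covers 104 languages, so one
# model handles mixed-language input.
print(tok.tokenize("This is English und das ist Deutsch."))
```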

bheinzerling commented 5 years ago

This is now possible with the new multilingual models: https://nlp.h-its.org/bpemb/multi/
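Loading one of them with the bpemb package looks like this; the vocabulary size and dimension follow the examples on the linked page, and other combinations may be available:

```python
from bpemb import BPEmb

# A single BPE model and embeddings trained jointly on all 275 languages.
bpemb_multi = BPEmb(lang="multi", vs=1000000, dim=300)

# Mixed-language input is segmented by the same model.
print(bpemb_multi.encode("Hello दुनिया 世界"))
```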