bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License
1.18k stars 101 forks

Truecase supported. #58

Closed BrightXiaoHan closed 1 year ago

BrightXiaoHan commented 3 years ago

I'm working on a machine translation task. When I encode my corpus with bpemb, the output is always lowercase. Is it possible to retain case information after encoding my corpus?

bheinzerling commented 1 year ago

Somehow missed this issue, but no: all embeddings are uncased (lowercase only).
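
Since the embeddings are lowercase only, one possible workaround (not part of bpemb and not suggested in this thread) is to record the casing of each word before encoding and reapply it afterwards. The sketch below is a minimal illustration in plain Python; the helper names `case_tag`, `split_case`, and `restore_case` are hypothetical, and it assumes whitespace tokenization and that only all-lowercase, Title-case, and ALL-CAPS words need to be recovered.

```python
from __future__ import annotations


def case_tag(word: str) -> str:
    """Classify a word's casing as 'upper', 'title', or 'lower'."""
    if len(word) > 1 and word.isupper():
        return "upper"
    if word[:1].isupper():
        return "title"
    return "lower"


def split_case(text: str) -> tuple[str, list[str]]:
    """Return the lowercased text plus one case tag per whitespace-separated word."""
    words = text.split()
    lowered = " ".join(w.lower() for w in words)
    return lowered, [case_tag(w) for w in words]


def restore_case(lowered: str, tags: list[str]) -> str:
    """Reapply the recorded case tags to a lowercased string, word by word."""
    restored = []
    for word, tag in zip(lowered.split(), tags):
        if tag == "upper":
            restored.append(word.upper())
        elif tag == "title":
            restored.append(word[:1].upper() + word[1:])
        else:
            restored.append(word)
    return " ".join(restored)


# The lowercased string is what would be handed to the subword encoder;
# the tags travel alongside it as a separate feature (an assumption of this sketch).
lowered, tags = split_case("Stratford is in the UK")
print(lowered, tags)
print(restore_case(lowered, tags))
```

This scheme is lossy for mixed-case words such as "iPhone", which are tagged as lowercase. In a translation setting the tags would also have to be aligned with, or predicted for, the target-side words, which this sketch does not attempt.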