bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Adding support for own models #38

Closed · stephantul closed 4 years ago

stephantul commented 4 years ago

Hi,

First of all, thanks for the great package. Currently, the only way to use my own models with bpemb is to first load another model and then manually assign the .spm and .emb attributes. This is a bit unwieldy.
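
For reference, the workaround looks roughly like this (a minimal sketch; file names are placeholders):

    from gensim.models import KeyedVectors
    import sentencepiece

    from bpemb import BPEmb

    # Load a stock model first, then overwrite its attributes
    # with a custom SentencePiece model and custom embeddings.
    bpemb = BPEmb(lang="en", dim=100)

    custom_spm = sentencepiece.SentencePieceProcessor()
    custom_spm.Load("my_model.model")  # placeholder path
    bpemb.spm = custom_spm

    bpemb.emb = KeyedVectors.load_word2vec_format("my_vectors.txt")  # placeholder path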

I am interested in adding a subclass of BPEmb that overrides BPEmb's __init__ and simply accepts paths to a SentencePiece model and an embedding file, from which the other attributes (e.g. size/vs) are derived. A rough sketch of the idea is below. Is this something you would accept as a PR? Do you see any problems with this approach?
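
A hypothetical sketch of such a subclass (it deliberately bypasses BPEmb.__init__, which normally downloads pretrained files, and derives vs/dim from the loaded files):

    from pathlib import Path

    from gensim.models import KeyedVectors
    import sentencepiece

    from bpemb import BPEmb

    class CustomBPEmb(BPEmb):
        """Hypothetical subclass that accepts explicit file paths."""

        def __init__(self, model_file: Path, emb_file: Path):
            # Deliberately skip BPEmb.__init__, which would try to
            # download pretrained files.
            self.spm = sentencepiece.SentencePieceProcessor()
            self.spm.Load(str(model_file))
            self.emb = KeyedVectors.load_word2vec_format(str(emb_file))
            # Derive the remaining attributes from the loaded files.
            self.vs = self.spm.GetPieceSize()
            self.dim = self.emb.vector_size
            # NB: other BPEmb methods may rely on attributes not set here.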

Thanks! Stéphan

bheinzerling commented 4 years ago

Hi,

Thanks for this suggestion! I've added two arguments to BPEmb.__init__:

    model_file: ``Path``, optional (default = None)
        Path to a custom SentencePiece model file.
    emb_file: ``Path``, optional (default = None)
        Path to a custom embedding file. Supported formats are Word2Vec
        plain text and Gensim binary.

Can you check out the latest commit and let me know if this feature works for you? If so, I'll update the PyPI package as well.
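
Usage would then look something like this (a sketch; file names are placeholders, and omitting all other arguments is an assumption):

    from bpemb import BPEmb

    # Load a custom SentencePiece model and embedding file
    # via the new arguments. File names are placeholders.
    bpemb = BPEmb(model_file="my_model.model", emb_file="my_vectors.txt")

    print(bpemb.encode("hello world"))  # subword segmentation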

stephantul commented 4 years ago

Yep, seems to work! I ran some tests and everything gives the correct results. Thanks for the swift reply!

jameschartouni commented 4 years ago

Can you add a .spm vocabulary to enlarge BPEmb's multilingual model? For instance, can you add Lebanese vocabulary in addition to the already available MSA, Egyptian, and Aramaic?