bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

tokenization only feature #28

Closed. trideeprath closed this issue 3 years ago.

trideeprath commented 5 years ago

During initialization, two models are downloaded: the BPE model and the w2v model.

bpemb_en = BPEmb(lang="en", dim=50)

In some cases, only the tokenization is needed, not the w2v model. For example, when training a text classifier that learns its own embeddings, the w2v model in bpemb is not required, but tokenization is still needed during inference. Is there a way to initialize bpemb so that only the encode method is usable, without having to download/load the vectors?

It could be something like the following:

bpemb_en = BPEmb(lang="en", dim=50, load_vectors=False)
bpemb_en.encode("There you go")
bheinzerling commented 3 years ago

Thanks for this suggestion! I added the following argument a while ago but for some reason didn't reply here:

    segmentation_only: ``bool``, optional (default = False)
        If set to True, only the SentencePiece subword segmentation
        model will be loaded. Use this flag if you do not need the
        subword embeddings.

So you can load only the BPE model, like this:

bpemb_en = BPEmb(lang="en", dim=50, segmentation_only=True)
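For reference, a minimal end-to-end sketch of segmentation-only usage, assuming only the flag above plus the library's encode and encode_ids methods; the subword pieces shown in the comments are illustrative:

from bpemb import BPEmb

# Load only the SentencePiece segmentation model; no embedding file
# is downloaded or loaded.
bpemb_en = BPEmb(lang="en", dim=50, segmentation_only=True)

bpemb_en.encode("There you go")      # subword pieces, e.g. ['▁there', '▁you', '▁go']
bpemb_en.encode_ids("There you go")  # the corresponding subword ids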