tokenization only feature

bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)

MIT License

1.18k stars, 101 forks

During initialization, two models are downloaded: the BPE model (bpe_model) and the w2v model.

bpemb_en = BPEmb(lang="en", dim=50)

In some cases only the tokenization is needed, not the w2v model. For example, when a text classifier is trained together with its own embeddings, the w2v model in bpemb is not required, but tokenization is still required during inference. Is there a way to initialize bpemb so that it can be used only for the encode method, without downloading or loading the vectors?

It could be something like the following, where load_vectors is a proposed parameter:

bpemb_en = BPEmb(lang="en", dim=50, load_vectors=False)
bpemb_en.encode("There you go")

segmentation_only: ``bool``, optional (default = False)
If set to True, only the SentencePiece subword segmentation model will be loaded. Use this flag if you do not need the subword embeddings.
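
A minimal sketch of how the segmentation_only flag might be used, assuming the installed bpemb version accepts it as a constructor keyword, as the docstring suggests. The helper name load_segmenter and its factory parameter are hypothetical and not part of bpemb; the factory exists only so the helper can be tried with a stand-in class instead of downloading a model.

```python
def load_segmenter(lang="en", vs=10000, factory=None):
    """Build a BPEmb-like object that loads only the segmentation model.

    `factory` is a hypothetical hook (not part of bpemb): any callable that
    accepts the same keyword arguments as BPEmb.
    """
    if factory is None:
        # Imported lazily so the helper also works with a stand-in factory
        # when bpemb is not installed.
        from bpemb import BPEmb
        factory = BPEmb
    # No `dim` is passed: with segmentation_only=True no embedding vectors
    # are requested, only the SentencePiece model.
    return factory(lang=lang, vs=vs, segmentation_only=True)


# Usage (fetches the SentencePiece model the first time it is called):
#   bpemb_en = load_segmenter("en")
#   pieces = bpemb_en.encode("There you go")
```

With this approach the encode call from the question stays the same; only the constructor arguments change.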

tokenization only feature #28