bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

Continue training #36

Closed ericlingit closed 4 years ago

ericlingit commented 4 years ago

Is it possible to continue training with your pre-trained models?

This page states that

BPEmb objects wrap a gensim KeyedVectors instance

and gensim's documentation mentions that:

The reason for separating the trained vectors into KeyedVectors is that if you don’t need the full model state any more (don't need to continue training), the state can be discarded, resulting in a much smaller and faster object ...

I'm assuming the answer is no? Please correct me if I'm wrong.
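
For reference, this is roughly how I'm inspecting the wrapped object; I'm guessing the KeyedVectors instance is exposed as the emb attribute:

from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=100000, dim=100)
# the wrapped gensim KeyedVectors instance (attribute name is my guess)
print(type(bpemb_en.emb))
# the raw embedding matrix: 100k subwords x 100 dimensions
print(bpemb_en.vectors.shape)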

bheinzerling commented 4 years ago

That section of the gensim documentation refers to continuing training with gensim specifically. I didn't train the embeddings with gensim, but with GloVe, so it doesn't really apply here. If you want to continue training the embeddings (this is usually called "fine-tuning"), you can load them into a deep learning framework like PyTorch:

>>> from torch import nn, tensor
>>> from bpemb import BPEmb
>>> bpemb_en = BPEmb(lang="en", vs=100000, dim=100)
>>> # freeze=False keeps the embedding weights trainable; the default (freeze=True) would block fine-tuning
>>> emb_layer = nn.Embedding.from_pretrained(tensor(bpemb_en.vectors), freeze=False)
>>> emb_layer
Embedding(100000, 100)
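
If you want to see the whole thing end to end, here is a rough sketch of one fine-tuning step; the toy classifier, mean pooling, optimizer settings, and dummy batch are placeholder choices for illustration, not something bpemb provides:

import torch
from torch import nn, tensor
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=100000, dim=100)

class ToyClassifier(nn.Module):
    """Minimal model on top of the pre-trained subword embeddings."""
    def __init__(self, vectors, num_classes=2):
        super().__init__()
        # freeze=False keeps the embedding weights trainable
        self.emb = nn.Embedding.from_pretrained(tensor(vectors), freeze=False)
        self.fc = nn.Linear(vectors.shape[1], num_classes)

    def forward(self, subword_ids):
        # mean-pool the subword embeddings, then classify
        return self.fc(self.emb(subword_ids).mean(dim=1))

model = ToyClassifier(bpemb_en.vectors)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# one dummy training step on a single made-up example
ids = torch.tensor([bpemb_en.encode_ids("this is a test")])  # shape: (1, seq_len)
labels = torch.tensor([0])
optimizer.zero_grad()
loss = loss_fn(model(ids), labels)
loss.backward()
optimizer.step()

After the step, the embedding weights have been updated along with the rest of the model, which is all "continuing training" amounts to here.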