In the readme where loading the GloVe embeddings is demonstrated, the lambda function is extremely inefficient: the encoder vocab is converted into a set every single time the function is called. Instead of:
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in set(encoder.vocab))
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
it should be:
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
vocab_set = set(encoder.vocab)  # build the set once, outside the lambda
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
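Either way, once embedding_weights is filled in it can be dropped into an embedding layer. This isn't part of the readme snippet, just a minimal sketch assuming the standard torch.nn API and that encoder.encode returns a tensor of token indices:

import torch.nn as nn

# Wrap the copied GloVe weights in an embedding layer (freeze=False keeps them trainable).
embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
token_ids = encoder.encode("you dare laugh")  # tensor of token indices
vectors = embedding(token_ids)                # shape: (num_tokens, 100)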
For my personal code (where the vocab was much larger than in the example), this change cut the time taken to load the embeddings from about 2 hours to roughly 10 seconds.
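The slowdown is easy to reproduce in isolation: the is_include lambda is invoked once per word in the pretrained vector file, and rebuilding a large set on every call adds up quickly. A rough, self-contained sketch with a made-up stand-in vocabulary (absolute numbers will vary by machine):

import timeit

vocab = [f"token{i}" for i in range(100_000)]  # hypothetical stand-in for encoder.vocab

slow = lambda w: w in set(vocab)  # rebuilds the set on every call
vocab_set = set(vocab)
fast = lambda w: w in vocab_set   # reuses one precomputed set

# Each call mimics one is_include check while the vector file is scanned.
print(timeit.timeit(lambda: slow("token42"), number=100))  # noticeably slow
print(timeit.timeit(lambda: fast("token42"), number=100))  # near-instant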