PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Inefficient embedding loading code in README.md #89

Closed JamieCT closed 4 years ago

JamieCT commented 4 years ago

In the README, where loading the GloVe embeddings is demonstrated, the lambda function is extremely inefficient: the encoder vocab is converted into a set every single time the function is called, i.e. once per token in the pretrained vector file. Instead of:

import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

# Inefficient: set(encoder.vocab) is rebuilt on every call to is_include
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in set(encoder.vocab))
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]

it should be:

import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

# Build the set once, outside the lambda, so each is_include call is an O(1) lookup
vocab_set = set(encoder.vocab)

pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
# Copy each vocab token's pretrained vector into the weight matrix
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
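
One possible follow-up (not from the README snippet): the copied weights can be loaded straight into an nn.Embedding layer.

import torch.nn as nn

# Initialize an embedding layer from the copied weights; freeze=False keeps
# them trainable, freeze=True would pin them to the pretrained values.
embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)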

For my personal code (where the vocab was much larger than in the example), this change cut the time taken to load the embeddings from about 2 hours to roughly 10 seconds.
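
To make the difference concrete, here is a minimal timing sketch (the vocab size and token names are made up, not from my code):

import timeit

# set(vocab) costs O(len(vocab)) on every single call, so the slow version
# rebuilds a 50k-element set per membership check.
vocab = ['token%d' % i for i in range(50000)]
vocab_set = set(vocab)

slow = lambda w: w in set(vocab)   # rebuilds the set on each call
fast = lambda w: w in vocab_set    # one O(1) hash lookup per call

print(timeit.timeit(lambda: slow('token0'), number=1000))  # on the order of seconds
print(timeit.timeit(lambda: fast('token0'), number=1000))  # fractions of a millisecond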

babyhandzzz commented 4 years ago

@JamieCT Thank you, it's night and day. Does the cache work well for you?
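
For reference, a minimal sketch of the caching being asked about, assuming keyword arguments are still forwarded to _PretrainedWordVectors:

from torchnlp.word_to_vector import GloVe

# vocab_set is the set built once from encoder.vocab, as in the fixed snippet.
# '.word_vectors_cache' is torchnlp's default cache directory; after the first
# load, a serialized copy is read from there instead of re-parsing the raw file.
pretrained_embedding = GloVe(name='6B', dim=100, cache='.word_vectors_cache',
                             is_include=lambda w: w in vocab_set)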

PetrochukM commented 4 years ago

Good catch. I think I prioritized code simplicity a bit too much. Sorry guys!

I'll change the README in one second.

PetrochukM commented 4 years ago

Fixed!