delip / PyTorchNLPBook

Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L

make_embedding_matrix assumes that the words are fed in the same order as the vocab #22

Open sumeetsk opened 4 years ago

sumeetsk commented 4 years ago

I'm studying 5_3_Document_Classification_with_CNN.

The make_embedding_matrix helper's docstring says it should be passed a list of words from the dataset. However, for the resulting matrix to hold the correct pretrained embedding for each word, that list must be in the same order as the vocabulary, and the vocabulary's word indices must have no gaps. These are big assumptions.
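To illustrate the failure mode, here is a sketch of the positional pattern the docstring implies (not the notebook's exact code; the function name and arguments here are hypothetical): row i of the matrix belongs to whichever word happens to be i-th in the input list.

```python
import numpy as np

def make_embedding_matrix_by_position(words, word_to_idx, pretrained_vectors):
    # Row i is filled with the vector for the i-th word in `words`,
    # regardless of what index the vocabulary assigns that word.
    embedding_size = pretrained_vectors.shape[1]
    matrix = np.zeros((len(words), embedding_size))
    for i, word in enumerate(words):
        if word in word_to_idx:
            matrix[i, :] = pretrained_vectors[word_to_idx[word], :]
    return matrix

# If the vocabulary maps "cat" -> 3 but "cat" appears first in `words`,
# row 0 (not row 3) receives cat's vector, so an embedding lookup for the
# model's index 3 returns the wrong (or an all-zero) row.
```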

I think the correct way to construct the embedding matrix is to pass the vocabulary itself to make_embedding_matrix and use the vocabulary's token_to_idx mapping to decide which row of the matrix each pretrained vector should populate.
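A minimal sketch of that idea, with a toy check at the end. The signature is hypothetical, and token_to_idx stands in for whatever token-to-index mapping the book's Vocabulary exposes:

```python
import numpy as np

def make_embedding_matrix(token_to_idx, word_to_idx, pretrained_vectors):
    """Build a matrix whose rows are keyed by the vocabulary's own indices.

    token_to_idx:       the vocabulary's token -> index mapping
    word_to_idx:        pretrained word -> row in `pretrained_vectors`
    pretrained_vectors: ndarray (num_pretrained_words, embedding_size)
    """
    embedding_size = pretrained_vectors.shape[1]
    num_rows = max(token_to_idx.values()) + 1  # tolerates gaps in the indices

    # Random init for tokens without a pretrained vector (and for gap rows).
    matrix = np.random.uniform(-0.25, 0.25, (num_rows, embedding_size))

    # Place each pretrained vector at the row the *vocabulary* assigns to
    # the token, so a lookup by model index always hits the right vector.
    for token, idx in token_to_idx.items():
        if token in word_to_idx:
            matrix[idx, :] = pretrained_vectors[word_to_idx[token], :]

    return matrix

# Toy check: "cat" has vocab index 3, and its vector lands in row 3 even
# though the vocabulary's indices have a gap at 2.
pretrained = np.array([[0.1, 0.2], [0.3, 0.4]])
word_to_idx = {"cat": 0, "dog": 1}
vocab_token_to_idx = {"<unk>": 0, "dog": 1, "cat": 3}
matrix = make_embedding_matrix(vocab_token_to_idx, word_to_idx, pretrained)
assert np.allclose(matrix[3], pretrained[0])
```

This way the matrix is sized and indexed by the vocabulary rather than by the order of an external word list, so neither ordering nor contiguity is assumed.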

Correct me if I'm wrong.