delip / PyTorchNLPBook

Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L
Apache License 2.0
1.96k stars, 799 forks

5_1_Pretrained_Embeddings.ipynb notebook #35

Open: mdzalfirdausi opened this issue 2 years ago

mdzalfirdausi commented 2 years ago

When using the file glove.6B.100d.txt from Kaggle [link], the appropriate from_embeddings_file method is:

@classmethod
def from_embeddings_file(cls, embedding_file):
    """Instantiate from pre-trained vector file.

    Vector file should be of the format:
        word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
        word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

    Args:
        embedding_file (str): location of the file
    Returns:
        instance of PretrainedEmbeddings
    """
    word_to_index = {}
    word_vectors = []

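    # The GloVe text files are UTF-8 encoded; pass the encoding explicitly
    # so the platform default is not used.
    with open(embedding_file, encoding="utf8") as fp:
        # Iterate lazily instead of loading the whole file with readlines()
        for line in fp:
            # Strip the trailing newline before splitting on single spaces
            line = line.rstrip().split(" ")
            word = line[0]
            vec = np.array([float(x) for x in line[1:]])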

            word_to_index[word] = len(word_to_index)
            word_vectors.append(vec)

    return cls(word_to_index, word_vectors)
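
To check the parsing logic without the full GloVe download, the loop above can be exercised on a small in-memory sample. The following is a minimal sketch, not the notebook's code: the function name `read_glove_lines`, the `expected_dim` parameter, and the sample lines are hypothetical, and plain Python lists stand in for the NumPy arrays so the snippet runs with the standard library alone.

```python
import io


def read_glove_lines(fp, expected_dim=None):
    """Parse 'word x0 x1 ... xN' lines from an open text file object.

    Hypothetical helper for illustration only. Returns a word -> index
    dict and a list of float lists (one vector per word).
    """
    word_to_index = {}
    word_vectors = []
    for line_no, line in enumerate(fp, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines or lines with no vector values
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        if expected_dim is None:
            expected_dim = len(values)  # infer the dimension from the first vector
        elif len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        word_to_index[word] = len(word_to_index)
        word_vectors.append(values)
    return word_to_index, word_vectors


# Made-up 3-dimensional sample in the same layout as the GloVe text files
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
word_to_index, word_vectors = read_glove_lines(sample)
```

With a real file, the same function could be called as `read_glove_lines(open(path, encoding="utf8"), expected_dim=100)`; the dimension check is an added safeguard here, not something the original method performs.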