PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License

RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions. #57

Closed: aurooj closed this issue 5 years ago

aurooj commented 5 years ago

Expected Behavior

Load FastText vectors

Environment: Ubuntu 16.04, Python 3.6.4, PyTorch 0.4.1

Actual Behavior

Throws the following error:

File "", line 1, in File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in init super(FastText, self).init(name, url=url, **kwargs) File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 72, in init self.cache(name, cache, url=url) File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 153, in cache word, len(entries), dim)) RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

Steps to Reproduce the Problem

  1. Open a python console.

  2. Run the following code:

       from torchnlp.word_to_vector import FastText
       vectors = FastText()

  3. The error above is thrown.

PetrochukM commented 5 years ago

Hi There!

This code works just fine for me:

>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()
wiki.en.vec: 6.60GB [05:28, 21.4MB/s]
  0%|                                                                      | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s]
>>> vectors['derang']
tensor([ 0.3663, -0.2729, -0.5492,  0.2594, -0.2059, -0.6579,  0.3311, -0.3561,
        -0.0211, -0.4950,  0.2345,  0.5009,  0.1284, -0.0284,  0.4262,  0.1306,
         0.0736, -0.1482,  0.1071,  0.3749, -0.3396,  0.2189, -0.0933, -0.6236,
         0.2598,  0.1215,  0.3682,  0.0977,  0.3826,  0.2483,  0.0497,  0.3010,
         0.1354, -0.1132,  0.3291,  0.1183,  0.0862, -0.2852, -0.2880,  0.4053,
        -0.2330,  0.4374, -0.0842,  0.1315, -0.1406,  0.1829, -0.1734,  0.2383,
         0.1084,  0.0826, -0.2086,  0.1929,  0.4043, -0.0709,  0.0764, -0.2958,
         0.0644,  0.4529,  0.0039,  0.0321,  0.2296,  0.1703,  0.3169,  0.3324,
        -0.1998,  0.1265, -0.4961, -0.1126,  0.3073, -0.0775,  0.1673, -0.1065,
         0.1746, -0.3484, -0.1683,  0.3709,  0.1794, -0.1061, -0.3025,  0.0797,
         0.7037, -0.3384,  0.0654,  0.0047,  0.0675,  0.2268, -0.2287, -0.0502,
        -0.1027, -0.1576,  0.0931, -0.5580,  0.3006, -0.6026,  0.0979, -0.1607,
         0.2291,  0.2667, -0.2266,  0.3741, -0.3300,  0.2384, -0.1749,  0.1554,
        -0.0474,  0.1531, -0.2938,  0.3155,  0.1208, -0.4494,  0.0461,  0.1716,
        -0.3338,  0.1848,  0.2872, -0.4439, -0.0408,  0.0823, -0.3677,  0.0684,
         0.1709, -0.2148, -0.0842,  0.4830, -0.2937, -0.0804, -0.1713, -0.1559,
        -0.1759,  0.1321,  0.0048,  0.1698,  0.1019,  0.1963,  0.0649, -0.0431,
        -0.3056, -0.2303, -0.2197,  0.0797, -0.1263,  0.2204, -0.0276, -0.0039,
         0.2605, -0.0019, -0.0057,  0.3839,  0.5118,  0.0172,  0.1729, -0.0898,
         0.1416, -0.4514, -0.0455,  0.2964, -0.1571,  0.5023,  0.0768, -0.3092,
        -0.1937,  0.2595, -0.2484,  0.5232, -0.1842, -0.3832, -0.4159, -0.3071,
         0.3744,  0.5791,  0.0642, -0.1190, -0.0598,  0.0508,  0.1179,  0.0383,
        -0.3242,  0.1952, -0.0211, -0.1509, -0.4514, -0.1727, -0.0395, -0.4362,
         0.3575,  0.1249,  0.0599,  0.0472,  0.6013,  0.1357, -0.0937,  0.1200,
         0.1294,  0.4008, -0.1689,  0.1403, -0.7018, -0.0751, -0.6768, -0.1206,
         0.5307, -0.0490, -0.1083,  0.2631,  0.0748, -0.1714,  0.1157,  0.3715,
         0.6093,  0.3088,  0.4642,  0.0930,  0.0624, -0.0640,  0.1391, -0.7331,
        -0.1361, -0.0859, -0.3891,  0.0768, -0.4963,  0.0695, -0.3626,  0.8411,
         0.1532, -0.1458, -0.2630, -0.2151, -0.3103,  0.1697, -0.1632, -0.3756,
        -0.0803, -0.1968,  0.5468,  0.1773, -0.2990, -0.0036,  0.0758, -0.3991,
        -0.0524,  0.2814, -0.2947, -0.1843,  0.3038,  0.4715, -0.3175,  0.1851,
         0.0134, -0.1914,  0.4584,  0.2807,  0.1590,  0.3280,  0.3517,  0.3911,
         0.1309, -0.2509, -0.0008, -0.2097,  0.2152,  0.1403,  0.3071,  0.0773,
         0.1583, -0.6938,  0.0017, -0.3672,  0.1968,  0.0241, -0.5667,  0.1639,
         0.0899, -0.1899, -0.1444,  0.3414,  0.4791,  0.0642,  0.0116, -0.1053,
         0.5087,  0.0990,  0.1311,  0.3384, -0.3098, -0.1424, -0.0206, -0.1233,
         0.1623, -0.0964, -0.2188,  0.4343,  0.1835, -0.0482, -0.3140,  0.2048,
        -0.0942,  0.0402,  0.0923, -0.1973])

You must have modified the wiki.en.vec file. Try deleting it with rm -r .word_vectors_cache/wiki.en.vec and rerunning.

aurooj commented 5 years ago

Thanks for your reply!

I am running into one more issue:

After downloading the pre-trained embeddings, when it starts loading them, my RAM fills up and then the machine freezes or gives me a memory error. The same thing happens when I try loading GloVe.

I am not an expert in NLP and have no prior experience with text. All I want to do is load pre-trained embeddings and get features for the words in my dataset.

I tried on two machines with the following configurations:

Machine 1: Ubuntu 16.04, 24GB RAM, Python 3.6.4, PyTorch 0.4.1

Machine 2: Ubuntu 14.04, 16GB RAM, Python 3.6.6, PyTorch 0.4.1

wiki.en.vec: 6.60GB [05:28, 21.4MB/s]   <-- [this step finishes successfully]
  0%|                                                                      | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s]   <-- [my RAM starts filling up at this step, resulting in my machine freezing or throwing the error I posted in this issue]

Your help is highly appreciated. Thanks.

PetrochukM commented 5 years ago

Yup, this is a known problem. You are attempting to load all 6 gigabytes of embeddings into memory at once. I'd use is_include to filter the embeddings down to your vocabulary.
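For example, a minimal sketch of that approach (my_vocab below is a placeholder for the set of tokens that actually occur in your dataset):

    from torchnlp.word_to_vector import FastText

    # Placeholder vocabulary; replace with the tokens from your own dataset.
    my_vocab = {'cat', 'dog', 'house'}

    # is_include is called for every token in wiki.en.vec; only tokens for which it
    # returns True are kept, so the full 6+ GB of vectors is never held in RAM at once.
    vectors = FastText(is_include=lambda token: token in my_vocab)

    print(vectors['dog'].shape)  # torch.Size([300])

The file is still read once end to end, but memory use then scales with your vocabulary rather than with the 2.5M tokens in wiki.en.vec.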

There are also more sophisticated options, e.g. https://github.com/vzhong/embeddings.
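As an illustration of that on-demand style (a sketch based on that project's README; the class name and arguments are assumptions about the embeddings package, so check them against the version you install):

    from embeddings import GloveEmbedding

    # Vectors are kept in a local database and fetched per word, so lookups
    # do not require holding the whole embedding file in RAM.
    g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
    print(g.emb('dog'))  # a 300-element list for known words

This trades a slower first lookup (the vectors are downloaded and indexed once) for a much smaller memory footprint afterwards.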

aurooj commented 5 years ago

Ah, I see. Thank you, I will try these solutions.