Hi there!
This code works just fine for me:
>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()
wiki.en.vec: 6.60GB [05:28, 21.4MB/s]
0%| | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s]
>>> vectors['derang']
tensor([ 0.3663, -0.2729, -0.5492, 0.2594, -0.2059, -0.6579, 0.3311, -0.3561,
-0.0211, -0.4950, 0.2345, 0.5009, 0.1284, -0.0284, 0.4262, 0.1306,
0.0736, -0.1482, 0.1071, 0.3749, -0.3396, 0.2189, -0.0933, -0.6236,
0.2598, 0.1215, 0.3682, 0.0977, 0.3826, 0.2483, 0.0497, 0.3010,
0.1354, -0.1132, 0.3291, 0.1183, 0.0862, -0.2852, -0.2880, 0.4053,
-0.2330, 0.4374, -0.0842, 0.1315, -0.1406, 0.1829, -0.1734, 0.2383,
0.1084, 0.0826, -0.2086, 0.1929, 0.4043, -0.0709, 0.0764, -0.2958,
0.0644, 0.4529, 0.0039, 0.0321, 0.2296, 0.1703, 0.3169, 0.3324,
-0.1998, 0.1265, -0.4961, -0.1126, 0.3073, -0.0775, 0.1673, -0.1065,
0.1746, -0.3484, -0.1683, 0.3709, 0.1794, -0.1061, -0.3025, 0.0797,
0.7037, -0.3384, 0.0654, 0.0047, 0.0675, 0.2268, -0.2287, -0.0502,
-0.1027, -0.1576, 0.0931, -0.5580, 0.3006, -0.6026, 0.0979, -0.1607,
0.2291, 0.2667, -0.2266, 0.3741, -0.3300, 0.2384, -0.1749, 0.1554,
-0.0474, 0.1531, -0.2938, 0.3155, 0.1208, -0.4494, 0.0461, 0.1716,
-0.3338, 0.1848, 0.2872, -0.4439, -0.0408, 0.0823, -0.3677, 0.0684,
0.1709, -0.2148, -0.0842, 0.4830, -0.2937, -0.0804, -0.1713, -0.1559,
-0.1759, 0.1321, 0.0048, 0.1698, 0.1019, 0.1963, 0.0649, -0.0431,
-0.3056, -0.2303, -0.2197, 0.0797, -0.1263, 0.2204, -0.0276, -0.0039,
0.2605, -0.0019, -0.0057, 0.3839, 0.5118, 0.0172, 0.1729, -0.0898,
0.1416, -0.4514, -0.0455, 0.2964, -0.1571, 0.5023, 0.0768, -0.3092,
-0.1937, 0.2595, -0.2484, 0.5232, -0.1842, -0.3832, -0.4159, -0.3071,
0.3744, 0.5791, 0.0642, -0.1190, -0.0598, 0.0508, 0.1179, 0.0383,
-0.3242, 0.1952, -0.0211, -0.1509, -0.4514, -0.1727, -0.0395, -0.4362,
0.3575, 0.1249, 0.0599, 0.0472, 0.6013, 0.1357, -0.0937, 0.1200,
0.1294, 0.4008, -0.1689, 0.1403, -0.7018, -0.0751, -0.6768, -0.1206,
0.5307, -0.0490, -0.1083, 0.2631, 0.0748, -0.1714, 0.1157, 0.3715,
0.6093, 0.3088, 0.4642, 0.0930, 0.0624, -0.0640, 0.1391, -0.7331,
-0.1361, -0.0859, -0.3891, 0.0768, -0.4963, 0.0695, -0.3626, 0.8411,
0.1532, -0.1458, -0.2630, -0.2151, -0.3103, 0.1697, -0.1632, -0.3756,
-0.0803, -0.1968, 0.5468, 0.1773, -0.2990, -0.0036, 0.0758, -0.3991,
-0.0524, 0.2814, -0.2947, -0.1843, 0.3038, 0.4715, -0.3175, 0.1851,
0.0134, -0.1914, 0.4584, 0.2807, 0.1590, 0.3280, 0.3517, 0.3911,
0.1309, -0.2509, -0.0008, -0.2097, 0.2152, 0.1403, 0.3071, 0.0773,
0.1583, -0.6938, 0.0017, -0.3672, 0.1968, 0.0241, -0.5667, 0.1639,
0.0899, -0.1899, -0.1444, 0.3414, 0.4791, 0.0642, 0.0116, -0.1053,
0.5087, 0.0990, 0.1311, 0.3384, -0.3098, -0.1424, -0.0206, -0.1233,
0.1623, -0.0964, -0.2188, 0.4343, 0.1835, -0.0482, -0.3140, 0.2048,
-0.0942, 0.0402, 0.0923, -0.1973])
You must have modified the wiki.en.vec file. Try deleting it with rm -r .word_vectors_cache/wiki.en.vec and rerunning.
Thanks for your reply!
I am running into one more issue:
After downloading the pre-trained embeddings, when they start loading, my RAM fills up and the machine either freezes or throws a memory error. The same happens when I try loading GloVe.
I am not an expert in NLP and have no prior experience with text data. All I want to do is load pre-trained embeddings to get features for the words in my dataset.
I tried on two machines with the following configurations:
Machine 1: Ubuntu 16.04, 24 GB RAM, Python 3.6.4, PyTorch 0.4.1
Machine 2: Ubuntu 14.04, 16 GB RAM, Python 3.6.6, PyTorch 0.4.1
wiki.en.vec: 6.60GB [05:28, 21.4MB/s]  <-- [this step finishes successfully]
0%| | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████████████████████████████████████████████████| 2519371/2519371 [05:19<00:00, 7884.92it/s]  <-- [my RAM starts filling up at this step, freezing the machine or throwing the error I posted in this issue]
Your help is highly appreciated. Thanks.
Yup, this is a known problem. You are attempting to load all 6 gigabytes of embeddings into memory. I'd use is_include to filter the embeddings by your vocabulary (see the sketch below).
There are other, more sophisticated options as well, e.g. https://github.com/vzhong/embeddings
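For illustration, here is a minimal sketch of the is_include approach, assuming you can enumerate your dataset's vocabulary up front. The my_vocab set below is a hypothetical placeholder; is_include is just a callable that takes a token and returns whether to keep its vector:
>>> from torchnlp.word_to_vector import FastText
>>> my_vocab = {'cat', 'dog', 'house'}  # hypothetical: build this from your dataset's tokens
>>> vectors = FastText(is_include=lambda token: token in my_vocab)
>>> vectors['dog'].shape
torch.Size([300])
Only tokens that pass the filter are kept, so memory stays bounded by your vocabulary size, even though the full 6 GB file still has to be downloaded and scanned once.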
Ah, I see. Thank you, I will try these solutions.
Expected Behavior
Load FastText vectors
Environment: Ubuntu 16.04, Python 3.6.4, PyTorch 0.4.1
Actual Behavior
Throws the following error:
File "", line 1, in
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in init
super(FastText, self).init(name, url=url, **kwargs)
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 72, in init
self.cache(name, cache, url=url)
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 153, in cache
word, len(entries), dim))
RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.
Steps to Reproduce the Problem
1. Open a Python console.
2. Run the following code (reproduced after this list).
3. The error shown above is thrown.
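The snippet itself is not shown in the report; going by the reply at the top of the thread, it was presumably the standard two-line FastText load:
>>> from torchnlp.word_to_vector import FastText
>>> vectors = FastText()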