aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

loading glove produced files into polyglot? #38

Closed yina closed 8 years ago

yina commented 8 years ago

GloVe produces data in the form shown below. How do I load these GloVe-produced files into polyglot so I can take advantage of the polyglot infrastructure?

in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.036121 0.13085 0.0012462 0.14769 0.26926 0.37144 1.3501 -0.11326 -0.23036 -0.26575 -0.18077 0.092455 -0.16215 0.15003 -0.34547 0.072295 0.40659 0.010021 -0.0079257 -0.11435 0.017008 -0.29789 0.19079 0.37112 -0.26588 0.16212 0.065469 -0.31781 -0.03226 0.081969 0.3445 -0.17362 -0.35745 0.054487 0.39941 0.13699 -0.022066 0.11025 -0.41898 0.1276 -0.095869 -0.17944 -0.17443 0.27302 -0.19464 0.26747 -0.28241 0.1638 -0.11518 0.013196 -0.10616 -0.36093 0.023634 0.13464 0.021652 -0.27094 -0.018737 0.10017 0.36071 -0.093951 0.47634 0.12874 0.0011868 0.1377 -0.14034 -0.1887 -0.16405 -0.15349 0.32347 -0.17616 0.3523 -0.023531 -0.19121 -0.054809 -0.099521 -0.30056 0.36632 -0.21509 0.074123 -0.20267 0.1286 -0.38111 -0.025482 0.45103 0.088633 0.36288 -0.23406 -0.086024 -0.50604 0.034242 0.43998 -0.083023 -0.11969 0.68686 -0.34115 0.21228 0.40039 0.26367 -0.37144 0.16206 -0.42854 0.078658 -0.2905 0.21727 -0.27484 0.35887 0.27055 -0.11326 -0.14848 -0.0050659 -0.076862 0.078621 -0.24922 0.42026 -0.069698 0.071595 0.0071665 0.27473 -0.15664 0.25713 -0.058461 -0.29733 -0.090996 0.5246 0.14889 -0.20883 -0.13004 -0.20022 0.4503 -0.34654 -0.26007 0.35247 -0.34757 0.033738 0.19907 -0.32912 -0.084689 0.65319 0.20954 0.079274 0.1086 0.0026466 -0.12843 -0.22811 0.051501 -0.27429 0.14505 -0.1843 -0.34825 -0.11701 0.34034 0.075848 0.08239 -0.39188 -0.022312 -0.080373 0.14477 0.29701 -0.10523 0.092893 0.029813 -0.11761 0.16308 0.098382 0.46152 -0.162 -0.2456 0.20293 -0.11344 0.057902 -0.19528 -0.20141 -0.22874 -0.014101 0.2637 -0.10028 -0.051896 0.18859 -0.17767 -0.11556 0.121 0.17303 0.11773 0.034837 0.28485 -0.30447 0.061024 -0.26442 -0.081135 -0.044524 -0.036931 -0.15217 0.29175 0.44926 -0.28875 0.33193 -0.01242 -0.18805 -0.19832 -0.19736 0.26893 0.11106 -0.67383 -0.1518 -0.16615 -0.16563 0.0093671 -0.15945 -0.33468 0.22038 -0.16724 -0.1535 -0.61782 -0.17258 0.088928 0.019411 0.18296 0.32967 
-0.0024906 -0.09208 0.514 0.0042484 -0.084377 -0.71448 -0.22148 -0.04835 0.043761 -0.29376 -0.22287 0.18001 0.072197 0.46499 0.056466 0.40844 -0.23641 -0.038946 0.087363 -0.21901 -0.3231 -0.19989 -0.3128 -0.067656 -0.22596 0.090926 0.28365 0.31462 0.46082 -0.024871 -0.14605 0.30454 0.17704 -0.011311 0.26807 -0.032461 -0.16644 -0.15313 -0.20426 -0.3082 -0.2459 0.085848 -0.11767 -0.063056 -0.18133 -0.18629 -0.17694 0.29618 0.35987 0.0020102 0.38616 0.36712 -0.055112 -0.34733 -0.072678 -0.051119 -0.29069 0.053598 0.019587 0.16808 -0.27456 -0.097179 -0.054541 0.19229 -0.48128 -0.20304 0.19368 -0.32546 0.14421 -0.169 0.26501

aboSamoor commented 8 years ago

You can use polyglot to load the embeddings file, or even use gensim. The embeddings will not be useful for the POS or NER tasks, as those already use a different set of embeddings.
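For context on why such a file may not load directly: a GloVe text file is one token per line followed by its vector values, whereas the word2vec text format that tools such as gensim read starts with a header line giving the number of rows and the vector dimension. A minimal sketch of adding that header, using only the standard library, might look like the following. The function name `glove_to_word2vec_text` and the file names are hypothetical, and the sketch assumes tokens contain no spaces.

```python
import os
import tempfile


def _clean_rows(path):
    """Yield each non-empty line of a text file with trailing whitespace removed."""
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.rstrip()
            if line:
                yield line


def glove_to_word2vec_text(glove_path, out_path):
    """Write a copy of a GloVe text file with a '<rows> <dims>' header line.

    Hypothetical helper for illustration; assumes tokens contain no spaces.
    Returns the (rows, dims) pair that was written to the header.
    """
    rows, dims = 0, None
    # First pass: count rows and check every row has the same number of values.
    for line in _clean_rows(glove_path):
        width = len(line.split(" ")) - 1  # first field is the token itself
        if dims is None:
            dims = width
        elif width != dims:
            raise ValueError(f"row {rows + 1} has {width} values, expected {dims}")
        rows += 1
    if rows == 0:
        raise ValueError("no vectors found in " + glove_path)
    # Second pass: write the header, then copy the rows unchanged.
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{rows} {dims}\n")
        for line in _clean_rows(glove_path):
            dst.write(line + "\n")
    return rows, dims


if __name__ == "__main__":
    # Toy two-word file standing in for a real (unzipped) GloVe download.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "toy_glove.txt")
        dst = os.path.join(tmp, "toy_word2vec.txt")
        with open(src, "w", encoding="utf-8") as f:
            f.write("in 0.1 0.2 0.3\nthe 0.4 0.5 0.6\n")
        print(glove_to_word2vec_text(src, dst))
```

The two-pass approach avoids holding a multi-gigabyte file in memory, at the cost of reading it twice.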

yina commented 8 years ago

Thank you for the comment. I tried to load the file directly with the code below, but it didn't work. I also tried unzipping it first and then loading it, with no luck either. I feel like I'm missing something obvious.

from polyglot.mapping import Embedding
embeddings = Embedding.load("glove.840B.300d.zip")
neighbors = embeddings.nearest_neighbors("green")

alantian commented 8 years ago

Loading GloVe models is now supported as of https://github.com/aboSamoor/polyglot/commit/a5b4c9074c9611605a6c5db50cd7b9b9e0cc4ad8

The documentation has also been updated.
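For anyone who wants to sanity-check a GloVe file independently of polyglot, here is a small standard-library sketch that parses lines in the format from the original question and ranks neighbors of a word. It is not polyglot's implementation: the helper names are hypothetical, the toy vectors are made up for illustration, and cosine similarity is simply one reasonable choice of ranking, so results may differ from what polyglot returns.

```python
import math


def parse_glove_lines(lines):
    """Map each token to its vector, given lines of 'token v1 v2 ... vn'."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, word, top_k=3):
    """Return the top_k other tokens ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


# Toy 2-dimensional data, invented for this example.
toy = parse_glove_lines([
    "green 1.0 0.0",
    "blue 0.9 0.1",
    "red 0.8 0.3",
    "car 0.0 1.0",
])
print(nearest_neighbors(toy, "green", top_k=2))
```

This brute-force scan is linear in the vocabulary size per query, which is fine for spot checks but slow for repeated queries over a full 840B-token model.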