commonsense / conceptnet-numberbatch


error while getting datasets with "git annex get" #38

Closed mmallad closed 7 years ago

mmallad commented 7 years ago

I have done all the setup and tried to get the datasets with the git annex command, but it says:

```
get code/source-data/conceptnet5.5.csv
  Remote origin not usable by git-annex; setting annex-ignore
  (not available)
  No other repository is known to contain the file.
failed
get code/source-data/conceptnet5.csv (not available)
  No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.300d.npy (not available)
  No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.600d.npy (not available)
  No other repository is known to contain the file.
failed
get code/source-data/ensemble-2016-03.labels (not available)
  No other repository is known to contain the file.
failed
get code/source-data/glove.42B.300d.txt (not available)
  No other repository is known to contain the file.
failed
get code/source-data/glove12.840B.300d.txt (not available)
  No other repository is known to contain the file.
failed
get code/source-data/ppdb-xl-lexical.csv (not available)
  No other repository is known to contain the file.
failed
git-annex: get: 9 failed
```

rspeer commented 7 years ago

Sorry! I was cleaning up old branches and thought `git-annex` was a feature branch from when I introduced git-annex, but actually it's the branch where git-annex keeps all its information about where to find files.

I've restored the branch and it should work now. But man, I am not using this inscrutable, haphazardly-documented utility again.

mmallad commented 7 years ago

Thank you for the reply. I downloaded all the datasets myself, fixed some format issues, and now it's all running okay. I would like to ask some questions about it; can you please provide your email address? Thank you.

jatin270 commented 6 years ago

Hey, can you tell me how you downloaded the datasets?

rspeer commented 6 years ago

@jatin270 Can you be more specific about what you're looking for?

Currently, all parts of ConceptNet, including Numberbatch, are built using the code in https://github.com/commonsense/conceptnet5. Its build script will download the input data from Zenodo: https://zenodo.org/record/998169

jatin270 commented 6 years ago

@rspeer can you tell me the format of the conceptnet5.csv file, i.e. in what way the data is stored in it?

rspeer commented 6 years ago

I'm really going to need you to be more specific about what you're asking, but is this what you're looking for? https://github.com/commonsense/conceptnet5/wiki/Downloads
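For reference, the assertion dumps linked there are tab-separated, with (roughly) five columns per row: assertion URI, relation, start concept, end concept, and a JSON blob of metadata. A minimal parsing sketch, where the sample row is made up for illustration rather than copied from a real dump:

```python
import json

# An illustrative row in the ConceptNet assertions CSV layout (tab-separated).
# The URIs and metadata here are invented examples, not real dump contents.
sample = (
    "/a/[/r/RelatedTo/,/c/en/cat/,/c/en/pet/]\t"
    "/r/RelatedTo\t/c/en/cat\t/c/en/pet\t"
    '{"weight": 1.0, "dataset": "/d/example"}'
)

# The last column is JSON, which never contains a raw tab, so a plain
# split on tabs is enough to separate the five fields.
uri, rel, start, end, info_json = sample.split("\t")
info = json.loads(info_json)

print(rel, start, end, info["weight"])
```

Each line of a real dump can be handled the same way while streaming the file, since no field spans multiple lines.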

jatin270 commented 6 years ago

@rspeer

```
get source-data/w2v-google-news.bin.gz (from web...)
  Unable to access these remotes: web
  Try making some of these repositories available:
    00000000-0000-0000-0000-000000000001 -- web
    2feaff51-0a4a-4afc-8b56-b9a553161e49 -- rspeer@buffy:~/conceptnet-retrofitting-paper
    54f3fff2-290c-42f1-908f-a6b2c9785668 -- media-lab-rsync
    7ffdd42d-dce8-4cb8-904e-d09097500dfa -- rspeer@ip-10-23-1-47:~/conceptnet-retrofitting-paper
    91510204-049b-4033-a6cf-0fe419754978 -- mungojerrie 2.7TB HD
    dd2f35a5-4cde-4d3b-a6a1-69167174aea0 -- rspeer@buffy:/home/rspeer/conceptnet-retrofitting-paper
  (Note that these git remotes have annex-ignore set: origin)
failed
```

I am getting this error. How can I fix it?

tukeyclothespin commented 6 years ago

I have the same 'Unable to access these remotes: web' response to 'git annex get' as @jatin270 for all of the datasets.

'git annex whereis' shows references to http://conceptnet-api-1.media.mit.edu, which does not resolve to an IP.

```
whereis source-data/w2v-google-news.bin.gz (5 copies)
  00000000-0000-0000-0000-000000000001 -- web
  2feaff51-0a4a-4afc-8b56-b9a553161e49 -- rspeer@buffy:/conceptnet-retrofitting-paper
  54f3fff2-290c-42f1-908f-a6b2c9785668 -- media-lab-rsync
  7ffdd42d-dce8-4cb8-904e-d09097500dfa -- rspeer@ip-10-23-1-47:~/conceptnet-retrofitting-paper
  91510204-049b-4033-a6cf-0fe419754978 -- mungojerrie 2.7TB HD

The following untrusted locations may also have copies:
  dd2f35a5-4cde-4d3b-a6a1-69167174aea0 -- rspeer@buffy:/home/rspeer/conceptnet-retrofitting-paper

web: http://conceptnet-api-1.media.mit.edu/downloads/annex/vector-ensemble/033/71c/SHA256E-s1647046227--21c05ae916a67a4da59b1d006903355cced7de7da1e42bff9f0504198c748da8.bin.gz/SHA256E-s1647046227--21c05ae916a67a4da59b1d006903355cced7de7da1e42bff9f0504198c748da8.bin.gz
ok
```

It looks like some of them are in https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/:

- https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/glove12.840B.300d.txt.gz
- https://s3.amazonaws.com/conceptnet/raw-data/2016/vectors/GoogleNews-vectors-negative300.bin.gz

You can peruse the list at https://s3.amazonaws.com/conceptnet/
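Fetching that bucket URL returns S3's `ListBucketResult` XML rather than an HTML page. A small sketch of extracting the object keys from such a listing; the XML fragment below is made up in the shape of a real response, so substitute the document actually returned by the URL above:

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the shape of S3's ListBucketResult response.
listing = """<?xml version="1.0"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Contents><Key>raw-data/2016/vectors/glove12.840B.300d.txt.gz</Key></Contents>
  <Contents><Key>raw-data/2016/vectors/GoogleNews-vectors-negative300.bin.gz</Key></Contents>
</ListBucketResult>"""

# S3 puts all elements in its own XML namespace, so queries must qualify them.
ns = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
root = ET.fromstring(listing)
keys = [el.text for el in root.findall(".//s3:Key", ns)]
print(keys)
```

Each key can then be appended to the bucket URL to download the corresponding file.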

rspeer commented 6 years ago

@tukeyclothespin, could you tell me where you encountered the directions to use git annex? Those directions are very old and I need to get rid of things that point to them.

The built data in .csv format is available from https://github.com/commonsense/conceptnet5/wiki/Downloads, and the raw input data is hosted at https://zenodo.org/record/1165009.

Unlike git-annex, Zenodo is very well suited for long-term data hosting.

tukeyclothespin commented 6 years ago

Hi @rspeer thanks for responding.

I was attempting to build your original ConceptNet-Numberbatch from your paper. Your main readme (https://github.com/commonsense/conceptnet-numberbatch/blob/master/README.md) states to use branch 16.04 to recreate your 2016 paper. I encountered the git annex issue following the readme in branch 16.04 (https://github.com/commonsense/conceptnet-numberbatch/blob/16.04/README.md).

I wasn't able to resolve the git annex issue, so I tried downloading GloVe, Word2Vec, and PPDB into a folder myself; while 'python ninja.py' doesn't complain, ninja segfaults immediately. Do I understand you correctly that the conceptnet-raw-data-5.6.zip and conceptnet-assertions-5.5.5.csv.gz files from your links above contain the data files that the git annex step previously pulled down?

I am trying to recreate your 2016 paper as I have two word embedding models (Word2Vec and FastText) that I have trained on a specialized vocabulary and ConceptNet-Numberbatch looks very appealing as a way to fuse the generalized vocabulary from Glove and Google News Word2Vec with my models.

rspeer commented 6 years ago

Ah, no, the .csv.gz file was in response to what @jatin270 was looking for.

I can look for the files that are inputs to the 2016 paper, but probably nothing I do will save git-annex from the bit rot that's seemingly designed into it.

I'm supposing you went with the 2016 paper because the instructions for reproducing the AAAI 2017 paper using Docker were too daunting? I'm working on making it possible to build ConceptNet Numberbatch without all the sysadminnery, but it's going to be a newer (better) version, not an exact replication.

tukeyclothespin commented 6 years ago

Yes, I went with the 2016 paper and branch 16.04 because the instructions were more approachable, I don't need the web api features from Conceptnet, and I can run docker but not docker compose on my infrastructure. I am going to try the raw build instructions at https://github.com/commonsense/conceptnet5/wiki/Build-process.

Beyond that, I am open to building and running any version of ConceptNet-Numberbatch. My goal is to add my pretrained word embedding models of specialized vocabulary into the ConceptNet-Numberbatch build and evaluate the word embedding results via our own use case metric that we already have defined. I just need access to the terms and vectors output by ConceptNet-Numberbatch to see how they score on our metric. That's why the simplicity of the branch 16.04 instructions was appealing.

Can you explain how I build Conceptnet Numberbatch using the conceptnet5.vectors package per the readme? I have the conceptnet5 package installed and can import conceptnet5.vectors at the python3 interpreter but help(conceptnet5.vectors) shows functions related to vector comparisons.

"Since 2016, the code for building ConceptNet Numberbatch is part of the ConceptNet code base, in the conceptnet5.vectors package."

rspeer commented 6 years ago

I'll work on updating the documentation. I've put the data files you need up on Zenodo: https://zenodo.org/record/1208722

Download these files and put them in your source-data directory, and you should be able to run the 16.04 build.

Note that this is for reproducing the paper, and the distinct feature of this paper compared to the others is that we tried various combinations of data sources and parameters. The build process, as described, will build all of them.
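As a quick sanity check before running the build, one can verify that everything landed in source-data. This is just a sketch: the file names below are copied from the failed `git annex get` output earlier in this thread, and should be adjusted to whatever the Zenodo record actually contains.

```python
from pathlib import Path

# File names taken from the failed `git annex get` output earlier in this
# thread; adjust the list to match the files in the Zenodo record.
EXPECTED = [
    "conceptnet5.5.csv",
    "glove.42B.300d.txt",
    "glove12.840B.300d.txt",
    "ppdb-xl-lexical.csv",
    "w2v-google-news.bin.gz",
]

def missing_files(source_dir):
    """Return the expected files not yet present in source_dir."""
    source = Path(source_dir)
    return [name for name in EXPECTED if not (source / name).exists()]
```

Running `missing_files("code/source-data")` before `python ninja.py` makes a missing download obvious instead of surfacing as a mid-build failure.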

rspeer commented 6 years ago

The modern build process, using the conceptnet5 package, is described at: https://github.com/commonsense/conceptnet5/wiki/Build-process

tukeyclothespin commented 6 years ago

Thanks for your patience, I really appreciate it. I downloaded the data files from your link, put them into the source-data directory of my 16.04 version, ran python ninja.py and then sudo ninja, but got a segfault almost immediately. I am going to take a break from looking at that, because I was separately able to work through the build process using the current conceptnet5 package.

I noticed your comment in the google group to install the dependencies and run snakemake data/vectors/numberbatch.h5 to build the Numberbatch vectors. That worked and I was able to load the Numberbatch vectors in gensim!
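(For anyone following along: gensim's `KeyedVectors.load_word2vec_format` reads the word2vec text distribution of Numberbatch directly. The format itself is simple enough to sketch a library-free reader for; the tiny sample below is made up, and a real file would be opened with `gzip.open(path, "rt", encoding="utf-8")`.)

```python
import io

def read_word2vec_text(lines):
    """Parse the word2vec text format: a 'count dim' header line,
    then one 'term v1 v2 ... vN' line per vector."""
    count, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

# A tiny in-memory sample in the same format: 2 terms, 3 dimensions.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
dim, vectors = read_word2vec_text(sample)
print(dim, vectors["cat"])
```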

Now I want to add my own pretrained word embeddings into Numberbatch this time so I put a gzipped file of the word2vec text format of my model terms/vectors in data/raw/vectors as you suggested in the google group thread. During snakemake I saw:

```
rule convert_word2vec:
    input: data/raw/vectors/GoogleNews-vectors-negative300.bin.gz
    output: data/vectors/w2v-google-news.h5
    jobid: 14
    resources: ram=24
```

But I never saw snakemake run convert_word2vec on the word2vec .gz file I had in data/raw/vectors. The initial job list also stated that there was only one job for convert_word2vec. Do I need to add my new file name to a conceptnet or snakemake configuration script? I noticed the four other input embeddings are mentioned in the Snakefile:

```
INPUT_EMBEDDINGS = [
    'crawl-300d-2M', 'w2v-google-news',
    'glove12-840B', 'fasttext-opensubtitles'
]
```

tukeyclothespin commented 6 years ago

I think I have it:

1) Put my pretrained term/vector output file in data/raw/vectors/
2) Make a new rule in the Snakefile with the input as my file in data/raw/vectors/ and the output as an .h5 file name in data/vectors, using either convert_word2vec or convert_fasttext as the template depending on whether the file is binary or text.
3) Add the name of the .h5 file output by the new rule to the INPUT_EMBEDDINGS list in the Snakefile
4) snakemake clean
5) snakemake data/vectors/numberbatch.h5
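The steps above can be sketched as a Snakefile fragment. Everything here is hypothetical: the rule name and file names are placeholders, and the rule body should be copied from the real convert_word2vec / convert_fasttext rules in the conceptnet5 Snakefile rather than from this comment.

```python
# Hypothetical Snakefile sketch -- names and paths are placeholders.

INPUT_EMBEDDINGS = [
    'crawl-300d-2M', 'w2v-google-news', 'glove12-840B',
    'fasttext-opensubtitles',
    'my-domain-vectors',   # step 3: register the new .h5 name
]

rule convert_my_domain_vectors:   # step 2: new conversion rule
    input:
        "data/raw/vectors/my-domain-vectors.txt.gz"   # step 1: raw file
    output:
        "data/vectors/my-domain-vectors.h5"
    run:
        # Copy the body of convert_fasttext here for gzipped word2vec *text*
        # input; use convert_word2vec's body for binary input instead.
        ...
```

After that, `snakemake clean` followed by `snakemake data/vectors/numberbatch.h5` (steps 4 and 5) should pick up the new embedding.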