Maluuba / newsqa

Tools for using Maluuba's NewsQA Dataset (public version)
https://www.microsoft.com/en-us/research/project/newsqa-dataset/

Several methods not working (on mac) #19

Closed alontalmor closed 6 years ago

alontalmor commented 6 years ago

The manual setup does not produce combined-newsqa-data-*.csv as stated in the README. The Dockerfile also needs:
RUN apt-get update && apt-get install -y apt-transport-https
Otherwise the build crashes.

Also, the Docker container crashes at execution with "Exception: /usr/src/newsqa/maluuba/newsqa/newsqa-data-v1.csv was not found." This happens after it deletes the copy of the file that I added.

juharris commented 6 years ago

For finding combined-newsqa-data-*.csv: What was the output of python maluuba/newsqa/data_generator.py?

juharris commented 6 years ago

Thanks for pointing out the requirements for the Dockerfile. I just tried it out and all tests pass and all files are found. I'll merge the update soon.

It's intended that the Docker build deletes newsqa-data-v1.csv as per the comment in Dockerfile. We had issues with the files getting extracted from newsqa-data-v1.tar.gz so it's safer just to delete them before building in the specific environment. As long as the build finds newsqa-data-v1.tar.gz (the questions and answers) where the setup instructions say to put it, then everything should be fine.
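
A quick check for the archive before building could look like the sketch below. It assumes the archive belongs in maluuba/newsqa/ relative to the repo root, based on the paths mentioned in this thread; the function name is hypothetical and not part of the repo.

```python
from pathlib import Path

# Assumption: the archive is expected in maluuba/newsqa/ under the repo root,
# as suggested by the paths quoted in this thread.
ARCHIVE_RELATIVE_PATH = Path("maluuba") / "newsqa" / "newsqa-data-v1.tar.gz"


def find_missing_archive(repo_root):
    """Return the expected archive path if it is missing, otherwise None.

    Hypothetical helper for illustration; it is not part of the newsqa tools.
    """
    expected = Path(repo_root) / ARCHIVE_RELATIVE_PATH
    return None if expected.is_file() else expected
```

Calling find_missing_archive(".") from the repo root returns None when the archive is in place, or the path where it was expected when it is not.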

I want to make sure that these tools work for you, so please share the commands you ran along with more of the output so that we can work through this.

alontalmor commented 6 years ago

Thanks for the reply! Regarding maluuba/newsqa/data_generator.py: it creates a directory containing only three CSV files, with the tokenized train, dev, and test data.

Regarding Docker: instead of just newsqa-data-v1.csv, I now put the whole archive, newsqa-data-v1.tar.gz, in maluuba/newsqa/. Now there is a different problem:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

juharris commented 6 years ago

Earlier in the output of data_generator.py, it says where to find combined-newsqa-data-v1.csv (the root of the repo).

Those errors/warnings from the logger are fine.

alontalmor commented 6 years ago

OK, Docker works now, thanks for the help!

By the way, what is the official split for train, dev, and test? It seems Docker generated a single file for all the data.

juharris commented 6 years ago

Great!

That's right, by default, running the Docker container does not do the split. I'll add instructions to the README when I get a chance to fully test them, but this should work:

docker build -t maluuba/newsqa .
docker run --rm -it -v ${PWD}:/usr/src/newsqa --name newsqa maluuba/newsqa /bin/bash
# Now you are in the running container.
source activate newsqa
python maluuba/newsqa/data_generator.py

juharris commented 6 years ago

Well, actually that generates the split data, but it's tokenized. I prefer non-tokenized text with char-based indices so that I can use my own pre-processor and tokenizer. If you do too, then you can get the story IDs for each split from maluuba/newsqa/{train,dev,test}_story_ids.csv.

EDIT: I'm working on producing a JSON format that will simplify this.
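
Until then, splitting the non-tokenized combined file by story ID could be sketched roughly as follows. This assumes that both the combined CSV and the story ID files have a header row with a story_id column; that is an assumption to verify against the actual files. The function names are hypothetical and not part of the repo.

```python
import csv
import io


def read_story_ids(csv_file):
    """Read story IDs from an open CSV file.

    Assumption: the file has a header row with a 'story_id' column.
    """
    return {row["story_id"] for row in csv.DictReader(csv_file)}


def split_rows(combined_file, ids_by_split):
    """Group rows of the combined CSV by the split their story belongs to.

    Rows whose story ID is in none of the given ID sets are skipped.
    """
    splits = {name: [] for name in ids_by_split}
    for row in csv.DictReader(combined_file):
        for name, story_ids in ids_by_split.items():
            if row["story_id"] in story_ids:
                splits[name].append(row)
                break
    return splits


def split_files(repo_root="."):
    """Apply the split to the files named in this thread (paths assumed)."""
    ids_by_split = {}
    for name in ("train", "dev", "test"):
        ids_path = f"{repo_root}/maluuba/newsqa/{name}_story_ids.csv"
        with open(ids_path, newline="", encoding="utf-8") as f:
            ids_by_split[name] = read_story_ids(f)
    combined_path = f"{repo_root}/combined-newsqa-data-v1.csv"
    with open(combined_path, newline="", encoding="utf-8") as f:
        return split_rows(f, ids_by_split)
```

Calling split_files() from the repo root would return a dict mapping "train", "dev", and "test" to lists of rows, each row still holding the original non-tokenized columns.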