alontalmor closed this issue 6 years ago.
For finding combined-newsqa-data-*.csv: what was the output of python maluuba/newsqa/data_generator.py?
Thanks for pointing out the requirements for the Dockerfile. I just tried it out; all tests pass and all files are found. I'll merge the update soon.
It's intended that the Docker build deletes newsqa-data-v1.csv, as per the comment in the Dockerfile. We had issues with the files getting extracted from newsqa-data-v1.tar.gz, so it's safer to just delete them before building in the specific environment. As long as the build finds newsqa-data-v1.tar.gz (the questions and answers) where the setup instructions say to put it, everything should be fine.
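For anyone hitting this later, a small pre-build check along these lines can save a failed build. This is only a sketch based on this thread: the archive location comes from the setup instructions discussed above, so check the README for the authoritative path.

```python
import os

# Location where the setup instructions say to put the questions/answers
# archive (taken from this thread; verify against the README).
ARCHIVE = os.path.join("maluuba", "newsqa", "newsqa-data-v1.tar.gz")

if not os.path.isfile(ARCHIVE):
    raise SystemExit(
        "Put newsqa-data-v1.tar.gz at %s before running docker build." % ARCHIVE
    )
print("Archive found; the Docker build should be able to extract it.")
```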
I want to make sure that these tools work for you, so please share the commands you ran along with more of the output so that we can work through this.
Thanks for the reply! Regarding maluuba/newsqa/data_generator.py: it creates a directory containing three CSVs, holding only the tokenized train, dev, and test data.
Regarding the Docker setup: I've changed the file I put in maluuba/newsqa/ from just newsqa-data-v1.csv to the full archive, newsqa-data-v1.tar.gz. Now there is a different challenge:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Earlier in the output of data_generator.py, it says where to find combined-newsqa-data-v1.csv (the root of the repo).
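Once the generator has run, a minimal pandas check like the one below can confirm the combined file is readable. This is just a sketch (pandas is assumed to be installed); it inspects whatever columns are actually present rather than assuming a schema.

```python
import pandas as pd

# combined-newsqa-data-v1.csv is written to the root of the repo,
# per the generator's output mentioned above.
df = pd.read_csv("combined-newsqa-data-v1.csv", encoding="utf-8")

print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # inspect the actual column names
print(df.head(2))           # peek at a couple of records
```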
Those errors/warnings from the logger are fine.
OK, the Docker setup works now. Thanks for the help!
By the way, what is the official train/dev/test split? It seems the Docker run generated one file for all the data.
Great!
That's right: by default, running the Docker container does not do the split. I'll add instructions to the README when I get a chance to fully test, but this should work:
docker build -t maluuba/newsqa .
docker run --rm -it -v ${PWD}:/usr/src/newsqa --name newsqa maluuba/newsqa /bin/bash
# Now you are in the running container.
source activate newsqa
python maluuba/newsqa/data_generator.py
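As a quick sanity check after the generator finishes, something like this confirms the split files landed. The output directory below is a guess on my part; data_generator.py prints where it writes its output, so adjust the path to match.

```python
import os

# Hypothetical output directory; use whichever path data_generator.py reports.
SPLIT_DIR = os.path.join("maluuba", "newsqa", "split_data")

for split in ("train", "dev", "test"):
    path = os.path.join(SPLIT_DIR, split + ".csv")
    print(path, "exists" if os.path.isfile(path) else "MISSING")
```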
Well, actually, that generates the split data, but it's tokenized. I prefer non-tokenized text with character-based indices so that I can use my own pre-processor and tokenizer. If you do too, then you can get the story IDs for the split from maluuba/newsqa/{train,dev,test}_story_ids.csv (a sketch of splitting that way is below).
EDIT: I'm working on producing a JSON format that will simplify this.
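For the character-indexed route, here is a rough sketch of splitting the combined file by those story ID lists. The column names are assumptions on my part (in particular the story_id column and the layout of the *_story_ids.csv files), so verify them against the real headers.

```python
import pandas as pd

# Non-tokenized, character-indexed data produced at the repo root.
combined = pd.read_csv("combined-newsqa-data-v1.csv", encoding="utf-8")

splits = {}
for split in ("train", "dev", "test"):
    ids = pd.read_csv("maluuba/newsqa/%s_story_ids.csv" % split)
    id_col = ids.columns[0]  # assume the story IDs sit in the first column
    # "story_id" is my guess at the combined file's ID column; adjust if needed.
    splits[split] = combined[combined["story_id"].isin(set(ids[id_col]))]

for name, frame in splits.items():
    print(name, len(frame))
```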
The manual setup does not produce combined-newsqa-data-*.csv as stated in the README. Also, the Dockerfile needs:

RUN apt-get update && apt-get install -y apt-transport-https

or else it crashes during the build.
The Docker container also crashes at execution with:

Exception: /usr/src/newsqa/maluuba/newsqa/newsqa-data-v1.csv was not found.

after it deletes the copy of the file I had added.