Maluuba / newsqa

Tools for using Maluuba's NewsQA Dataset (public version)
https://www.microsoft.com/en-us/research/project/newsqa-dataset/

I followed the recommended steps, but I can't make it work #39

Closed GustavoJE closed 3 years ago

GustavoJE commented 3 years ago

So I downloaded the CSV from Microsoft's site (which, by the way, is not a tar.gz), then downloaded "cnn.tgz" and "cnn_stories.tgz" and put them into the maluuba/newsqa folder along with "newsqa-data-v1.csv". Then I built the Docker image and finally ran it. However, I get the following error:

`EE

ERROR: setUpClass (maluuba.newsqa.tests.test_tokenize.TestNewsQaTokenize)

Traceback (most recent call last):
  File "/usr/src/newsqa/maluuba/newsqa/tests/test_tokenize.py", line 32, in setUpClass
    NewsQaDataset().dump(path=combined_data_path)
  File "/usr/src/newsqa/maluuba/newsqa/data_processing.py", line 80, in __init__
    "\n See the README in the root of this repo for more details." % dataset_path)
Exception: /usr/src/newsqa/maluuba/newsqa/newsqa-data-v1.csv was not found. For legal reasons, you must first accept the terms and download the dataset from https://msropendata.com/datasets/939b1042-6402-4697-9c15-7a28de7e1321
See the README in the root of this repo for more details.

======================================================================
ERROR: setUpClass (maluuba.newsqa.tests.test_newsqa.TestNewsQa)

Traceback (most recent call last):
  File "/usr/src/newsqa/maluuba/newsqa/tests/test_newsqa.py", line 36, in setUpClass
    cls.newsqa_dataset = NewsQaDataset()
  File "/usr/src/newsqa/maluuba/newsqa/data_processing.py", line 80, in __init__
    "\n See the README in the root of this repo for more details." % dataset_path)
Exception: /usr/src/newsqa/maluuba/newsqa/newsqa-data-v1.csv was not found. For legal reasons, you must first accept the terms and download the dataset from https://msropendata.com/datasets/939b1042-6402-4697-9c15-7a28de7e1321
See the README in the root of this repo for more details.

Ran 0 tests in 0.001s

FAILED (errors=2)`

juharris commented 3 years ago

Thanks for trying newsqa! Right, we should update the instructions about the .tar.gz.

Can you share the full Docker command that you ran AND where you run it from (it should be from the root of the repo)? I suspect that there could be a problem with the mounting (-v parameter). Try:

docker run --rm -it -v ${PWD}:/usr/src/newsqa --name newsqa maluuba/newsqa /bin/bash --login -c 'ls /usr/src/newsqa/maluuba/newsqa'

You can also try to set up the -v parameter explicitly instead of using ${PWD}.
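For what it's worth, here is a rough Python sketch of what "explicit" could look like: it resolves the repo root to an absolute path and builds the same `ls` check as an argument list, so the shell never has to expand `${PWD}`. The function name `build_ls_command` is made up for illustration and is not part of the repo; the image name and mount target are simply the ones from the command above.

```python
from pathlib import Path
import shlex

# Mount target inside the container, as used in the command above.
CONTAINER_ROOT = "/usr/src/newsqa"


def build_ls_command(repo_root):
    """Return a docker argv that lists the mounted data folder.

    Hypothetical helper for illustration; not part of the newsqa repo.
    """
    # An explicit absolute host path instead of relying on ${PWD}.
    host_root = Path(repo_root).resolve()
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{host_root}:{CONTAINER_ROOT}",
        "--name", "newsqa", "maluuba/newsqa",
        "/bin/bash", "--login", "-c",
        f"ls {CONTAINER_ROOT}/maluuba/newsqa",
    ]


if __name__ == "__main__":
    argv = build_ls_command(".")
    # Print a copy-pasteable, correctly quoted command line.
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Passing the arguments as a list means a host path containing spaces stays a single `-v` value, with no quoting needed.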

juharris commented 3 years ago

Oh, I think I see what happened with the download: there are a few options, and you're not really required to use the tar.gz option. [screenshot of the download options]

GustavoJE commented 3 years ago

Oh, I didn't see that option. So I should have put the tar.gz file instead of the .csv in the maluuba/newsqa folder? I will try it as soon as I can and update this issue. Thank you!

juharris commented 3 years ago

It should work with just the .csv.

GustavoJE commented 3 years ago

I just ran the command from the repo root like this:

docker run --rm -it -v "${PWD}:/usr/src/newsqa" --name newsqa maluuba/newsqa /bin/bash --login -c 'ls /usr/src/newsqa/maluuba/newsqa' (note the added quotes on the -v argument)

and got:

TokenizerSplitter.java cnn.tgz data_generator.py dev_story_ids.csv simplify.py split_dataset.py stories_requiring_two_extra_newlines.csv test_story_ids.csv tokenize_dataset.py __init__.py cnn_stories.tgz data_processing.py newsqa-data-v1.csv span_utils.py stories_requiring_extra_newline.csv stories_to_decode_specially.csv tests train_story_ids.csv

I will now try defining the -v path explicitly.

juharris commented 3 years ago

That looks good. I see "newsqa-data-v1.csv" there, so maybe the double quotes helped? Maybe you have a space in your ${PWD}? Try to run the original docker run command that gave you issues, but use the double quotes for the -v parameter.
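To illustrate why a space in the path would matter, here is a small sketch that uses Python's `shlex.split` as a stand-in for shell word splitting. The path `/home/user/my repo` is made up, and `shlex` only approximates what a real shell does after expanding `${PWD}`, so treat this as an illustration of the hypothesis rather than a diagnosis.

```python
import shlex

# Hypothetical host path containing a space.
path_with_space = "/home/user/my repo"

# What the shell would see after expanding an unquoted ${PWD}.
unquoted = f"-v {path_with_space}:/usr/src/newsqa"
# The same argument with double quotes around the whole -v value.
quoted = f'-v "{path_with_space}:/usr/src/newsqa"'

unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

# Unquoted: the mount spec is broken into two separate words.
print(unquoted_args)
# Quoted: the mount spec stays a single argument.
print(quoted_args)
```

In the unquoted case Docker would receive a truncated host path as the `-v` value and a stray extra argument, which is the kind of mounting problem suspected here.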

GustavoJE commented 3 years ago

If I use the original command with the double quotes, I get the same errors I posted first: the test complains that it can't find newsqa-data-v1.csv. I forgot to mention that.

juharris commented 3 years ago

Weird. Idk why this is happening. Can you try with the newsqa-data-v1.tar.gz? It also goes in the maluuba/newsqa folder.
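A quick way to see which of the files mentioned in this thread are actually in that folder is a small check like the one below, run from the repo root. This is only a sketch: `report_dataset_files` is a hypothetical helper, and the file list is just the names that came up in this issue, not an authoritative list of what the tests require.

```python
from pathlib import Path

# File names mentioned in this thread (an assumption, not an
# authoritative list of what the repo requires).
CANDIDATE_FILES = (
    "newsqa-data-v1.csv",
    "newsqa-data-v1.tar.gz",
    "cnn_stories.tgz",
)


def report_dataset_files(data_dir):
    """Map each candidate file name to whether it exists in data_dir."""
    data_dir = Path(data_dir)
    return {name: (data_dir / name).is_file() for name in CANDIDATE_FILES}


if __name__ == "__main__":
    # Run from the root of the repo.
    for name, present in report_dataset_files("maluuba/newsqa").items():
        status = "found" if present else "missing"
        print(f"{status:8}{name}")
```

Running the same check inside the container against /usr/src/newsqa/maluuba/newsqa would show whether the mount exposes the same files.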

GustavoJE commented 3 years ago

If I add the tar.gz file, it works as expected. It was my mistake, sorry. Thanks again!

juharris commented 3 years ago

That's weird that it didn't work with the .csv but I'm glad it works for you now!