abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License

Error in tokenizing the CNN and Daily Mail stories #8

Closed JafferWilson closed 7 years ago

JafferWilson commented 7 years ago

Here is the error.

python make_datafiles.py cnn/stories/ dailymail/stories/
Preparing to tokenize cnn/stories/ to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories/ and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
    at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
    at java.io.BufferedWriter.write(BufferedWriter.java:221)
    at java.io.Writer.write(Writer.java:157)
    at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
    at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
    at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
  File "make_datafiles.py", line 238, in <module>
    tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
  File "make_datafiles.py", line 86, in tokenize_stories
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories/ (which has 92579 files). Was there an error during tokenization?

Kindly help me.
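For reference, the exception at the bottom of the traceback comes from a sanity check in make_datafiles.py: after CoreNLP runs, the tokenized directory must contain exactly one file per input story. A minimal sketch of that check (names follow the traceback; the real script may differ in detail):

```python
import os

def check_tokenized(stories_dir, tokenized_stories_dir):
    # Sketch of the sanity check in make_datafiles.py: tokenization must
    # produce exactly one output file per input story, so the two
    # directories should contain the same number of files.
    num_orig = len(os.listdir(stories_dir))
    num_tokenized = len(os.listdir(tokenized_stories_dir))
    if num_tokenized != num_orig:
        raise Exception(
            "The tokenized stories directory %s contains %i files, but it "
            "should contain the same number as %s (which has %i files). "
            "Was there an error during tokenization?"
            % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
```

Here the CoreNLP process died partway through (the Java "Stream closed" exception above), leaving only 1 of 92579 expected files, so the check fires.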

tanaka-jp commented 7 years ago

I got this error too

tanaka-jp commented 7 years ago

The problem was solved. Don't use stanford-corenlp-full-2017-06-09. Use stanford-corenlp-full-2016-10-31.
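One quick way to confirm which release is actually on your CLASSPATH is to parse the jar name: the 2016-10-31 distribution ships stanford-corenlp-3.7.0.jar, while 2017-06-09 ships 3.8.0. This helper is purely illustrative (it is not part of make_datafiles.py):

```python
import os
import re

def corenlp_jar_version(classpath=None):
    # Illustrative helper, not part of make_datafiles.py: extract the
    # CoreNLP version from the jar name on the CLASSPATH, e.g.
    # ".../stanford-corenlp-3.7.0.jar" -> "3.7.0". Returns None if no
    # CoreNLP jar is found.
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")
    m = re.search(r"stanford-corenlp-(\d+\.\d+\.\d+)\.jar", classpath)
    return m.group(1) if m else None

# corenlp_jar_version("/opt/stanford-corenlp-full-2016-10-31/"
#                     "stanford-corenlp-3.7.0.jar")  # -> "3.7.0"
```

If this reports 3.8.0, point CLASSPATH at the 2016-10-31 jar instead before rerunning make_datafiles.py.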

JafferWilson commented 7 years ago

OK, I will try it out. Just for my information, can you please tell me the difference between the stanford-corenlp-full-2017-06-09 and stanford-corenlp-full-2016-10-31 jars? I thought that, being Stanford NLP jars, they would be essentially the same. Kindly let me know exactly what difference makes the 2016 version work where the 2017 version does not.

tanaka-jp commented 7 years ago

If everything were the same, there would be no need for an upgrade.

JafferWilson commented 7 years ago

@tanaka-jp I still have the issue I mentioned above. Please tell me what I need to do to make it work properly.

comckay commented 7 years ago

You do indeed need to use the specified version of CoreNLP (3.7); see here: https://github.com/stanfordnlp/CoreNLP/issues/460

You will experience this error if you use 3.8.

JafferWilson commented 7 years ago

@comckay What is the reason? Why can't I use 3.8?

comckay commented 7 years ago

There is a known and acknowledged bug in the source code for that release. If you look at the referenced thread, you will see the reporter says:

I looked at the file PTBTokenizer.java, and looks like the output file is not being closed anywhere and apparently that is causing this exception.

The author has said he will fix what is currently on GitHub, so if you want the bleeding edge you can download and compile from source. Otherwise, you can use the prebuilt 3.7 release.
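As an illustrative analogue of the failure mode (this is not the CoreNLP source): writing to a stream after it has been closed fails immediately. Java's BufferedWriter raises IOException("Stream closed"), which is exactly the exception in the traceback above; Python raises ValueError in the same situation.

```python
import io

# Illustrative analogue of the bug, not the CoreNLP code: once a stream
# is closed, any further write fails. In Java this surfaces as
# IOException("Stream closed"); in Python it is a ValueError.
buf = io.StringIO()
buf.write("first story\n")
buf.close()
try:
    buf.write("second story\n")
except ValueError:
    print("write after close raised ValueError")
```

In the 3.8.0 tokenizer the writer's lifecycle was mismanaged, so the batch run aborted after the first file, which matches the "1 files" count reported by make_datafiles.py.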