abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License
634 stars 306 forks source link

error while running make_datafile.py #16

Open 97yogitha opened 7 years ago

97yogitha commented 7 years ago

@abisee this is the error I get when I run the command python make_datafiles.py cnn/stories dailymail/stories:

Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
    at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
    at java.io.BufferedWriter.write(BufferedWriter.java:221)
    at java.io.Writer.write(Writer.java:157)
    at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
    at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
    at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
  File "make_datafiles.py", line 235, in <module>
    tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
  File "make_datafiles.py", line 86, in tokenize_stories
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?
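(For reference, the sanity check that raises this exception amounts to counting files before and after tokenization; this is a sketch of the logic in make_datafiles.py, not the exact source:)

```python
import os

def check_tokenization(stories_dir, tokenized_stories_dir):
    # Sketch of the check in make_datafiles.py: after the Stanford
    # tokenizer runs, the output directory must hold exactly one file
    # per input story; a mismatch means tokenization aborted partway.
    num_orig = len(os.listdir(stories_dir))
    num_tokenized = len(os.listdir(tokenized_stories_dir))
    if num_tokenized != num_orig:
        raise Exception(
            "The tokenized stories directory %s contains %i files, but it "
            "should contain the same number as %s (which has %i files). "
            "Was there an error during tokenization?"
            % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
```

So "contains 1 files" means the Java tokenizer died after its first output file; the real failure is the "Stream closed" exception above it.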
JafferWilson commented 7 years ago

Please let me know: are you using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar, or the 2017 release? This error mostly occurs when you are not using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar. Please check.

ibarrien commented 7 years ago

I had a similar issue, though not sure if it's the same cause. See: https://github.com/abisee/cnn-dailymail/issues/12


JafferWilson commented 7 years ago

I have already created the processed files, so you can use them without any issue. Here is the link: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail (use Python 2.7).

97yogitha commented 7 years ago

@JafferWilson Yes, I am using stanford-corenlp-full-2017-09-0/stanford-corenlp-3.8.0.jar. I will use the processed file.

JafferWilson commented 7 years ago

@97yogitha No, do not use the 2017 one. Use the 2016 release, which is mentioned in the README file of the repository.
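(A minimal setup sketch of that advice, assuming the 2016-10-31 CoreNLP release is already unpacked in the working directory:)

```shell
# Point the tokenizer at the 3.7.0 jar from the 2016-10-31 release,
# not the 3.8.0 jar from the 2017 release.
export CLASSPATH=stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
```

make_datafiles.py invokes java via this classpath, so setting it before running the script is what selects the jar version.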

IreneZihuiLi commented 6 years ago

@JafferWilson Thanks for the help. I used 3.7.0 from https://stanfordnlp.github.io/CoreNLP/history.html and it worked.

Neuqmiao commented 6 years ago

Thanks very much. I ran into this problem today with the newest version (3.8.0); after switching to 3.7.0 it worked.

JafferWilson commented 6 years ago

Could someone please close this issue?

Sharathnasa commented 6 years ago

@JafferWilson Could you help with running the neural network on our own data? How do we generate the .bin files for our articles?

I have a clear idea about tokenization, but what about the URL mapping? How is that done?
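(Regarding the URL mapping: in the original make_datafiles.py, each story file is named by the SHA1 hex digest of its source URL, so the url_lists files map one URL per line to one .story file. A Python 3 sketch of that hashing, adapted from the script:)

```python
import hashlib

def hashhex(url):
    # Adapted from make_datafiles.py: story files are named
    # <sha1(url)>.story, which is how the url_lists splits
    # (train/val/test) are matched to story files.
    h = hashlib.sha1()
    h.update(url.encode("utf-8"))  # Python 3: hash the encoded bytes
    return h.hexdigest()
```

For your own data you can skip the URL hashing entirely and name the .story files however you like, as long as each article stays paired with its abstract.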

dondon2475848 commented 6 years ago

Hi @Sharathnasa, you can clone the repository below: https://github.com/dondon2475848/make_datafiles_for_pgn and run

python make_datafiles.py  ./stories  ./output

This processes your test data into the binary format.
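(For what "the binary format" means: as far as I can tell from the pointer-generator pipeline, each .bin record is a serialized example prefixed with an 8-byte length. This sketch shows only that framing; building the serialized tf.Example with 'article'/'abstract' features is omitted:)

```python
import struct

def write_bin_file(serialized_examples, out_path):
    # Sketch of the .bin record framing assumed by the pointer-generator
    # data pipeline: each serialized example is written as an 8-byte
    # native-endian length followed by the raw bytes.
    with open(out_path, "wb") as f:
        for ex in serialized_examples:
            f.write(struct.pack("q", len(ex)))
            f.write(struct.pack("%ds" % len(ex), ex))
```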

ARNABKUMARPAN commented 4 years ago

Check subprocess.call(command): set the classpath with os.environ["CLASSPATH"]='stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar', then run it.
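(A sketch of that suggestion; the jar path assumes the 2016-10-31 release is unpacked in the working directory, and the command mirrors how make_datafiles.py invokes the tokenizer:)

```python
import os
import subprocess

# Set the classpath before the subprocess.call(command) inside
# make_datafiles.py runs, so java picks up the 3.7.0 jar.
os.environ["CLASSPATH"] = "stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar"

def tokenizer_command(mapping_file):
    # make_datafiles.py invokes the tokenizer roughly like this, where
    # mapping_file lists "input<TAB>output" path pairs, one per line.
    return ["java", "edu.stanford.nlp.process.PTBTokenizer",
            "-ioFileList", "-preserveLines", mapping_file]

# subprocess.call(tokenizer_command("mapping.txt")) would then run the
# tokenizer with the jar on the classpath.
```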