abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License

Naming convention in tokenized dir #12

Closed ibarrien closed 7 years ago

ibarrien commented 7 years ago

Question: What should the output filenames produced by `tokenize_stories()` look like? It seems that `write_to_bin()` expects hashed names in this directory, but `tokenize_stories()` (i.e. PTBTokenizer) does not produce them directly.

Context: On macOS, using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar.

Example: If one of the "stories" input filenames is "A", then after `tokenize_stories()` runs, a file called "A" appears in the corresponding tokenized_stories_dir, as opposed to hashhex("A").
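
For reference, the hashing `write_to_bin()` expects comes from the repo's `hashhex()` helper, a hex-encoded SHA-1 digest of each story URL. A minimal sketch of the mismatch, with the filename "A" standing in for whatever string gets hashed:

```python
import hashlib

def hashhex(s):
    # Hex-encoded SHA-1 digest, as in make_datafiles.py's hashhex().
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

print(hashhex("A"))  # a 40-char hex name, the kind write_to_bin() looks for
# ...whereas tokenize_stories() leaves the output file literally named "A".
```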

It seems PTBTokenizer is working (at least partially), since the tokenized "A" does have, for example, spaces around punctuation marks and -LRB- for left parentheses.

Outlook: Specifically, `write_to_bin()` contains `story_fnames = [s + ".story" for s in url_hashes]`. However, if `tokenize_stories()` does not produce hashed names, then one "fix" is `story_fnames = [s + ".story" for s in url_list]`.
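
If patching `write_to_bin()` is undesirable, a hypothetical alternative is a one-off rename of the tokenized outputs, assuming each story file is named with the exact string that `hashhex()` would otherwise be applied to (i.e. its URL):

```python
import os
import hashlib

def hashhex(s):
    # Hex-encoded SHA-1 digest, as in make_datafiles.py's hashhex().
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

tokenized_stories_dir = "cnn_stories_tokenized"  # hypothetical path
for fname in os.listdir(tokenized_stories_dir):
    base, ext = os.path.splitext(fname)
    os.rename(os.path.join(tokenized_stories_dir, fname),
              os.path.join(tokenized_stories_dir, hashhex(base) + ext))
```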

JafferWilson commented 7 years ago

Are you using the Stanford 2016 parser or the 2017 parser? If you want to know what the files look like after tokenization, you can check this repository: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail. You can check the links and download the required data for training. Let me know if there is any issue.

ibarrien commented 7 years ago

Hello. The 2016 parser, specifically from the stanford-corenlp-full-2016-10-31 package. PTBTokenizer does not output hashed filenames, nor does it appear to be intended to. Rather, it writes each output file under exactly the name supplied in the file list passed via the `-ioFileList` option. E.g., in the following lines from `tokenize_stories()` there is no hashing; hence, the fix I mentioned above is sufficient in this case.

```python
  stories = os.listdir(stories_dir)  # the filenames in stories_dir are not assumed to be hashed
  # make IO list file
  print "Making list of files to tokenize..."
  with open("mapping.txt", "w") as f:
    for s in stories:
      f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
```
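
For completeness, `tokenize_stories()` then hands `mapping.txt` to PTBTokenizer, which writes each tokenized story to exactly the output path given in the second column, with no renaming. Roughly, the invocation as I understand it from make_datafiles.py:

```python
import subprocess

# PTBTokenizer reads "input \t output" pairs from mapping.txt and writes the
# tokenized text of each input to the corresponding output path verbatim.
command = ['java', 'edu.stanford.nlp.process.PTBTokenizer',
           '-ioFileList', '-preserveCase', 'mapping.txt']
subprocess.call(command)
```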