abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License

make_datafiles.py issue #34

Open zixiliuUSC opened 4 years ago

zixiliuUSC commented 4 years ago

I ran make_datafiles.py to generate raw text files for BART preprocessing, but I hit the following error:

python make_datafiles.py ./cnn/stories ./dailymail/stories/
Making bin file for URLs listed in url_lists/all_test.txt...
Traceback (most recent call last):
  File "make_datafiles.py", line 138, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test"))
  File "make_datafiles.py", line 84, in write_to_bin
    url_list = read_text_file(url_file)
  File "make_datafiles.py", line 26, in read_text_file
    with open(text_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'url_lists/all_test.txt'

I assumed this was because all_test_urls does not point to the URL file that ships with the dataset, i.e. wayback_test_urls.txt. So I renamed that file to all_test.txt and put it in the folder ./cnn/url_lists, but the code still gave the same error. I then checked the source again and found that the following line seems to be the problem:

url_list = read_text_file(url_file)

I changed it to:

url_list = read_text_file(os.path.join('./cnn', url_file))

With this change, I think all the source and target files are generated from the CNN dataset only. Am I right?
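The path in the traceback, 'url_lists/all_test.txt', is relative, so Python resolves it against the current working directory, not against the location of make_datafiles.py or of the story folders. A minimal sketch of one way to make that lookup explicit follows. The helper name resolve_url_file and its base_dir parameter are hypothetical and are not part of make_datafiles.py. The read_text_file shown here is a stand-in inferred from the traceback, not a copy of the script's function.

```python
import os


def resolve_url_file(url_file, base_dir):
    """Resolve url_file against base_dir instead of the current working directory.

    Hypothetical helper, not part of make_datafiles.py.
    """
    if os.path.isabs(url_file):
        path = url_file
    else:
        path = os.path.join(base_dir, url_file)
    path = os.path.normpath(path)
    if not os.path.isfile(path):
        # Fail early with a message that names the directory that was searched.
        raise FileNotFoundError(
            "URL list not found: %s (resolved relative to %s)" % (path, base_dir)
        )
    return path


def read_text_file(text_file):
    """Stand-in reader: return the stripped lines of a text file."""
    with open(text_file, "r") as f:
        return [line.strip() for line in f]


# Possible use inside a script, assuming the url_lists folder sits next to it:
#   script_dir = os.path.dirname(os.path.abspath(__file__))
#   url_list = read_text_file(resolve_url_file("url_lists/all_test.txt", script_dir))
```

With a helper like this, the base directory is chosen in one place. Passing './cnn' as base_dir would reproduce the edit described above, which reads only the URL list stored under the CNN folder.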