abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License
635 stars 306 forks

Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer #28

Open TianlinZhang668 opened 5 years ago

TianlinZhang668 commented 5 years ago

I ran make_datafiles.py, but it fails with this error:

Preparing to tokenize /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in /home/ztl/Downloads/cnn_stories/cnn/stories and saving in cnn_stories_tokenized...
Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.process.PTBTokenizer
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):

However, I can run echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer as root. I don't know how to deal with this. Thanks a lot.

TianlinZhang668 commented 5 years ago

I am running corenlp-3.9.2.jar.

ubaidsworld commented 5 years ago

You need stanford-corenlp-3.7.0.jar. See this: https://github.com/abisee/cnn-dailymail#2-download-stanford-corenlp Please read the README.md file.
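For reference, the fix described in that README section is to point CLASSPATH at the 3.7.0 jar before running the script. The path below is an example and depends on where you unpacked the CoreNLP download:

```shell
# Put the Stanford CoreNLP 3.7.0 jar on the Java classpath
# (adjust the path to wherever you unpacked CoreNLP -- this is an example).
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

# Verify the tokenizer is now found; this should print the sentence's
# tokens, one per line, instead of the ClassNotFoundException.
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
```

Note that the export only affects the current shell session; add it to your shell profile (or re-source it) to make it persistent.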

TianlinZhang668 commented 5 years ago

Successfully finished tokenizing /home/ztl/Downloads/cnn_stories/cnn/stories to cnn_stories_tokenized.

Making bin file for URLs listed in url_lists/all_test.txt...
Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 154, in write_to_bin
    url_hashes = get_url_hashes(url_list)
  File "make_datafiles.py", line 106, in get_url_hashes
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 106, in <listcomp>
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 101, in hashhex
    h.update(s)
TypeError: Unicode-objects must be encoded before hashing

I have got the files tokenized, but now this next error appears.
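That TypeError is a Python 2 vs. 3 issue: in Python 3, hashlib only accepts bytes, so the string must be encoded before hashing. A minimal sketch of the fix, mirroring the hashhex and get_url_hashes helpers named in the traceback (the bodies here are reconstructed for illustration, not copied from the repo):

```python
import hashlib

def hashhex(s):
    """Return the SHA-1 hex digest of a string.

    Under Python 3, update() requires bytes; calling h.update(s) on a
    str raises "TypeError: Unicode-objects must be encoded before
    hashing", which is the error in the traceback above.
    """
    h = hashlib.sha1()
    h.update(s.encode("utf-8"))  # encode str -> bytes before hashing
    return h.hexdigest()

def get_url_hashes(url_list):
    # One hash per story URL; the script uses these to match story files
    return [hashhex(url) for url in url_list]
```

Running the original script under Python 2, where str is already bytes, also avoids the error without code changes.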

JafferWilson commented 5 years ago

Try this: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail I guess it will solve your tokenization and other remaining issues.

quanghuynguyen1902 commented 5 years ago

What if the content of my article doesn't follow the same structure as the CNN articles?

JafferWilson commented 5 years ago

@quanghuynguyen1902 I guess you have already opened a new issue for this: https://github.com/abisee/cnn-dailymail/issues/29 Let's go there. Could someone please close this issue?

mooncrater31 commented 4 years ago

I am facing the same issue here.

SpaceTime1999 commented 3 years ago

source ./.bash_profile