bheinzerling / pyrouge

A Python wrapper for the ROUGE summarization evaluation package
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 947: invalid continuation byte #11

Closed 85405115 closed 7 years ago

85405115 commented 7 years ago

I ran these commands successfully:

git clone https://github.com/bheinzerling/pyrouge
cd pyrouge
python setup.py install
pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5/directory
python -m pyrouge.test

The last command printed this:

Ran 11 tests in 6.322s OK

Then I ran this piece of code:

from pyrouge import Rouge155
r = Rouge155()
r.system_dir = "/home/afsharizadeh/Desktop/summarization/summarization_dataset/DUC_2007/2007/all_sum/system_sum/"
r.model_dir = "/home/afsharizadeh/Desktop/summarization/summarization_dataset/DUC_2007/2007/all_sum/ref_sum/"
r.system_filename_pattern = r'sum.(\d+).txt'
r.model_filename_pattern = 'sum.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

Then I got this error:

2017-08-27 17:22:12,119 [MainThread  ] [INFO ]  Writing summaries.
2017-08-27 17:22:12,121 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmp192ti5r6/system and model files to /tmp/tmp192ti5r6/model.
2017-08-27 17:22:12,123 [MainThread  ] [INFO ]  Processing files in /home/afsharizadeh/Desktop/summarization/summarization_dataset/DUC_2007/2007/all_sum/system_sum/.
2017-08-27 17:22:12,125 [MainThread  ] [INFO ]  Processing sum.033.txt.
2017-08-27 17:22:12,128 [MainThread  ] [INFO ]  Processing sum.001.txt.
2017-08-27 17:22:12,130 [MainThread  ] [INFO ]  Processing sum.026.txt.
2017-08-27 17:22:12,132 [MainThread  ] [INFO ]  Processing sum.036.txt.
2017-08-27 17:22:12,135 [MainThread  ] [INFO ]  Processing sum.042.txt.
2017-08-27 17:22:12,137 [MainThread  ] [INFO ]  Processing sum.029.txt.
2017-08-27 17:22:12,139 [MainThread  ] [INFO ]  Processing sum.021.txt.
2017-08-27 17:22:12,141 [MainThread  ] [INFO ]  Processing sum.022.txt.
2017-08-27 17:22:12,144 [MainThread  ] [INFO ]  Processing sum.008.txt.
2017-08-27 17:22:12,146 [MainThread  ] [INFO ]  Processing sum.005.txt.
2017-08-27 17:22:12,150 [MainThread  ] [INFO ]  Processing sum.003.txt.
2017-08-27 17:22:12,152 [MainThread  ] [INFO ]  Processing sum.004.txt.
2017-08-27 17:22:12,155 [MainThread  ] [INFO ]  Processing sum.037.txt.
2017-08-27 17:22:12,159 [MainThread  ] [INFO ]  Processing sum.009.txt.
2017-08-27 17:22:12,162 [MainThread  ] [INFO ]  Processing sum.010.txt.
2017-08-27 17:22:12,165 [MainThread  ] [INFO ]  Processing sum.031.txt.
2017-08-27 17:22:12,168 [MainThread  ] [INFO ]  Processing sum.032.txt.
2017-08-27 17:22:12,171 [MainThread  ] [INFO ]  Processing sum.034.txt.
2017-08-27 17:22:12,174 [MainThread  ] [INFO ]  Processing sum.038.txt.
2017-08-27 17:22:12,177 [MainThread  ] [INFO ]  Processing sum.044.txt.
2017-08-27 17:22:12,179 [MainThread  ] [INFO ]  Processing sum.023.txt.
2017-08-27 17:22:12,182 [MainThread  ] [INFO ]  Processing sum.043.txt.
2017-08-27 17:22:12,185 [MainThread  ] [INFO ]  Processing sum.045.txt.
2017-08-27 17:22:12,187 [MainThread  ] [INFO ]  Processing sum.014.txt.
2017-08-27 17:22:12,190 [MainThread  ] [INFO ]  Processing sum.017.txt.
2017-08-27 17:22:12,193 [MainThread  ] [INFO ]  Processing sum.040.txt.
2017-08-27 17:22:12,195 [MainThread  ] [INFO ]  Processing sum.027.txt.
2017-08-27 17:22:12,198 [MainThread  ] [INFO ]  Processing sum.015.txt.
2017-08-27 17:22:12,200 [MainThread  ] [INFO ]  Processing sum.041.txt.
2017-08-27 17:22:12,203 [MainThread  ] [INFO ]  Processing sum.002.txt.
2017-08-27 17:22:12,205 [MainThread  ] [INFO ]  Processing sum.028.txt.
2017-08-27 17:22:12,208 [MainThread  ] [INFO ]  Processing sum.012.txt.
2017-08-27 17:22:12,210 [MainThread  ] [INFO ]  Processing sum.020.txt.
2017-08-27 17:22:12,213 [MainThread  ] [INFO ]  Processing sum.025.txt.
2017-08-27 17:22:12,216 [MainThread  ] [INFO ]  Processing sum.024.txt.
2017-08-27 17:22:12,218 [MainThread  ] [INFO ]  Processing sum.035.txt.
2017-08-27 17:22:12,221 [MainThread  ] [INFO ]  Processing sum.030.txt.
2017-08-27 17:22:12,224 [MainThread  ] [INFO ]  Processing sum.039.txt.
2017-08-27 17:22:12,226 [MainThread  ] [INFO ]  Processing sum.019.txt.
2017-08-27 17:22:12,229 [MainThread  ] [INFO ]  Processing sum.016.txt.
2017-08-27 17:22:12,232 [MainThread  ] [INFO ]  Processing sum.007.txt.
2017-08-27 17:22:12,234 [MainThread  ] [INFO ]  Processing sum.006.txt.
2017-08-27 17:22:12,237 [MainThread  ] [INFO ]  Processing sum.018.txt.
2017-08-27 17:22:12,241 [MainThread  ] [INFO ]  Processing sum.013.txt.
2017-08-27 17:22:12,243 [MainThread  ] [INFO ]  Processing sum.011.txt.
2017-08-27 17:22:12,246 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmp192ti5r6/system.
2017-08-27 17:22:12,248 [MainThread  ] [INFO ]  Processing files in /home/afsharizadeh/Desktop/summarization/summarization_dataset/DUC_2007/2007/all_sum/ref_sum/.
2017-08-27 17:22:12,251 [MainThread  ] [INFO ]  Processing sum.C.027.txt.
2017-08-27 17:22:12,253 [MainThread  ] [INFO ]  Processing sum.D.014.txt.
2017-08-27 17:22:12,256 [MainThread  ] [INFO ]  Processing sum.C.018.txt.
2017-08-27 17:22:12,258 [MainThread  ] [INFO ]  Processing sum.B.005.txt.
2017-08-27 17:22:12,260 [MainThread  ] [INFO ]  Processing sum.A.019.txt.
2017-08-27 17:22:12,263 [MainThread  ] [INFO ]  Processing sum.B.004.txt.
2017-08-27 17:22:12,265 [MainThread  ] [INFO ]  Processing sum.A.007.txt.
2017-08-27 17:22:12,268 [MainThread  ] [INFO ]  Processing sum.C.008.txt.
2017-08-27 17:22:12,270 [MainThread  ] [INFO ]  Processing sum.A.013.txt.
2017-08-27 17:22:12,272 [MainThread  ] [INFO ]  Processing sum.A.003.txt.
2017-08-27 17:22:12,277 [MainThread  ] [INFO ]  Processing sum.C.020.txt.
2017-08-27 17:22:12,280 [MainThread  ] [INFO ]  Processing sum.A.021.txt.
2017-08-27 17:22:12,283 [MainThread  ] [INFO ]  Processing sum.D.012.txt.
2017-08-27 17:22:12,286 [MainThread  ] [INFO ]  Processing sum.C.019.txt.
2017-08-27 17:22:12,289 [MainThread  ] [INFO ]  Processing sum.A.033.txt.
2017-08-27 17:22:12,291 [MainThread  ] [INFO ]  Processing sum.C.042.txt.
2017-08-27 17:22:12,294 [MainThread  ] [INFO ]  Processing sum.B.032.txt.
2017-08-27 17:22:12,297 [MainThread  ] [INFO ]  Processing sum.C.029.txt.
2017-08-27 17:22:12,299 [MainThread  ] [INFO ]  Processing sum.D.002.txt.
2017-08-27 17:22:12,302 [MainThread  ] [INFO ]  Processing sum.A.045.txt.
2017-08-27 17:22:12,304 [MainThread  ] [INFO ]  Processing sum.B.001.txt.
2017-08-27 17:22:12,306 [MainThread  ] [INFO ]  Processing sum.D.037.txt.
2017-08-27 17:22:12,309 [MainThread  ] [INFO ]  Processing sum.A.004.txt.
2017-08-27 17:22:12,311 [MainThread  ] [INFO ]  Processing sum.B.014.txt.
2017-08-27 17:22:12,314 [MainThread  ] [INFO ]  Processing sum.C.026.txt.
2017-08-27 17:22:12,317 [MainThread  ] [INFO ]  Processing sum.A.031.txt.
2017-08-27 17:22:12,319 [MainThread  ] [INFO ]  Processing sum.D.005.txt.
2017-08-27 17:22:12,322 [MainThread  ] [INFO ]  Processing sum.B.038.txt.
2017-08-27 17:22:12,329 [MainThread  ] [INFO ]  Processing sum.B.027.txt.
2017-08-27 17:22:12,332 [MainThread  ] [INFO ]  Processing sum.C.010.txt.
2017-08-27 17:22:12,335 [MainThread  ] [INFO ]  Processing sum.B.041.txt.
2017-08-27 17:22:12,338 [MainThread  ] [INFO ]  Processing sum.C.030.txt.
2017-08-27 17:22:12,341 [MainThread  ] [INFO ]  Processing sum.B.007.txt.
2017-08-27 17:22:12,343 [MainThread  ] [INFO ]  Processing sum.C.023.txt.
2017-08-27 17:22:12,346 [MainThread  ] [INFO ]  Processing sum.C.002.txt.
2017-08-27 17:22:12,349 [MainThread  ] [INFO ]  Processing sum.B.033.txt.
2017-08-27 17:22:12,351 [MainThread  ] [INFO ]  Processing sum.D.023.txt.
2017-08-27 17:22:12,354 [MainThread  ] [INFO ]  Processing sum.C.014.txt.
2017-08-27 17:22:12,356 [MainThread  ] [INFO ]  Processing sum.D.007.txt.
2017-08-27 17:22:12,359 [MainThread  ] [INFO ]  Processing sum.D.008.txt.
2017-08-27 17:22:12,363 [MainThread  ] [INFO ]  Processing sum.D.032.txt.
2017-08-27 17:22:12,366 [MainThread  ] [INFO ]  Processing sum.C.005.txt.
2017-08-27 17:22:12,368 [MainThread  ] [INFO ]  Processing sum.B.023.txt.
2017-08-27 17:22:12,371 [MainThread  ] [INFO ]  Processing sum.B.035.txt.
2017-08-27 17:22:12,374 [MainThread  ] [INFO ]  Processing sum.A.016.txt.
2017-08-27 17:22:12,377 [MainThread  ] [INFO ]  Processing sum.D.001.txt.
2017-08-27 17:22:12,380 [MainThread  ] [INFO ]  Processing sum.C.017.txt.
2017-08-27 17:22:12,382 [MainThread  ] [INFO ]  Processing sum.A.010.txt.
2017-08-27 17:22:12,385 [MainThread  ] [INFO ]  Processing sum.B.021.txt.
2017-08-27 17:22:12,387 [MainThread  ] [INFO ]  Processing sum.B.010.txt.
2017-08-27 17:22:12,391 [MainThread  ] [INFO ]  Processing sum.A.009.txt.
2017-08-27 17:22:12,394 [MainThread  ] [INFO ]  Processing sum.A.011.txt.
2017-08-27 17:22:12,396 [MainThread  ] [INFO ]  Processing sum.D.045.txt.
2017-08-27 17:22:12,399 [MainThread  ] [INFO ]  Processing sum.A.018.txt.
2017-08-27 17:22:12,401 [MainThread  ] [INFO ]  Processing sum.D.028.txt.
2017-08-27 17:22:12,403 [MainThread  ] [INFO ]  Processing sum.D.020.txt.
2017-08-27 17:22:12,406 [MainThread  ] [INFO ]  Processing sum.C.031.txt.
2017-08-27 17:22:12,408 [MainThread  ] [INFO ]  Processing sum.B.025.txt.
2017-08-27 17:22:12,411 [MainThread  ] [INFO ]  Processing sum.C.028.txt.

2017-08-27 17:22:12,414 [MainThread  ] [INFO ]  Processing sum.D.009.txt.
2017-08-27 17:22:12,417 [MainThread  ] [INFO ]  Processing sum.C.013.txt.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-b3bc5a66e7f0> in <module>()
      6 r.model_filename_pattern = 'sum.[A-Z].#ID#.txt'
      7 
----> 8 output = r.convert_and_evaluate()
      9 print(output)
     10 output_dict = r.output_to_dict(output)

/home/afsharizadeh/anaconda3/lib/python3.6/site-packages/pyrouge/Rouge155.py in convert_and_evaluate(self, system_id, split_sentences, rouge_args)
    358         if split_sentences:
    359             self.split_sentences()
--> 360         self.__write_summaries()
    361         rouge_output = self.evaluate(system_id, rouge_args)
    362         return rouge_output

/home/afsharizadeh/anaconda3/lib/python3.6/site-packages/pyrouge/Rouge155.py in __write_summaries(self)
    487     def __write_summaries(self):
    488         self.log.info("Writing summaries.")
--> 489         self.__process_summaries(self.convert_summaries_to_rouge_format)
    490 
    491     @staticmethod

/home/afsharizadeh/anaconda3/lib/python3.6/site-packages/pyrouge/Rouge155.py in __process_summaries(self, process_func)
    481             "model files to {}.".format(new_system_dir, new_model_dir))
    482         process_func(self._system_dir, new_system_dir)
--> 483         process_func(self._model_dir, new_model_dir)
    484         self._system_dir = new_system_dir
    485         self._model_dir = new_model_dir

/home/afsharizadeh/anaconda3/lib/python3.6/site-packages/pyrouge/Rouge155.py in convert_summaries_to_rouge_format(input_dir, output_dir)
    200         """
    201         DirectoryProcessor.process(
--> 202             input_dir, output_dir, Rouge155.convert_text_to_rouge_format)
    203 
    204     @staticmethod

/home/afsharizadeh/anaconda3/lib/python3.6/site-packages/pyrouge/utils/file_utils.py in process(input_dir, output_dir, function)
     27             input_file = os.path.join(input_dir, input_file_name)
     28             with codecs.open(input_file, "r", encoding="UTF-8") as f:
---> 29                 input_string = f.read()
     30             output_string = function(input_string)
     31             output_file = os.path.join(output_dir, input_file_name)

/home/afsharizadeh/anaconda3/lib/python3.6/codecs.py in read(self, size)
    696     def read(self, size=-1):
    697 
--> 698         return self.reader.read(size)
    699 
    700     def readline(self, size=None):

/home/afsharizadeh/anaconda3/lib/python3.6/codecs.py in read(self, size, chars, firstline)
    499                 break
    500             try:
--> 501                 newchars, decodedbytes = self.decode(data, self.errors)
    502             except UnicodeDecodeError as exc:
    503                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 947: invalid continuation byte

What is wrong?

bheinzerling commented 7 years ago

pyrouge opens your summary files on the assumption that they are encoded in UTF-8, but at least one of them is in a different encoding. According to your traceback, the failure happens while pyrouge is reading the model directory. The easiest way to solve this is to save your files in UTF-8 before running pyrouge. Alternatively, if you don't want to do that, you could modify the pyrouge source code: in pyrouge/utils/file_utils.py, the codecs.open call shown in your traceback passes encoding="UTF-8"; replace that with encoding="YOUR_ENCODING".
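
For readers who would rather convert the files than patch pyrouge, here is a minimal sketch of the "save your files in UTF-8" option. It is not part of pyrouge. The function name reencode_to_utf8 is hypothetical, and the default source encoding of Latin-1 is only an assumption (byte 0xe9 is "é" in Latin-1 and Windows-1252); check what your files actually use before running it, and work on a copy of the directory because files are rewritten in place.

```python
import os


def reencode_to_utf8(directory, source_encoding="latin-1"):
    """Rewrite, in place, every file in `directory` that is not valid UTF-8.

    `source_encoding` is an ASSUMPTION about the original encoding;
    a wrong guess will silently produce wrong characters.
    Returns the sorted names of the files that were rewritten.
    """
    converted = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            raw = f.read()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, leave the file untouched
        except UnicodeDecodeError:
            pass
        text = raw.decode(source_encoding)
        # newline="" keeps the original line endings unchanged
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        converted.append(name)
    return converted
```

Calling reencode_to_utf8 once on the system directory and once on the model directory, before constructing Rouge155, should be enough; a second call on an already converted directory returns an empty list.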

85405115 commented 7 years ago

I saved my files in UTF-8 encoding and the problem was solved. Thanks.
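
To check a directory before running pyrouge, something like the following sketch could list the files that still fail to decode as UTF-8, together with the byte offset and byte value that the UnicodeDecodeError would report. The helper name find_non_utf8_files is hypothetical and not part of pyrouge.

```python
import os


def find_non_utf8_files(directory):
    """Return (file name, byte offset, offending byte) for each file
    in `directory` whose contents are not valid UTF-8."""
    problems = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            raw = f.read()
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte that could not be decoded
            problems.append((name, exc.start, raw[exc.start]))
    return problems
```

An empty result for both the system and the model directory means that pyrouge's UTF-8 read should no longer raise this error.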