blingenf / copydetect

Code plagiarism detection tool
MIT License

UnicodeEncodeError in report generation #30

Closed: omerchor closed this issue 1 year ago

omerchor commented 1 year ago

I'm getting the following traceback after "Code comparison completed":

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\__main__.py", line 120, in <module>
    main()
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\__main__.py", line 117, in main
    detector.generate_html_report()
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\detector.py", line 668, in generate_html_report
    report_f.write(output)
  File "C:\Program Files\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112029-112032: character maps to <undefined>

I believe this happens because some of the code (string literals) contains non-Latin characters: when I remove these strings, the report is generated as expected.
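The failure mode can be reproduced in isolation. This is only a minimal sketch: the sample string is hypothetical (not taken from the actual report) and `can_encode` is a helper introduced here for illustration. It checks whether text containing Hebrew characters can be encoded with `cp1252`, the codec named in the traceback, versus UTF-8.

```python
# Hypothetical report fragment containing Hebrew characters, written with
# escapes so the example does not depend on the editor's encoding.
sample = 'print("\u05e9\u05dc\u05d5\u05dd")'


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be encoded with the codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 has no mapping for Hebrew letters, so encoding fails;
# UTF-8 can represent any Unicode code point.
print("cp1252:", can_encode(sample, "cp1252"))
print("utf-8:", can_encode(sample, "utf-8"))
```

A file opened in text mode without an explicit `encoding` uses the locale's preferred encoding, which is why the same write can succeed on one machine and fail on another.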

blingenf commented 1 year ago

Did you install from pip or from GitHub? I just merged a PR that I think might fix this (#29), but it hasn't been included in a release uploaded to PyPI yet.

Windows apparently does not always use UTF-8 as its default encoding (your traceback shows cp1252), so the encoding has to be specified explicitly for file I/O operations. The current version of copydetect on PyPI doesn't do this when writing the output report, which I think is what's causing this crash.
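For illustration, the general pattern looks roughly like this. It is a sketch of passing `encoding="utf-8"` to `open()`, not copydetect's actual code: `write_report`, the sample HTML, and the temporary output path are all assumptions made for the example.

```python
import os
import tempfile


def write_report(path: str, html: str) -> None:
    """Write the report as UTF-8 regardless of the platform's default encoding."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which may be cp1252 on Windows.
    with open(path, "w", encoding="utf-8") as report_f:
        report_f.write(html)


# Hypothetical report body containing non-Latin (Hebrew) characters.
html = "<p>\u05e9\u05dc\u05d5\u05dd</p>"
out_path = os.path.join(tempfile.mkdtemp(), "report.html")
write_report(out_path, html)

# Read the file back with the same encoding to confirm a lossless round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == html)
```

Reading the file back must use the same encoding, otherwise the non-Latin characters may be decoded incorrectly or raise an error.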

omerchor commented 1 year ago

I installed from PyPI. It looks like that commit should indeed fix the problem. Thanks!

blingenf commented 1 year ago

The fix has now been uploaded to PyPI as copydetect==0.4.4. Let me know if you are still seeing the issue with the latest version.