exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them in a SQLite database; to create additional columns that augment the output where useful; and to query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License

Piping to file on windows produces charmap errors #88

Closed kieranjol closed 1 year ago

kieranjol commented 2 years ago

If my input contains non-ASCII characters, I can't pipe to a file on Windows. I was able to get around this by adding .encode('utf-8') to this line: https://github.com/exponential-decay/demystify/blob/main/demystify.py#L131 , but this also adds newline characters to each line in the output. So is there a way to force UTF-8 but not have those newlines? One thing I could think of is to add an output option to demystify and perhaps just do the traditional:

with open('OUTPUT.HTML', 'w', encoding='utf8') as fo:
    fo.write(LOVELY_HTML)

>python demystify.py  --export C:\Users\k\yaml.yaml > out.html
2022-05-31 11:33:40 INFO: demystify.py:170:analysis_from_csv(): Generating database from input report...
2022-05-31 11:33:43 INFO: demystify.py:172:analysis_from_csv(): Database path: C:\Users\k\yaml.db
2022-05-31 11:33:43 INFO: demystify.py:146:analysis_from_database(): Analysis from database: C:\Users\k\yaml.db
2022-05-31 11:33:43 INFO: DemystifyAnalysisClass.py:1090:runanalysis(): Running analysis, Rogues or Heroes: False
2022-05-31 11:33:43 INFO: DemystifyAnalysisClass.py:781:queryDB(): Querying database
2022-05-31 11:33:43 INFO: DemystifyAnalysisClass.py:784:queryDB(): Unable to detect duplicates: No HASH algorithm used by identification tool
2022-05-31 11:33:45 INFO: demystify.py:128:handle_output(): Outputting HTML report
Traceback (most recent call last):
  File "demystify.py", line 250, in <module>
    main()
  File "demystify.py", line 244, in main
    handle_output(analysis.analysis_results, args.txt, args.rogues, args.heroes)
  File "demystify.py", line 131, in handle_output
    print(htmloutput.printHTMLResults())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 298-299: character maps to <undefined>
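
A possible reading of the two symptoms above, sketched under assumptions: when stdout is piped on Windows, Python may encode it with cp1252, which cannot represent every character, and passing `bytes` to `print()` prints their `repr`, which shows line breaks as a literal `\n`. The helper `print_utf8`, the fake stream, and the sample `report` string are hypothetical and are not part of demystify.

```python
import io
import sys


def print_utf8(text, stream=None):
    """Write text as UTF-8 bytes, bypassing the stream's own encoding."""
    stream = sys.stdout if stream is None else stream
    # Flush pending text first so the raw bytes below stay in order.
    stream.flush()
    stream.buffer.write(text.encode("utf-8") + b"\n")
    stream.buffer.flush()


# Simulate a piped Windows stdout: a text stream that encodes with cp1252.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="cp1252")

# Hypothetical stand-in for the HTML report; the CJK characters are not in cp1252.
report = "<p>文件.txt</p>\n<p>done</p>"

try:
    print(report, file=fake_stdout)
    fake_stdout.flush()
    failed = False
except UnicodeEncodeError:
    failed = True  # same class of error as in the traceback above

# print(report.encode("utf-8")) would print this repr, with a literal backslash-n.
as_printed = str(report.encode("utf-8"))

# Writing the encoded bytes to the underlying buffer avoids both problems.
print_utf8(report, fake_stdout)
```

On Python 3.7 and later, calling `sys.stdout.reconfigure(encoding="utf-8")` once at start-up is another option that leaves the existing `print()` calls unchanged.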

ross-spencer commented 2 years ago

🤔 thanks Kieran - this is a big thing I've been trying to get my head around with the Py3 work. I'll take a look at these options and let you know. Can you share with me part of the YAML here at all so I can add it to the unit tests or see why they're not picking this up?

kieranjol commented 2 years ago

So sorry about the delay! I've found a snippet that breaks both HTML and TXT. Attaching the YAML as a zip here too in case the copy-paste changes the encoding or something. This breaks even if I don't pipe to a file. siegfried.zip

---
siegfried   : 0.0.0
scandate    : 0001-01-01T00:00:00Z
signature   : 
created     : 0001-01-01T00:00:00Z
identifiers : 
  - name    : 'pronom'
    details : ''
---
filename : '/media/sequence1.iMovieProject/Media/._Icon'
filesize : 55430
modified : 2008-05-05T11:33:54Z
errors   : 
md5      : 6132356339393233393662353731353931663431346530663631636161376437
matches  :
  - ns      : 'pronom'
    id      : 'fmt/503'
    format  : 'AppleDouble Resource Fork'
    version : '2'
    mime    : 'multipart/appledouble'
    basis   : 'byte match at 0, 8'
    warning : 

Full terminal output

:\Users\koleary\Downloads>python C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py -txt -export C:\Users\koleary\Desktop\siegfried.yaml
usage: demystify.py [-h] [--export EXPORT] [--db DB] [--txt] [--denylist] [--rogues] [--heroes] [--denylist_template]
demystify.py: error: unrecognized arguments: -txt -export C:\Users\koleary\Desktop\siegfried.yaml

C:\Users\koleary\Downloads>python C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py --txt --export C:\Users\koleary\Desktop\siegfried.yaml
2022-08-10 10:22:58 INFO: demystify.py:170:analysis_from_csv(): Generating database from input report...
Traceback (most recent call last):
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 14, in <module>
    main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 10, in main
    demystify.main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 255, in main
    analysis = analysis_from_csv(
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 171, in analysis_from_csv
    database_path = sqlitefid.identify_and_process_input(format_report)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\sqlitefid.py", line 46, in identify_and_process_input
    type_ = id_.exportid(export)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\libs\IdentifyExportClass.py", line 72, in exportid
    droid_magic = f.readline()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 226: character maps to <undefined>

C:\Users\koleary\Downloads>python C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py --txt --export C:\Users\koleary\Desktop\siegfried.yaml
2022-08-10 10:23:02 INFO: demystify.py:170:analysis_from_csv(): Generating database from input report...
Traceback (most recent call last):
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 14, in <module>
    main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 10, in main
    demystify.main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 255, in main
    analysis = analysis_from_csv(
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 171, in analysis_from_csv
    database_path = sqlitefid.identify_and_process_input(format_report)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\sqlitefid.py", line 46, in identify_and_process_input
    type_ = id_.exportid(export)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\libs\IdentifyExportClass.py", line 72, in exportid
    droid_magic = f.readline()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 226: character maps to <undefined>

C:\Users\koleary\Downloads>python C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py  --export C:\Users\koleary\Desktop\siegfried.yaml
2022-08-10 10:23:15 INFO: demystify.py:170:analysis_from_csv(): Generating database from input report...
Traceback (most recent call last):
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 14, in <module>
    main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\demystify.py", line 10, in main
    demystify.main()
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 255, in main
    analysis = analysis_from_csv(
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\demystify.py", line 171, in analysis_from_csv
    database_path = sqlitefid.identify_and_process_input(format_report)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\sqlitefid.py", line 46, in identify_and_process_input
    type_ = id_.exportid(export)
  File "C:\Users\koleary\Downloads\demystify-v2.0.0rc1.tar\demystify-v0.0.0\demystify\src\demystify\sqlitefid\src\sqlitefid\libs\IdentifyExportClass.py", line 72, in exportid
    droid_magic = f.readline()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 226: character maps to <undefined>
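
The `UnicodeDecodeError` above is raised while reading the export, not while printing. A minimal sketch of one likely cause, assuming the file is UTF-8 and is opened without an explicit `encoding`, so that Windows falls back to cp1252 (where byte 0x8d is undefined). The helper `read_first_line` and the sample filename are hypothetical and are not taken from sqlitefid.

```python
import os
import tempfile


def read_first_line(path, encoding):
    """Read the first line of a file using an explicit encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.readline()


# "č" is U+010D, which UTF-8 encodes as the bytes C4 8D.
sample = "filename : '/media/Vlček.mov'\n"
with tempfile.NamedTemporaryFile("wb", suffix=".yaml", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    sample_path = tmp.name

try:
    read_first_line(sample_path, "cp1252")  # what a cp1252 default would do
    cp1252_error = None
except UnicodeDecodeError as err:
    cp1252_error = err

# Naming the encoding makes the read independent of the system code page.
utf8_line = read_first_line(sample_path, "utf-8")
os.remove(sample_path)
```

If the export might not be UTF-8, `errors="replace"` on the `open()` call would keep the read from failing, at the cost of losing the original characters.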

ross-spencer commented 2 years ago

Ack! Yep, @kieranjol I could have and should have picked up on this. Tests cover this - BUT - the tests haven't been run on Windows 😢 I am seeing this today. I don't know if I have a fix readily available but will see.

ross-spencer commented 2 years ago

Hi again @kieranjol - not sure if I have fixed this for your use case but there are some commits against this in the linked pull requests. They fix some Windows specific issues found while developing on the platform today.

I would add, on your end, this change may help: https://stackoverflow.com/a/52617143

I'd like to test the original issue but it looks like the attached YAML suffered mojibake on the command line and so didn't come through correctly. I do seem to be working on the same set of files today though: https://github.com/exponential-decay/demystify/issues/95 - and there are some fixes around that which will, if nothing else, make it work on demystify-lite, which is one way of going about this.

Not closing this issue as I'd still like to get to the bottom of this. Will implement Windows testing in CI.

kieranjol commented 1 year ago

I tried out git master to no avail in Windows CMD, same errors - HOWEVER your stackoverflow post fixed the issue. I was able to produce html and text via piping with git master by running chcp 65001 & set PYTHONIOENCODING=utf-8 before the command. I would imagine that this might fix a whole bunch of other issues too. Thank you Ross, demystify is amazing!
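
A small sketch of what the `PYTHONIOENCODING` half of that workaround does, assuming only the standard library: the variable is read by the interpreter at start-up and overrides the encoding it would otherwise choose for piped stdout. The function name `child_stdout_encoding` is hypothetical; `chcp 65001` is a separate step that switches the console code page and is not shown here.

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Start a child Python with a piped stdout and report its stdout encoding."""
    env = {**os.environ, **extra_env}
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        capture_output=True,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


# Equivalent to running `set PYTHONIOENCODING=utf-8` before the command.
forced = child_stdout_encoding({"PYTHONIOENCODING": "utf-8"})
print(forced)
```

Because the setting lives in the environment, it applies to every Python program started from that shell, which may be why it clears up other encoding issues as well.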

ross-spencer commented 1 year ago

This commit adds some logging that will help users understand what their system is reporting and provide information on the workaround if it happens to them: https://github.com/exponential-decay/demystify/commit/c80d5f2bd4d64e6189b04ecb025ab72d77c429dd
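
For readers who want a similar check in their own scripts, here is a minimal sketch of that kind of diagnostic logging. It is not the code from the linked commit; the function names `encoding_report` and `log_encoding_hint` and the message wording are assumptions made for illustration.

```python
import locale
import logging
import sys


def encoding_report():
    """Collect the encodings the running interpreter reports."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None) or "",
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


def log_encoding_hint(logger):
    """Log the reported encodings and point to the workaround if stdout is not UTF-8."""
    report = encoding_report()
    logger.info("Reported encodings: %s", report)
    normalized = report["stdout"].lower().replace("-", "").replace("_", "")
    is_utf8 = normalized == "utf8"
    if not is_utf8:
        logger.warning(
            "stdout is not UTF-8; on Windows try: "
            "chcp 65001 & set PYTHONIOENCODING=utf-8"
        )
    return is_utf8


logging.basicConfig(level=logging.INFO)
stdout_is_utf8 = log_encoding_hint(logging.getLogger("encoding-check"))
```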