HairySpoon / htlfc

Hypertext Legacy File Converter
GNU Affero General Public License v3.0

I have a hard time understanding where cp1254 comes from #2

Closed mcepl closed 1 year ago

mcepl commented 1 year ago

With errata.maff (had to rename the file to .zip so that GH would accept it) I get this:

stitny~/a/2/c/C/original (master)$ htlfc -b errata.maff
Traceback (most recent call last):
  File "/home/matej/.bin/htlfc", line 8, in <module>
    sys.exit(run_htlfc())
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/__init__.py", line 6, in run_htlfc
    main.main()
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/main.py", line 118, in main
    source = loader.unpack(infile)
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/agents/loader.py", line 40, in unpack
    manifest.make(source)
  File "/home/matej/.local/lib/python3.10/site-packages/htlfc/merger/manifest.py", line 40, in make
    content = content.decode(enc).replace('&quot;','"').split()
  File "/usr/lib64/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9e in position 2289: character maps to <undefined>
stitny~/a/2/c/C/original (master)$ 

I have a hard time understanding where cp1254 (Turkish???) comes from, when all files are clearly labelled as being in the windows-1250 codepage.

HairySpoon commented 1 year ago

Thanks for reporting this issue. Your sample file helped: I am able to reproduce the error.

I believe the fault originates on the previous line in manifest.py, where the chardet module attempts to detect the encoding. It returns "Windows-1254" (I see this with a debug print statement). The situation is discussed at https://github.com/chardet/chardet/issues/148 without a satisfactory solution. Allow me a few days to develop a workaround.
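To illustrate the failure mode, here is a minimal sketch. It is not the code in manifest.py, and it is not necessarily how the eventual fix works. Byte 0x9e is "ž" in windows-1250 but is undefined in windows-1254, so a strict decode with the wrongly detected codec raises. The helper `decode_with_fallback` and its parameters `declared` and `detected` are hypothetical names introduced for this example. The detected encoding is passed in as a plain string, so the sketch does not depend on chardet.

```python
from typing import Optional


def decode_with_fallback(
    raw: bytes,
    declared: Optional[str] = None,
    detected: Optional[str] = None,
) -> str:
    """Decode raw bytes, preferring the declared charset over the detected one.

    Hypothetical helper: `declared` stands for the charset named in the file
    itself, `detected` for whatever an encoding detector guessed.
    """
    for enc in (declared, detected, "utf-8"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: never raise, substitute U+FFFD for undecodable bytes.
    return raw.decode("utf-8", errors="replace")


# Czech text encoded in windows-1250; its first byte is 0x9e ("ž").
sample = "žluťoučký".encode("cp1250")

try:
    sample.decode("cp1254")  # what happens when the detector's guess is trusted
except UnicodeDecodeError as exc:
    print("cp1254 failed:", exc.reason)

result = decode_with_fallback(
    sample, declared="windows-1250", detected="Windows-1254"
)
print(result == "žluťoučký")
```

Trying the declared charset first means a wrong guess from the detector only matters when the file carries no usable label. The final `errors="replace"` step trades exactness for never crashing.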

HairySpoon commented 1 year ago

Just released v0.3.0 on PyPI with my solution. Please evaluate and report.

mcepl commented 1 year ago

Yes, it works!

Thank you.