fleetingbytes / rtfparse

RTF Parser
MIT License

unknown encoding ansi #11

Closed: user3472g closed 8 months ago

user3472g commented 1 year ago

rtfparse 0.8.0 fails immediately when attempting to parse a valid RTF file.

Usage:

pip3 install rtfparse
rtfparse -r file.rtf


parsing the structure of file.rtf  
recognized encoding ansi  
unknown encoding: ansi  
Traceback (most recent call last):  
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/parser.py", line 88, in parse_file 
    self.parsed = entities.Group(encoding, file)  
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/entities.py", line 187, in __init__  
    self.structure.append(Control_Word(self.encoding, file))  
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/entities.py", line 78, in __init__  
    self.control_name = match.group("control_name").decode(self.encoding)  
LookupError: unknown encoding: ansi
Structure of file.rtf parsed
Uncaught exception AttributeError("'Rtf_Parser' object has no attribute 'parsed'") occurred.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/cli.py", line 128, in main
    run(cli_args)
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/cli.py", line 98, in run
    rp.parse_file()
  File "/home/user/.local/lib/python3.10/site-packages/rtfparse/parser.py", line 96, in parse_file
    return self.parsed
AttributeError: 'Rtf_Parser' object has no attribute 'parsed'

fleetingbytes commented 1 year ago

Thank you for the report, I'll look into it

fleetingbytes commented 1 year ago

@user3472g rtfparse's error handling is at fault. The RTF you are trying to decode declares the ANSI encoding. Python only provides a codec for this on Windows. Your Python could not find the ansi codec, so it raised a LookupError, which rtfparse does not handle.

I will publish a bugfix that handles this error properly, but rtfparse will still be unable to parse ansi-encoded RTFs on non-Windows platforms.
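
For illustration only, here is a rough sketch of what handling the error could look like. It is not rtfparse's actual code: `SketchParser` and `parse_bytes` are hypothetical names, and the real fix may look quite different. The idea is that `parsed` always exists and an unavailable codec produces one clear error instead of a LookupError followed by an AttributeError.

```python
class SketchParser:
    """Hypothetical stand-in for illustration; not rtfparse's Rtf_Parser."""

    def __init__(self) -> None:
        # Define the attribute up front so a failed parse cannot
        # lead to "object has no attribute 'parsed'" later on.
        self.parsed = None

    def parse_bytes(self, data: bytes, encoding: str) -> str:
        try:
            self.parsed = data.decode(encoding)
        except LookupError as err:
            # bytes.decode raises LookupError when Python has no codec
            # registered under the requested name.
            raise ValueError(
                f"cannot decode RTF: encoding {encoding!r} is not "
                "available on this platform"
            ) from err
        return self.parsed
```

With something like this, the caller sees a single exception that names the missing encoding.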

fleetingbytes commented 1 year ago

Wikipedia says that "Windows-1252 is referred to as "ANSI" especially often", so I will try to interpret it as cp1252; I guess that is what is meant. I have never seen an RTF use MBCS (which is what Python understands by "ansi").
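
A minimal sketch of that idea, assuming a simple alias table: `RTF_ENCODING_ALIASES` and `resolve_encoding` are hypothetical names made up for this example, not part of rtfparse. The name "ansi" is mapped to cp1252 before Python's codec lookup, so the result is the same on every platform.

```python
import codecs

# Hypothetical alias table: treat the RTF "ansi" name as Windows-1252.
RTF_ENCODING_ALIASES = {"ansi": "cp1252"}


def resolve_encoding(name: str, fallback: str = "cp1252") -> str:
    """Return a codec name Python can use for the given RTF encoding name."""
    candidate = RTF_ENCODING_ALIASES.get(name.lower(), name)
    try:
        # codecs.lookup raises LookupError for unknown codec names.
        return codecs.lookup(candidate).name
    except LookupError:
        return fallback


# 0xE9 is "é" in Windows-1252.
text = b"caf\xe9".decode(resolve_encoding("ansi"))
```

Falling back to cp1252 for unknown names is a design choice of this sketch; raising an error instead would be equally reasonable.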

fleetingbytes commented 1 year ago

@user3472g try rtfparse 0.8.1; you can upgrade via pip.