fleetingbytes / rtfparse

RTF Parser
MIT License

Infinite Loop Issue #18

Closed. ninoseki closed this issue 6 months ago

ninoseki commented 6 months ago

Hello, first of all, thank you for creating this library.

I found an infinite loop issue and would like to report it.

When given this file as input,

Rtf_Parser("/path/to/file").parse_file()

loops indefinitely, printing:

Missing Control Word
Missing Control Word
Missing Control Word
...
fleetingbytes commented 6 months ago

Hello @ninoseki, thank you for using rtfparse and for this bug report. Reading the log file ~/rtfparse/rtfparse.debug.log, I found that the error occurs while reading the control word htmlrtf on line 622 (the Rich Text Format (RTF) Specification, Version 1.9.1, defines control words on page 7). Line 622 of the source file contains the byte sequence {\htmlrtf0Start with $200 credit. This is valid RTF, but rtfparse cannot find the end of the control word here.

The control word "htmlrtf" has a one-digit parameter "0", and the parameter is delimited by a character other than an ASCII digit, here "S". This "S" indicates that the control word ended at the previous byte. rtfparse does not recognize this correctly. I will fix this.
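
To illustrate the delimiting rule described above, here is a minimal sketch in Python. It is not rtfparse's code: the pattern `CONTROL_WORD` and the helper `read_control_word` are hypothetical names made up for this example, and the pattern only covers a letter sequence, an optional integer parameter, and an optional single space as delimiter.

```python
import re

# Hypothetical pattern, NOT rtfparse's own: a backslash, ASCII letters,
# an optional (possibly negative) integer parameter, and an optional single
# space, which is the only delimiter that is consumed with the control word.
CONTROL_WORD = re.compile(
    rb"\\(?P<name>[a-zA-Z]+)(?P<parameter>-?[0-9]+)?(?P<delimiter> )?"
)


def read_control_word(data: bytes, pos: int = 0):
    """Return (name, parameter, end) for the control word at data[pos:], or None."""
    match = CONTROL_WORD.match(data, pos)
    if match is None:
        return None
    raw = match.group("parameter")
    parameter = int(raw) if raw is not None else None
    return match.group("name"), parameter, match.end()


# The byte sequence from line 622; the control word starts after the "{".
sample = rb"{\htmlrtf0Start with $200 credit."
name, parameter, end = read_control_word(sample, 1)
rest = sample[end:]  # text that follows the control word, beginning with "S"
```

In this sketch the "S" is not consumed: the control word ends right after the "0", and `rest` begins with the plain text. With a space after the "0", the space is consumed as the delimiter and the remaining text is the same.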

Meanwhile, a workaround for your document would be to add a space between the 0 and the S: {\htmlrtf0 Start with $200 credit.

fleetingbytes commented 6 months ago

Note to self: a potential fix in re_patterns.py could be:

nothing = named_regex_group("nothing", group(rb""))
...
delimiter = named_regex_group("delimiter", rb"|".join((space, newline, other, nothing, rb"$")))

Needs testing.
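
A rough way to see why an empty alternative might help, sketched in Python under heavy assumptions: the real definitions of `named_regex_group`, `space`, `newline` and `other` in re_patterns.py are not shown in this thread, so everything below is a hypothetical stand-in (in particular, `other` is assumed to be a lookahead for a character that is neither a letter nor a digit).

```python
import re


def named_regex_group(name: str, content: bytes) -> bytes:
    """Hypothetical stand-in: wrap a pattern in a named group."""
    return b"(?P<" + name.encode("ascii") + b">" + content + b")"


# All of these definitions are assumptions made for illustration only.
space = named_regex_group("space", rb" ")
newline = named_regex_group("newline", rb"\r?\n")
other = named_regex_group("other", rb"(?=[^a-zA-Z0-9])")
nothing = named_regex_group("nothing", rb"")  # matches the empty string

# Control word head: backslash, letters, optional integer parameter.
head = rb"\\(?P<name>[a-zA-Z]+)(?P<parameter>-?[0-9]+)?"


def build(alternatives) -> re.Pattern:
    """Compile head + a delimiter group made of the given alternatives."""
    delimiter = named_regex_group("delimiter", rb"|".join(alternatives))
    return re.compile(head + delimiter)


strict = build((space, newline, other))
lenient = build((space, newline, other, nothing, rb"$"))

sample = rb"\htmlrtf0Start with $200 credit."
strict_match = strict.match(sample)    # no alternative fits before "S"
lenient_match = lenient.match(sample)  # the empty alternative fits
```

Under these assumptions the strict pattern finds no match at all, while the lenient one ends the control word directly after the parameter. Because the name and parameter quantifiers are greedy, the longest name and parameter are tried first, but an empty alternative also lets the pattern succeed after backtracking to a shorter one, which is one reason the idea needs testing.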

fleetingbytes commented 6 months ago

@ninoseki Try the new rtfparse 0.9.0 (it's on PyPI). The issue should be fixed there. If you used rtfparse programmatically, please note that some things in the API were renamed. If you only executed rtfparse from the CLI, not much has changed, except that it uses --decapsulate-html instead of --de-encapsulate-html.

ninoseki commented 6 months ago

Thanks!