Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0
5.99k stars 578 forks source link

Fix re raw string changes in sigil_bs4/dammit.py #756

Closed mihailim closed 6 months ago

mihailim commented 6 months ago

Commit 434a407 solves the SyntaxWarnings in sigil_bs4/dammit.py, however the fix accidentally introduces semantic changes in two of the regexes on line 50 and 52. The original non-raw-strings used single quotes, and there were escaped single quotes inside the expressions. Turning them into raw strings by simply prefixing them with r doesn't work in this case, because the backslash character is then taken literally, rendering the corresponding regex non-functional for its intended purpuse. Instead, we must use raw triple-quoted strings to ensure the regex semantics stay the same as in the original rendering.

Concretely, the situation looks like this:

Under Python 2:

Original, before 434a407:

>>> '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Version from 434a407 (doubles backslashes):

>>> r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
'^<\\?.*encoding=[\\\'"](.*?)[\\\'"].*\\?>'
>>> r'<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
'<\\s*meta[^>]+charset\\s*=\\s*["\\\']?([^>]*?)[ /;\\\'">]'

Version from this fix (restores original semantics):

>>> r'''^<\?.*encoding=['"](.*?)['"].*\?>'''
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> r'''<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]'''
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Under Python 3.9:

Original, before 434a407:

>>> '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Version from 434a407 (doubles backslashes):

>>> r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
'^<\\?.*encoding=[\\\'"](.*?)[\\\'"].*\\?>'
>>> r'<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
'<\\s*meta[^>]+charset\\s*=\\s*["\\\']?([^>]*?)[ /;\\\'">]'

Version from this fix (restores original semantics):

>>> r'''^<\?.*encoding=['"](.*?)['"].*\?>'''
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> r'''<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]'''
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Under Python 3.12:

Original, before 434a407 (obviously with SyntaxWarnings):

>>> '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
<stdin>:1: SyntaxWarning: invalid escape sequence '\?'
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
<stdin>:1: SyntaxWarning: invalid escape sequence '\s'
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Version from 434a407 (doubles backslashes):

>>> r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>'
'^<\\?.*encoding=[\\\'"](.*?)[\\\'"].*\\?>'
>>> r'<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'
'<\\s*meta[^>]+charset\\s*=\\s*["\\\']?([^>]*?)[ /;\\\'">]'

Version from this fix (restores original semantics without SyntaxWarnings):

>>> r'''^<\?.*encoding=['"](.*?)['"].*\?>'''
'^<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
>>> r'''<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]'''
'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

Have checked that this doesn't change the runtime values of the resulting strings under Python versions 2.7, 3.9, 3.11, and of course 3.12.

Please note that I have NOT touched sigil_custom_changes_to_bs4-4.4.0.patch.txt since I'm not 100% on whether that's still up to date, how it was generated, and so on. Depending on our policy on how to handle that file it might be necessary to regenerate it to include the changes.

Aside: Accidentally double-checked that all the fixes in 434a407 also didn't change the semantics values for the other element.py and dammit.py changes: Initially I was accidentally working off 2.1.0-release, fixed the SyntaxWarning issues in both files, and only found out about the existence of 434a407 when I noticed that git wasn't detecting any changes in element.py. In other words, we independently performed the exact same changes for those :)

kevinhendricks commented 6 months ago

Sounds good, and your explanation of how I messed up those two lines makes sense.

Thank you!

kevinhendricks commented 6 months ago

The patch file is old so no worries. It was based off pre-python3 version of bs4. Sigil has long moved to python3 only. There are still remnants of python2 support lying around but these are not used. The patch is also no longer used. I will remove it.

Thanks again.

mihailim commented 6 months ago

Thank you, much obliged! I wasn't 100% sure whether we weren't still keeping py2 compat somewhere somehow, it's nice to know I can let go of history as well :)