Sigil-Ebook / Sigil

Sigil is a multi-platform EPUB ebook editor
GNU General Public License v3.0
5.99k stars 578 forks source link

PCRE expression causes segmentation fault #627

Closed 0xf00fc7c8 closed 3 years ago

0xf00fc7c8 commented 3 years ago

When I use the find tool with the PCRE expression <ins.*class="adsbygoogle"(.|\n)*<\/ins>, and it finds a match, it causes a segmentation fault. After compiling with debug symbols and running a backtrace, I notice there seems to an infinite loop caused by the search and replace logic in 3rdparty/pcre/pcre_exec.c. I couldn't exactly tell what was wrong with the code, but I'm guessing the infinte loop caused a stack overflow, which caused a segfault? I added as much of the backtrace as I thought would be useful. I also uploaded a epub file I was editing (as a zip file). Disciple of Immortal.zip

#16360 0x0000555555a3a823 in match (eptr=<optimized out>, ecode=0x555557bc68a6, mstart=0x55555753479a, offset_top=4, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:2061
#16361 0x0000555555a323b9 in match (eptr=<optimized out>, ecode=0x555557bc6896, mstart=0x55555753479a, offset_top=4, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:983
#16362 0x0000555555a3a823 in match (eptr=<optimized out>, ecode=0x555557bc68a6, mstart=0x55555753479a, offset_top=4, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:2061
#16363 0x0000555555a323b9 in match (eptr=<optimized out>, ecode=0x555557bc6896, mstart=0x55555753479a, offset_top=2, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:983
#16364 0x0000555555a3349f in match (eptr=<optimized out>, ecode=0x555557bc6894, mstart=0x55555753479a, offset_top=2, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:1878
#16365 0x0000555555a362ea in match (eptr=<optimized out>, ecode=<optimized out>, mstart=0x55555753479a, offset_top=2, md=<optimized out>, eptrb=0x0, rdepth=<optimized out>)
    at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:5934
#16366 0x0000555555a3c83e in pcre16_exec (argument_re=0x555557bc67f0, extra_data=0x555558719750, subject=0x555557533f88, length=length@entry=18427, 
    start_offset=start_offset@entry=0, options=options@entry=0, offsets=0x5555586bea70, offsetcount=6) at /home/john/Code/Sigil/3rdparty/pcre/pcre_exec.c:6935
#16367 0x00005555559962b4 in SPCRE::getFirstMatchInfo (this=this@entry=0x55555748b460, text=...) at /usr/include/x86_64-linux-gnu/qt5/QtCore/qstring.h:1027
#16368 0x00005555559a9721 in CodeViewEditor::FindNext (this=0x55555bd2c270, search_regex=..., search_direction=Searchable::Direction_Down, misspelled_words=<optimized out>, 
    ignore_selection_offset=<optimized out>, wrap=<optimized out>, marked_text=false) at /home/john/Code/Sigil/src/ViewEditors/CodeViewEditor.cpp:856
#16369 0x0000555555a02449 in FindReplace::FindText (this=0x5555565c60f0, direction=Searchable::Direction_Down) at /home/john/Code/Sigil/src/MainUI/FindReplace.cpp:605
#16370 0x0000555555a029c5 in FindReplace::FindClicked (this=0x5555565c60f0) at /home/john/Code/Sigil/src/MainUI/FindReplace.cpp:268
kevinhendricks commented 3 years ago

With what version of Sigil on what platform? If Linux or Unix was an external pcre library version used when building Sigil. It lookslike the internal version was used by the backtrace but sometimes even backtrace source lines can be wrong.

kevinhendricks commented 3 years ago

Did you have either DOTALL or Minimal Match flags set? it appears your regex is trying to do its own equivalent of DOTALL by or with \n.

Does removing that and using the dotall flag with minimal match change anything?

You are trying to capture all text after the adsbygoogle class (even including the ins begin tag ending > up to the closing ins tag. Correct? After replace this will create not well-formed xhtml as you will remove the ending > of the ins tag.

0xf00fc7c8 commented 3 years ago

@kevinhendricks This version is on Ubuntu Linux from commit 02f9d41acd0f250afa7be5736a98e39ccbb34a90, which was HEAD when I cloned it. It's linked against multiple PCRE versions btw, not really sure why but I assume it's intended.

libpcre2-16.so.0 => /lib/x86_64-linux-gnu/libpcre2-16.so.0 (0x00007fbd72e33000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007fbd6b7db000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fbd6cc9c000)
kevinhendricks commented 3 years ago

I will try your testcase on my macOS machine just to verify it causes an infinite loop.

kevinhendricks commented 3 years ago

It should not be. It should be using the one internal to Sigil unless you tell it otherwise using build flags. BTW, Qt has its own pcre implementation but it is limited to the Qt side.

kevinhendricks commented 3 years ago

Let me see if I can recreate your crash on macOS just to rule out compiler/optimization bugs.

0xf00fc7c8 commented 3 years ago

Does removing that and using the dotall flag with minimal match change anything?

I wasn't really sure how to create an equivalent expression this way, but <ins.*class="adsbygoogle"* with DotAll set didn't cause an error

After replace this will create not well-formed xhtml as you will remove the ending > of the ins tag.

I don't think so? When I tested the expression on regexr.com, it captured all of this

<ins class="adsbygoogle" style="display:block; text-align:center;" data-ad-layout="in-article" data-ad-format="fluid" data-ad-client="ca-pub-5693924549416081" data-ad-slot="7912930256">

      <iframe id="aswift_2" style="height: 1px !important; max-height: 1px !important; max-width: 1px !important; width: 1px !important;">&lt;iframe id="google_ads_frame1"&gt;</iframe>

      </ins>
kevinhendricks commented 3 years ago

That is the full match not the capture group in ( ).

kevinhendricks commented 3 years ago

Okay I can definitely recreate this on my macOS master build. You are correct I get an abort caused by a stack guard so it is an infinite loop or recursion that is causing the problem.

Our version of PCRE is quite old now so I will check to see if pcre1 has any reports of similar bugs and a potential solution.

As a workaround you should enable the DOTALL flag and Minimal Match and replace the (.|\n) with just a .

kevinhendricks commented 3 years ago

And you are right there is nothing in the () capture group, it was just for the or .|\n so my mistake.

kevinhendricks commented 3 years ago

I also seem to have issues with dotall set but not minimal match yet there is only one possible match in the entire test file. Very strange.

I am not up on the pcre (version 1) library code at all. So I have no good guesses yet. If you have access to a pcre (version 1 library) on your box, it is probably a bit newer than what is used in our 3rdparty internal version. You might try rebuilding and setting the build flag to use your system version of pcre and see if that fixes the issue. If it does, please let us know and I will try to backport the newer version back to Sigil's 3rd party.

Otherwise, I will try and see if I can track it down.

kevinhendricks commented 3 years ago

BTW, Thank you for your bug report and testcase. Sorry I do not have an immediate fix for this one but definitely try your system level pcre version 1 library in case it has already fixed this bug. Otherwise please use the workaround until a real fix comes.

0xf00fc7c8 commented 3 years ago

As a workaround you should enable the DOTALL flag and Minimal Match and replace the (.|\n)* with just a .

Your workaround worked, thank you.

kevinhendricks commented 3 years ago

As a note to self ... see https://www.pcre.org/original/doc/html/pcrestack.html

This indicates that some valid regexes can cause infinite recursion calls depending on exactly what sub patterns are used and it is not caused by a bug per se.

If this is the case and this specific regex search is an example of such, we may need to set recursion limits to prevent stack overflow allowing this to search to fail gracefully.

dougmassay commented 3 years ago

We do set a match recursion limit, but it seems a bit exorbitant.

https://github.com/Sigil-Ebook/Sigil/blob/master/3rdparty/cmake/pcre.cmake#L65

kevinhendricks commented 3 years ago

We should definitely set the recursion limit to be much lower. But hopefully all of this is just the result of a jit or pcre bug that has already been fixed in the final version of pcre which I think is version 8.45 available from pcre.org. If Arch still uses the older pcre lib hopefully it will be up to date so we can check to see if it helps.

kevinhendricks commented 3 years ago

Ourcurrent version appears to be 8.37 from April of 2015 via John.

dougmassay commented 3 years ago

Arch is on pcre 8.44 (flagged as out of date June 16 -- 8.45 is currently in testing), and I can confirm that the segfault still happens when Sigil is built using the system libs option. I shouldn't think it would be long before 8.45 is out of testing, but is there anything in the changelog between 8.44 and 8.45 that would suggest this sort of thing has been corrected?

Looks like Arch is applying some patches that are included in the pcre source,

# apply patch from the source array (should be a pacman feature)
local filename
for filename in "${source[@]}"; do
if [[ "$filename" =~ \.patch$ ]]; then
    msg2 "Applying patch ${filename##*/}"
    patch -p1 -N -i "$srcdir/${filename##*/}"
fi
done

but the configure command itself is pretty straightforward:

 ./configure  \
    --prefix=/usr \
    --enable-unicode-properties \
    --enable-pcre16 \
    --enable-pcre32 \
    --enable-jit \
    --enable-pcregrep-libz \
    --enable-pcregrep-libbz2 \
    --enable-pcretest-libreadline

Unless it's in one of those pcre.org-included patches, I don't see Arch setting any recursion limits.

dougmassay commented 3 years ago

I can also confirm the suggested workaround still works with pcre 8.44

kevinhendricks commented 3 years ago

I took a look and really saw nothing that might fix this in the tiny patch between 8.44 and 8.45. So my guess is it will not help. I think we should update to 8.45 in 3rdparty either way just to get what fixes we can for the future.

Each time a () is reached a recursive call is made. So I am wondering if we replaced the ( . | \n ) with [\n.] just to see what impact that has. I am just not sure why this particular regex is causing such problems. It really makes no sense to me as a hard pattern starts and ends it.

We will probably just have to change the recursion limit to see if that can prevent the abort/segfault.

Kevin

On Jun 28, 2021, at 8:37 AM, Doug Massay @.***> wrote:

 Arch is on pcre 8.44 (flagged as out of date June 16 -- 8.45 is currently in testing), and I can confirm that the segfault still happens when Sigil is built using the system libs option. I shouldn't think it would be long before 8.45 is out of testing, but is there anything in the changelog between 8.44 and 8.45 that would suggest this sort of thing has been corrected?

Looks like Arch is applying some patches that are included in the pcre source,

apply patch from the source array (should be a pacman feature)

local filename for filename in "${source[@]}"; do if [[ "$filename" =~ .patch$ ]]; then msg2 "Applying patch ${filename##/}" patch -p1 -N -i "$srcdir/${filename##/}" fi done but the configure command itself is pretty straightforward:

./configure \ --prefix=/usr \ --enable-unicode-properties \ --enable-pcre16 \ --enable-pcre32 \ --enable-jit \ --enable-pcregrep-libz \ --enable-pcregrep-libbz2 \ --enable-pcretest-libreadline Unless it's in one of those pcre.org-included patches, I don't see Arch setting any recursion limits.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

eli-schwartz commented 3 years ago

There are no actual patches being applied by arch, that's unused boilerplate included, as the comments suggest, for political reasons.

If there were patches downloaded alongside the pcre tarball (not inside the extracted tarball) this code would do an automatic patching process similar to debian's autofoo magic.

dougmassay commented 3 years ago

Thanks. Does anyone know a quick/simple way we can take Sigil out of the equation just to confirm it's not our implementation?

kevinhendricks commented 3 years ago

There really is not that much intervening code. Some C++ wrapper pieces around Qt strings but nothing much. My guess is we are just seeing either a bug in pcre or one of those corner cases mentioned in the pcrestack link.

We could go to using the heap and not the stack according to the docs but this is supposed to be much slower and just changes the problem area and delays the inevitable.

Perhaps if pcre2 has fixed this issue, we might want to think about moving to pcre2 but there are differences in atomic /jit compile related changes that impact multithreaded environments (which is how we use pcre in many places). I remember this hitting Qt 5.12 and later when they moved to pcre2. I remember filing a Qt bug report showing one such problem.

kevinhendricks commented 3 years ago

How big are typical default stack limits for single threads under modern / Arch Linux? The pcrestack docs say to figure 500 bytes for each call frame (very roughly). If we know that and the stack limit per thread we can figure out a good value for the recursion limit (after dividing by 2 as a safety factor).

kevinhendricks commented 3 years ago

FWIW,

Instead of:

<ins.*class="adsbygoogle"(.|\n)*<\/ins>

other variations work without any recursion problems without DOTALL but must have Minimal Match set.

<ins.*class="adsbygoogle"(.|\n)+<\/ins>
<ins.*class="adsbygoogle"(.|\R)+<\/ins>

So infinite recursion is caused by not using MINIMAL MATCH.

Does anyone know if calibre uses pcre (version 1 library) or if they use python's built-in or other regex in their editor? If so, it would be interesting to see how they handle this test case.

dougmassay commented 3 years ago

Pretty-sure calibre exclusively uses Barnett's python regex module as a drop-in replacement for re

kevinhendricks commented 3 years ago

I just double-checked and even with the original regex and MINIMAL MATCH it works. So you do not need DOTALL set. Just MINIMAL match.

So somehow when searching for the largest matching set (default without minimal match) along with this pattern of wiildcards leads to the (near) infinite recursion and trouble.

kevinhendricks commented 3 years ago

Yes calibre's regex implementation handles it just fine both with minimal matching and not.

Using

<ins.*class="adsbygoogle"(.|\n)*<\/ins>

captures everything between and including the two matches.

whereas

<ins.*?class="adsbygoogle"(.|\n)*?<\/ins>

captures each instance separately both with no issues.

Might be time to explore pcre2.

kevinhendricks commented 3 years ago

FWIW, I made the following change:

--- pcre.cmake~ 2020-09-27 18:55:41.000000000 -0400
+++ pcre.cmake  2021-06-28 11:45:34.000000000 -0400
@@ -62,7 +62,8 @@
 SET(PCRE_LINK_SIZE "2")
 SET(PCRE_POSIX_MALLOC_THRESHOLD "10")
 SET(PCRE_MATCH_LIMIT "10000000")
-SET(PCRE_MATCH_LIMIT_RECURSION "MATCH_LIMIT")
+# SET(PCRE_MATCH_LIMIT_RECURSION "MATCH_LIMIT")
+SET(PCRE_MATCH_LIMIT_RECURSION "4000")
 SET(PCRE_PARENS_NEST_LIMIT "250")

 SET(SUPPORT_PCRE16 1)

Based on roughly 500 bytes of stack used for each call frame and a 2MB stack size on Linux and a 8MB stack size limit on macOS.

Now pcre no longer crashes and Sigil simply reports "no matches found".

Changing nothing but making the second count "*" non-greedy by adding a "?" is enough to prevent the issue.

<ins.*class="adsbygoogle"(.|\n)*<\/ins>

to:

<ins.*class="adsbygoogle"(.|\n)*?<\/ins>

So we can simply compile in a new recursion limit to pcre to prevent crashes until we either move to pcre2 (if that even works any better) for Windows and macOS which use our own 3rdparty build.

But we need to figure out how to handle things on Linux when people use system level pcre libraries to build Sigil with that do not have the stack recursion limit set (unless pcre 8.45 somehow makes things work).

kevinhendricks commented 3 years ago

Interestingly enough pcre2 has no equivalent for PCRE_MATCH_LIMIT_RECURSION so this particular bug should not exist in pcre2 in the very latest versions.

If I get some free time, I will look into trying to see how hard it would be to move Sigil to use the pcre2 libraries

kevinhendricks commented 3 years ago

Did some searching about ulimits on stack size and most recent Linux uses 8,192 kb just like macOS. If a call frame takes up roughly 500 bytes on the stack, that means the recursion limit could go as high as 16000, so perhaps changing this limit from 10,000,000 to 12,000 or so makes the most sense here. I will push that change to master and we can test if it prevents the segfault on macOS, Linux, and Windows and adjust it up or down.

Still have no idea how to deal with this issue for system level pcre (version 1) libs being used on *nix.

Any ideas welcome.

eli-schwartz commented 3 years ago

The most logical way to solve this would be if pcre has both a configure time option to set the default limit, and a runtime parameter to override or further reduce the limit.

If it's configure time only then that seems like a bit of awkward usability...

kevinhendricks commented 3 years ago

I will look into that. Pushed to master 3rdparty build change to use a 12000 recursion limit and tested it on macOS with this test case and it does its job.

kevinhendricks commented 3 years ago

Okay, based on pcretest.c there appears to be a way to programmatically set the recursion limit after the study phase is done (after compiling the pattern).

I will try to look into it and how we could incorporate it into our C++ wrapper for pcre.

kevinhendricks commented 3 years ago

Found enough info in the pcreabi docs to figure out how to set the recursion limit for each search upon creation (runtime). Reverted the pcre cmake change and added to master the runtime setting of the recursion limit setting so it should work just fine when Sigil is built with external pcre libs or internal pcre.

FWIW - we never seem to be using the JIT anywhere here. We may want to look into that at some point.

If someone with access to the Linux box with an external pcre lib, please use this test case to verify the segfault no longer occurs.

dougmassay commented 3 years ago

I'll check it out and get back to you.

dougmassay commented 3 years ago

I can confirm that the original test case does not segfault (with either the bundled pcre or an external pcre) with the new runtime recursion limit in place. Nice!

kevinhendricks commented 3 years ago

Great! I will upgrade to version 8.45 in 3rdparty and we can explore pcre2 at some future point, if we want.

Closing this.

kevinhendricks commented 3 years ago

Some final notes on this issue:

At some point we should explore if using the JIT will actually speed up Find and Replace in Sigil and if so, enable it. pcre is smart enough to fallback to the current non-jit approach is need be according to the docs.

kevinhendricks commented 3 years ago

I am re-opening this because we now have a decision to make.

I looked into all of the Sigil code and could not find even one place where we used QtConcurrent (multiple threads) to run a search anyplace. This is due to the wrap flag and starting file in Find and Replace as order of search matters and must be kept.

So if we enable pcre's JIT it will use its own jit match() which has a much smaller call frame during recursion.

That then allows us to use a single threaded approach to allocating a growing jit stack for pcre (starts at 32K and grows to 1 meg) and then clean up after ourselves in our SPCRE class.

I modified my local repo to do just that. With this change in place, we get no segfault AND that search actually succeeds (and properly grabs both cases in greedy mode). No errors at all.

If the JIT fails for any reason it will fall back to the normal non-jit match and our recursion limit still prevents segfaults.

That then begs the question:

  1. Should we move to using pcre's JIT and then allocating our own JIT growing stack so that valid searches do actually succeed (just like on Calibre for example)

Or

  1. Should we leave it as is (with no JIT) and get a Not Found message to indicate that the search as it stands can not be run.

FWIW ... we have always enabled pcre's JIT in Sigil (in 3rdparty) but we have for some reason never actually used it. I would assume that all external pcre libraries enable JIT since the list of arches it supports is quite vast.

So my personal vote is to move to using pcre's JIT capability with our own stack allocation during runtime so that even strange regex's will not crash Sigil and will actually succeed.

@dougmassay and @eli-schwartz Any thoughts?

eli-schwartz commented 3 years ago

The Arch package for pcre definitely uses --enable-jit so that's no problem here.

Making sigil better by making regexes succeed instead of running into recursion limits and aborting the search? Seems like a no-brainer, do it!

kevinhendricks commented 3 years ago

Based on a google search, ubuntu (and its derivatives?) may not enable the pcre jit in "trusty" and earlier (and even possibly later?). If we do this, we should test on such a platform.

In the worst case, ubuntu users on trusty and older systems can build with our internal 3rdparty version as it has been updated to the very latest (and final) version of the pcre lib now.

eli-schwartz commented 3 years ago

Looking at the Debian/Ubuntu packaging they do have:

Ubuntu 14.04 is way too long ago to worry about IMO, they'd have only Qt 5.2.1 anyway plus even on the scale of LTS releases it's no longer supported.

kevinhendricks commented 3 years ago

I will push the change. If needed, we can always modify the code to use a build time define so that it can be built with or without pcre JIT.

kevinhendricks commented 3 years ago

Pushed support for using pcre JIT for our C++ wrapper: SPCRE.cpp/.h

dougmassay commented 3 years ago

If needed, we can always modify the code to use a build time define so that it can be built with or without pcre JIT.

We should probably do this sooner rather than later. I wouldn't want Sigil 1.7.0 to catch package maintainers unaware (Sigil 1.6.0 is in Debian experimental). I can probably look at it this weekend. Maintainers are probably the only ones who take the time to build Sigil with the 3rd-party libs debundled.

Does anyone have a way of pinging @mapreri (Mattia) for his thoughts? Any Sigil packages in Debian/Ubuntu would probably originate with him.

kevinhendricks commented 3 years ago

FWIW, It passed our build ci via github actions and according to the build log it used its external pcre lib in its build. What distribution of Linux do we use for our github ci builds?

kevinhendricks commented 3 years ago

I will modify the code by adding a check for PCRE_NO_JIT and if defined then skip those parts of the code that enable and use pcre's JIT.

I will push that today.

eli-schwartz commented 3 years ago

https://github.com/Sigil-Ebook/Sigil/blob/3583274f74c794de76a33c1d60f2d803a219f68d/.github/workflows/linux-build.yml#L40

But it's running on x86_64, not PowerPC or sparc, so I presume it has JIT. There would probably only be problems for the auto builds on alternative architectures, which aren't easily tested in CI.

dougmassay commented 3 years ago

Our Linux CI is building on Ubuntu 18.04. I'm checking for system libs in the CI build, but not requiring it. Trying to make sure all of the 3rd-party libs are installed by the CI runner adds considerable time to the run. That said, the Ubuntu 18.04 runner appears to have pcre installed and the Sigil build is picking it up. I'm just not sure it's the stock version of libpcre since the version being reported is 8.44. Ubuntu 18.04 reports version 8.39 for libpcre16-3 in Bionic.

The Linux CI is basically geared toward making sure we're still compatible with users building on a mostly stock Ubuntu 18.04.

But as Eli said: it's probably only an issue on alternative architectures.