PCRE2Project / pcre2

PCRE2 development is now based here.
Other
910 stars 191 forks source link

Case-insensitive search gets exponentially bad with long buffers #3

Closed PhilipHazel closed 3 years ago

PhilipHazel commented 3 years ago

This is bug 2793 from the old Bugzilla, posted by Thomas Tempelmann, who, after some discussion, provided a proposed patch. Here is some relevant discussion and the patch:

Here's what probably happens:

  1. Fast Scan for "E". Takes long because it's down 5 MB
  2. Now it goes back to the start, finds "e" and checks the rest of the search string.
  3. No match. So, it moves forward.

At this point in the loop, instead of going on with step 2, it goes back to step 1, where it again searches ahead for 5 MB until it runs into the first "E".

I can think of several remedies:

Change the fast scan to include searching all possible options. In my example, it has to scan for both "e" and "E". I assume this benefits by using a specialized CPU instructions that can scan for a byte (because, if not, you'd simply do a loop where you get one byte and check it then against both "e" and "E")? So, what you'd need to do is to use that search operation in small ranges, e.g. over 1000 bytes, looking for "e", and then the same 1000 bytes looking for "E". If none hits, move forward. But if one hits, the nearer one is processed (and the farther one's position can be cached so that you won't need to search again for it until you've moved there).

But currently, instead, I suspect it's searching for "E" and eventually gives up when it reaches the cut-off point you mentioned. But then repeats the same long search again and again.

So, the next possible optimization, which may be much easier to implement than the first suggestion, is to simply cache the point at which the "E" was found or not, and then not repeat looking for "E" before that point.

Actually, can you tell me where this happens (if you don't have time to look now, can you give me some pointers where to look)? I like to try the caching myself, it shouldn't be too hard I hope.

Thru the macro option to suppress the fast scan, I located the relevant code areas. Around line 6800 in pcre2_match.c the comment explains that, in caseless mode, it does indeed consider looking for both cases.

Alright, it's as I suspected:

First off, the code already does what I suggested to do: Scan for both "e" and "E" and then process the nearer one.

The "bug" is that the found locations are not cached, so the next time both chars are searched again from the current position, even if it has already been determined that there's no such char for a while.

Adding some caching for both found locations should fix this. 1.

Change

BOOL memchr_not_found_first_cu; BOOL memchr_not_found_first_cu2;

into

PCRE2_SPTR memchr_found_first_cu; PCRE2_SPTR memchr_found_first_cu2;

2.

Change

memchr_not_found_first_cu = FALSE; memchr_not_found_first_cu2 = FALSE;

into

memchr_found_first_cu = NULL; memchr_found_first_cu2 = NULL;

3.

Change

      if (!memchr_not_found_first_cu)
        {
        pp1 = memchr(start_match, first_cu, end_subject - start_match);
        if (pp1 == NULL) memchr_not_found_first_cu = TRUE;
          else cu2size = pp1 - start_match;
        }

      /* If pp1 is not NULL, we have arranged to search only as far as pp1,
      to see if the other case is earlier, so we can set "not found" only
      when both searches have returned NULL. */

      if (!memchr_not_found_first_cu2)
        {
        pp2 = memchr(start_match, first_cu2, cu2size);
        memchr_not_found_first_cu2 = (pp2 == NULL && pp1 == NULL);
        }

into

      if (start_match <= memchr_found_first_cu) {
        pp1 = memchr_found_first_cu;
        if (pp1 == end_subject) {
          pp1 = NULL;
        }
      } else {
        pp1 = memchr(start_match, first_cu, cu2size);
        if (pp1 == NULL) {
            memchr_found_first_cu = end_subject;
        } else {
            memchr_found_first_cu = pp1;
        }
      }

      /* If pp1 is not NULL, we have arranged to search only as far as pp1,
      to see if the other case is earlier, so we can set "not found" only
      when both searches have returned NULL. */

      if (start_match <= memchr_found_first_cu2) {
        pp2 = memchr_found_first_cu2;
        if (pp2 == end_subject) {
          pp2 = NULL;
        }
      } else {
        pp2 = memchr(start_match, first_cu2, cu2size);
        if (pp2 == NULL) {
          memchr_found_first_cu2 = end_subject;
        } else {
          memchr_found_first_cu2 = pp2;
        }
      }

This means two changes to the algorithm:

  1. Instead of using a flag to tell whether memchr() found something, it'll now store the last found position and re-use that as long as the start_match pointer is still behind.

  2. The size parameter for the second memchr() is not getting reduced any more (formerly, if the first found a location, the second one would be limited to search until there). Since we now cache each location, there's no need to shorten the searches any more.

PhilipHazel commented 3 years ago

I have applied a modified version of this patch to both interpreters.