Performance degradation from swipl 8.4.3 to 9.0.x

jasuser0 commented 1 year ago

Thanks a lot for the work on swipl and the progress made for version 9!

But... when testing on version 9 with windows binary (all the same for 9.0.1 up to 9.0.4, 32bit and 64bit), my application is running, but it takes approx. double time to complete a long-running request on the console. And this hinders me to move to version 9 for the time being...

So, what can I expect for next versions, or what can I do for now, or is this already known and reported by others?

JanWielemaker commented 1 year ago

I'm not aware of this. Might be wiser to ask on Discourse as much more people follow the discussion there.

There are no known changes that should cause significant degradation. There are some small improvements and small degradations and new C compiler versions tend to be a little better, sometimes a little worse.

Running the benchmarks show now significant difference either. Possibly it is related to some specific aspect of your code? You could try the profiler on both the old and new versions so see whether some specific predicate stands out?

jasuser0 commented 1 year ago

Thanks for answering, I try using the profiler next (first time). So far I was focussing on results only, but 280000 ms on 8.4.3 versus 560000 ms on 9.0.4 surprised me (using the same code). If it is okay, I can put my findings here?

JanWielemaker commented 1 year ago

That is surely not good. I typically don't bother about 5% difference as that seems to result from good/bad luck with caching and pipelines.

If the source code is available somewhere, please share (as well as your findings).

jasuser0 commented 1 year ago

Using the profiler (I hope I did it correctly) I did not get a clear indication.

The profiler in 8.4.3 gave the top times for the following (>14sec, "profile by time self"): grafik

with 9.0.4 the profiler with the same code gave the following (>14sec, "profile by time self"): grafik

kamahen commented 1 year ago

The profiler output seems to imply that the performance problem is in library(pcre). That library has been upgraded from PCRE1 to PCRE2, which should have resulted in a performance improvement. As part of the changes, the method of caching compiled REs has changed to use SWI-Prolog's tabling - and it appears that that is a significant part of your performance problem.

Is it possible for you to use re_compile/3 instead of relying on the builtin caching? That might solve your problems. Also note that PCRE2 has various optimization options that weren't available with PCRE1 - those could result in slower compilation but faster matching.

Do you have a lot of different REs? Are they specified as atoms or strings? (@jan - is there a way of outputting statistics for the various tabled predicates in library(pcre)?)

BTW, I've found that PGO gave me big performance improvements with a heavily backtracking Prolog program (around 22%) ... if your PCRE2 library has been built with the wrong optimization parameters, that could be part of the problem. Which platform are you running on and how do you install PCRE2? (And do you have a recent version of PCRE2?)

JanWielemaker commented 1 year ago

As part of the changes, the method of caching compiled REs has changed to use SWI-Prolog's tabling - and it appears that that is a significant part of your performance problem.

Looks like it. For the old pcre interface we merely caches strings to compiled patterns. Now we have to deal with the option list as well, which takes a lot more steps. Tabling speeds this up a little, but doesn't get back to where we used to be. Possibly explicit caching based on term_hash/2 does a better job. Or reduce the tabling overhead ...

Is it possible for you to use re_compile/3 instead of relying on the builtin caching? That might solve your problems.

:+1:

is there a way of outputting statistics for the various tabled predicates in library(pcre)?)

If you install https://github.com/JanWielemaker/my-prolog-lib, you have tdump/0 to dump all tables.

Which platform are you running on and how do you install PCRE2?

@jasuser0 says Windows. Uses PGO and pcre from the MinGW project. I think the compiler used for 9.0.4 is mingw-gcc 11. Performance is some 20% lower than Linux on the same hardware. I suspect the malloc functions are a lot slower, but never really timed it.

kamahen commented 1 year ago

One thing I noticed - blob/2 takes a lot of time on both versions. This is only used in re_compiled/2, which implies a lot of lookups for regexp strings.

@jasuser0 - is your code available, so that I could try it? (I only have Linux systems; but it might be useful to see if there's different performance on Linux, because that might indicate a problem with the PCRE2 libraries on Windows)

jasuser0 commented 1 year ago

Thanks for commenting and discussing, this gave me the opportunity to better understand and check the code. I've understood that re_match (frequently used and taking much time) is one of the predicates to start with. First, I've optimised the code by merging multiple re_match into one, merging the regexs, where possible. The results gave significant improvements in excecution time of the same code in 8.4.3 (by approximately 30%): 8.4.3: [195347ms] 9.0.4: [531046ms] But obviously this is not solving the findings with 9.0.x. I'd like to tear-down the code further to find the bottleneck with using profile. See updated show_profile for both versions:

jasuser0 commented 1 year ago

One thing I noticed - blob/2 takes a lot of time on both versions. This is only used in re_compiled/2, which implies a lot of lookups for regexp strings.

@jasuser0 - is your code available, so that I could try it? (I only have Linux systems; but it might be useful to see if there's different performance on Linux, because that might indicate a problem with the PCRE2 libraries on Windows)

When having isolated the problem better in the code I am hopefully able to strip the code down to a reasonable size.

JanWielemaker commented 1 year ago

I don't know which is which, but the difference is marginal. The interesting node is probably re_compiled/3. Where that used to do a very simple job, supporting most PRCE2 features required an option list, so there is quite a bit of processing before we can call re_compile/3 to do the actual compilation. We don't want to do that over and over, so we use tabling to avoid redoing the work. Unfortunately the overhead of this is significant. It is lower than the explicit caching code used in the first version.

Are the regex calls based on literal regex patterns, i.e., appearing in the code, or are they dynamically generated? In theory we can avoid the whole overhead for re_match/3 calls where we know the pattern and options at compile time by compiling the regex and using the compiled regex in the clause. The price for that it that you cannot create a saved state of the program because the state cannot (de)serialize the unknown internal regex structures. That can probably be fixed, but that takes some effort.

If they are dynamically generated patterns, you can possibly compile them and pass the compiled one along.

Giving some code snippets of what you are doing may help to understand the problem.

JanWielemaker commented 1 year ago

Pushed SWI-Prolog/packages-pcre@fff0e6d8f08259761092808b8bd6f7e54ca02875 that allows for using precompiled regular expressions. This cuts the overhead for simple re_match/2,3 calls in 3 after using

 :- set_prolog_flag(re_compile, true).

You can get pcre.pl from the linked repo and dump it into your installation (or wait a little for the nightly build).

Note that avoiding regular expressions and using e.g. sub_atom/5, atom_concat/3, etc., is probably a lot faster when it is enough. The profiler may point you at specifically expensive calls.

jasuser0 commented 1 year ago

Well, as I understood the question is about which regex I use. Half of the calls are static in the code (1), half are dynamic (2), such as: (1) re_match('hallo',Text), re_match('(world|galaxy)',Text). (2) re_match(Filter,Text). But my understanding is that execution time on re_match alone does not differ between swipl versions any more (8.4.3 was first, 9.0.4 second)...

JanWielemaker commented 1 year ago

But my understanding is that execution time on re_match alone does not differ between swipl versions any more (8.4.3 was first, 9.0.4 second)...

Looks like it. It is the compilation caching that is way more complicated and slower (I think).

kamahen commented 1 year ago

The old PCRE had a significant bug in how it did caching of compiled regexps - it didn't properly use options in the cache look-up. Fixing this is probably why lookups in the new PCRE are slower.

For your specific situation, some lookup improvements might be possible. For example, pcre.pl has to use both the regexp and its options for caching - but if you use the same options everywhere, you could either use assertz or put the items into a rbtree for (I hope) faster lookup (or would tries be faster?). It appears that the time spent in re_compile/3 is low; the expensive operation appears to be checking the cache. This suggests to me that most of the cache lookups find something.

It might be useful to see the profiles both with cumulative(true) and cumulative(false). @jan - is there a way for @jasuser0 to dump out the raw profile information so that I could import it and poke around with show_profile/1?

@jan's change ( https://github.com/SWI-Prolog/packages-pcre/commit/fff0e6d8f08259761092808b8bd6f7e54ca02875 ) might present problems with saved states -- it shouldn't be difficult to add support, but I'd have to do a bit of investigation to make sure that compiled regexps can be used with multiple versions of PCRE2 - an alternative would be to save the regexp string when making the saved state and recompiling it on load (I think that blob save() and load() can handle this). But is there a need for handling saved states?

jasuser0 commented 1 year ago

Thank you for your guidance to profile the code and understand the changes made for swipl 9.0.x on regex.

I can now say that two improvements in my code made it possible for me to change to 9.0.4 now, without the massive absolute execution time increase as before.

Change 1 was: "\+ re_match('(world|galaxy)',Text)" instead of "\+ re_match('world',Text), \+ re_match('galaxy',Text)"

Change 2 was: "re_match('(world|galaxy)',Text) -> re_replace('world','home',Text,Text2), re_replace('galaxy','milkyway',Text2,Text3); Text3=Text" instead of "re_replace('world','home',Text,Text2), re_replace('galaxy','milkyway',Text2,Text3)"

The runtime results are much better now: With 8.4.3: 98044ms With 9.0.4: 166980ms (still 50% more than 8.4.3, but only 25% of before with 9.0.4, and only 50% of before with 8.4.3)

So that's fine for me now 🙂!

JanWielemaker commented 1 year ago

Closing. Related to library(pcre) only. Regex compilation has been added to address this issue.

SWI-Prolog / swipl-devel

Performance degradation from swipl 8.4.3 to 9.0.x #1115