SWI-Prolog / packages-pcre

SWI-Prolog package for access to Perl Regular Expressions
5 stars 7 forks source link

ENHANCED: re_config/1 backtracks through all possible values #16

Closed kamahen closed 2 years ago

kamahen commented 2 years ago

This is a follow-on to https://github.com/SWI-Prolog/packages-pcre/pull/15

kamahen commented 2 years ago

Please don't merge this PR yet -- I fixed a bug in compare_pcres() that's been there for years (a comparison was backwards) and I should add a regression test for it.

JanWielemaker commented 2 years ago

Thanks

Please don't merge this PR yet

That probably also holds for the other today's PR's, no?

kamahen commented 2 years ago

I've force-pushed new code with a regression test. This PR is ready to review.

JanWielemaker commented 2 years ago

Thanks. Merged and added a patch for the comparison and writing of of Unicode patterns. For example:

?- re_compile('хелло', Aap, []), re_compile('noot', Noot, []), compare(D, Aap, Noot).
Aap = <regex>(0x55c1acc5dc00, /хелло/),
Noot = <regex>(0x55c1acc5dc70, /noot/),
D =  (>).

PL_atom_chars() for historical reasons simply returns a pointer to the blob content of the atom. PL_atom_wchars() allows extracting as a wide character string for text atoms and thus supports any atom.

kamahen commented 2 years ago

Thanks for the patch. However, it seems to have left a memory leak which only shows with non-ASCII strings (e.g., хелло).

re_compile("網目錦へび [àmímé níshíkíhéꜜbì]", Z1, []), re_compile("網目錦へび [àmímé níshíkíhéꜜbì]", Z2, []), compare(C, Z1,Z2).
Z1 = <regex>(0x60b0000161a0, /網目錦へび [àmímé níshíkíhéꜜbì]/),
Z2 = <regex>(0x60b000016300, /網目錦へび [àmímé níshíkíhéꜜbì]/),
C =  (<).

?- 

% halt
Running LSAN memory leak check (reclaim_memory=1)

=================================================================
==1209745==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 472 byte(s) in 1 object(s) allocated from:
    #0 0x7f93e4061808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x7f93e3b8bedd in PL_malloc ../src/pl-alloc.c:1169
    #2 0x7f93e3c18a74 in PL_canonicalise_text ../src/os/pl-text.c:1088
    #3 0x7f93e3c13233 in textToAtom ../src/os/pl-text.c:414
    #4 0x7f93e3c138a9 in unify_text___LD ../src/os/pl-text.c:468
    #5 0x7f93e3c148e9 in PL_unify_text ../src/os/pl-text.c:628
    #6 0x7f93e3a4cb33 in pl_raw_read2 ../src/pl-read.c:4870
    #7 0x7f93e39104e8 in PL_next_solution___LD ../src/pl-vmi.c:4626
    #8 0x7f93e3a0a24e in query_loop ../src/pl-pro.c:147
    #9 0x7f93e3a0bb84 in prologToplevel ../src/pl-pro.c:496
    #10 0x7f93e3bb8539 in PL_toplevel ../src/pl-fli.c:4564
    #11 0x5597a7eca280 in main ../src/pl-main.c:143
    #12 0x7f93e365b0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x240b2)
JanWielemaker commented 2 years ago

That looks like a memory leak in read for Unicode atoms. I'll check. Thanks for reporting.

JanWielemaker commented 2 years ago

Strangely enough it was not reported here. The stack trace was good enough to spot it though. Luckily only affects reading for the toplevel of Unicode queries. Fixed. Also pushed a patch for pcre to remove the O_DEBUG stuff. The release version triggered a use-after-free because the pattern may be gone before the regular expression during process cleanup. In this case not doing the print is fine. In general I see there could be cases where foreign code would like to control the order in which atoms are reclaimed during shutdown. This is hard though.