Forever-Young / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg
0 stars 0 forks source link

findall() broken (seems like memory corruption) #101

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run python -c "import regex; flags = regex.FULLCASE | regex.UNICODE; pat = 
regex.compile(r'xxx', flags=flags); print ([x.group() for x in 
pat.finditer('yxxx')]); print (pat.findall('yxxx'))"
2. The flags are necessary

What is the expected output? What do you see instead?
Expected output:
['xxx']
['xxx']

Actual output:
['yxx']
['xxx']

What version of the product are you using? On what operating system?
On linux 64bit the finiter() expression produces the correct result and the 
findall() does not. On windows 32bit it is the reverse, i.e. finditer() works, 
findall() does not.

Since this appears to be memory related, it may depend on the alyout of the 
code in memory, in which case you may not be able to reproduce it easily.

Original issue reported on code.google.com by adse...@calibre-ebook.com on 31 Dec 2013 at 8:55

GoogleCodeExporter commented 9 years ago
This is with the current source checkout of regex. 

Note that changing the test to "import regex; flags = regex.FULLCASE | 
regex.UNICODE; pat = regex.compile(r'xxx', flags=flags); raw='yxxx'; print 
([x.group() for x in pat.finditer(raw)]); print (pat.findall(raw))"

Which keeps a reference to the string being searched on, causes both finditer 
and findall to fail on windows 32 and no change in linux 64.

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 9:14

GoogleCodeExporter commented 9 years ago
Using unicode string literals also does not affect the failure. (I am testing 
on python 2.7)

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 9:16

GoogleCodeExporter commented 9 years ago
Passing in the IGNORECASE flag causes the matching to work, presumably because 
it causes a memcpy?

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 9:36

GoogleCodeExporter commented 9 years ago
Similarly, passing in the REVERSE flag also causes it to work.

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 9:45

GoogleCodeExporter commented 9 years ago
It looks like req)pos and req_end are not initialized in state_init_2()

I am still testing but adding

state->req_pos = -1;

in state_init_2() fixes my test case. However, I dont understand the code well 
enough. The bug happens when the value state->req_pos gets happens to be 0.

Is -1 the right value to initialize it to. What about req_end?

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 12:04

GoogleCodeExporter commented 9 years ago
The following diff fixes all instances of this bug that I was able to 
reproduce. However, I do not know if the fix is correct or complete.

diff --git a/src/regex/_regex.c b/src/regex/_regex.c
index 43f2135920..4418ccd817 100644
--- a/src/regex/_regex.c
+++ b/src/regex/_regex.c
@@ -13232,6 +13232,7 @@ Py_LOCAL_INLINE(BOOL) state_init_2(RE_State* state, 
PatternObject* pattern,

     state->groups = NULL;
     state->repeats = NULL;
+    state->req_pos = -1;
     state->visible_captures = visible_captures;
     state->match_all = match_all;
     state->backtrack_block.previous = NULL;

Original comment by adse...@calibre-ebook.com on 31 Dec 2013 at 12:14

GoogleCodeExporter commented 9 years ago
You are correct. state->req_pos wasn't initialised.

Fixed in regex 2013-12-31.

Original comment by re...@mrabarnett.plus.com on 31 Dec 2013 at 8:42