Open jayeheffernan opened 5 years ago
@albertodemichelis I've just found a forum post of yours which seems to indicate some known and ongoing issues with the regexp library. You mention there that the only real fix would be a rewrite. Would you consider that? Maybe see if there's an existing implementation that would fit the requirements? SLRE seems like a good candidate: it has a simple API that seems to match what Squirrel needs, implements a similar feature set and is a little smaller than the current implementation in terms of code space, uses no dynamic memory allocation, and is "easily modifiable for custom needs".
file sss.cc:
function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*bc");
local strs = [ "a bcd", "a bc bcd", "a b bcd", ];
foreach (fff in strs)
{
re = regexp("a.*bc");
print(formatCapture(re.capture(fff)) + "\n");
}
returns:
[root@etchosts test_scripts]# ../bin/sq sss.cc [0,4] (4) [0,7] (7) [0,6] (6)
====================================================== file sss1.cc:
function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*bc");
local strs = [ "a bcd", "a bc bcd", "a b bcd", ];
foreach (fff in strs) { print(formatCapture(re.capture(fff)) + "\n"); }
returns: [root@etchosts test_scripts]# ../bin/sq sss1.cc [0,4] (4) [0,4] (4) [0,4] (4)
============================================================= file sss2.cc:
function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*?bc");
local strs = [ "a bcd", "a bc bcd", "a b bcd", ];
foreach (fff in strs)
{
re = regexp("a.*?bc");
print(formatCapture(re.capture(fff)) + "\n");
}
reload re= regexp("a.*?bc"); is ok. It seem bug exist in cache?
regular expression object should be reinitialized, that will be ok! the new sqstdrex.cpp located in https://github.com/renshijun/modified_files.
The problem have been solved. refer to https://github.com/mingodad/squilu/issues/3
I've noticed that regexps with "greedy closures" (
*
) don't work properly. Matches can be shorter than expected, or matching can totally fail when it should succeed.Example Code
The example tests the simple regexp
/a.*bc/
against 3 different strings.Example Output
Actual
I have run the example code above with Squirrel 3.1. Note that the second match is the same length as the first, and that the last string does not match.
Expected
I would expect the second match to be longer than the first, and for all strings to match.
What's Wrong?
"a bc bc"
The
.*
in the regexp should cause itsbc
to match the lastbc
in the string, because it should greedily capture past the firstbc
. Instead, the firstbc
is matched, so that the entire span of the match is shorter than expected.We might expect to be able to force the second
bc
to be matched by adding a$
anchor: changingregexp("a.*bc")
toregexp("a.*bc$")
. Actually, this causes the string to not match at all (it should still match)."a b bc"
Similarly, the
.*
should greedily match over theb
in the middle of this string, andbc
should matchbc
. Instead, the regexp does not match this string at all.I think this could have something to do with the implementation of backtracking. It seems like maybe the
b
in the regexp is matched to the firstb
in the string, and then thec
does not match, and it fails to try other alternatives for where theb
might match.Comparison to Other Engines
My familiarity with regexps says that this behaviour is incorrect, but I have checked to confirm with the regexp implementations in Vim and Javascript. Both behave as expected.
Example Javascript Code
Here's some (roughly) equivalent example code in Javascript, which you can run by pasting into the console, or downloading to run with Node.js. Both will give the same output.
Javascript Output
As expected, all 3 strings match, and the 2nd and 3rd matches are longer than the first due to the greediness of the
.*
.