albertodemichelis / squirrel

Official repository for the programming language Squirrel
http://www.squirrel-lang.org
MIT License
913 stars 156 forks source link

Greedy regexp bugs #181

Open jayeheffernan opened 5 years ago

jayeheffernan commented 5 years ago

I've noticed that regexps with "greedy closures" (*) don't work properly. Matches can be shorter than expected, or matching can totally fail when it should succeed.

Example Code

The example tests the simple regexp /a.*bc/ against 3 different strings.

// Function to format a regexp capture result into a nice readable string
function formatCapture(cap) {
    if (cap == null) return "no match";
    // If there is a match, show the start and end index, and the length
    else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin);
}

local re = regexp("a.*bc");
local strs = [
    "a bc",
    "a bc bc",
    "a b bc",
];

foreach (str in strs)
    print(formatCapture(re.capture(str)) + "\n");

Example Output

Actual

I have run the example code above with Squirrel 3.1. Note that the second match is the same length as the first, and that the last string does not match.

[0,4] (4)
[0,4] (4)
no match

Expected

I would expect the second match to be longer than the first, and for all strings to match.

[0,4] (4)
[0,7] (7)
[0,6] (6)

What's Wrong?

"a bc bc"

The .* in the regexp should cause its bc to match the last bc in the string, because it should greedily capture past the first bc. Instead, the first bc is matched, so that the entire span of the match is shorter than expected.

We might expect to be able to force the second bc to be matched by adding a $ anchor: changing regexp("a.*bc") to regexp("a.*bc$"). Actually, this causes the string to not match at all (it should still match).

"a b bc"

Similarly, the .* should greedily match over the b in the middle of this string, and bc should match bc. Instead, the regexp does not match this string at all.

I think this could have something to do with the implementation of backtracking. It seems like maybe the b in the regexp is matched to the first b in the string, and then the c does not match, and it fails to try other alternatives for where the b might match.

Comparison to Other Engines

My familiarity with regexps says that this behaviour is incorrect, but I have checked to confirm with the regexp implementations in Vim and Javascript. Both behave as expected.

Example Javascript Code

Here's some (roughly) equivalent example code in Javascript, which you can run by pasting into the console, or downloading to run with Node.js. Both will give the same output.

// Function to format a regexp capture result into a nice readable string
function formatCapture(cap) {
    if (cap == null) return "no match";
    // If there is a match, just show the length of the match
    else return "" + cap[0].length;
}

const re = /a.*bc$/;
const strs = [
    "a bc",
    "a bc bc",
    "a b bc",
];

for (str of strs)
    console.log(formatCapture(re.exec(str)));

Javascript Output

4
7
6

As expected, all 3 strings match, and the 2nd and 3rd matches are longer than the first due to the greediness of the .*.

jayeheffernan commented 5 years ago

@albertodemichelis I've just found a forum post of yours which seems to indicate some known and ongoing issues with the regexp library. You mention there that the only real fix would be a rewrite. Would you consider that? Maybe see if there's an existing implementation that would fit the requirements? SLRE seems like a good candidate: it has a simple API that seems to match what Squirrel needs, implements a similar feature set and is a little smaller than the current implementation in terms of code space, uses no dynamic memory allocation, and is "easily modifiable for custom needs".

renshijun commented 4 years ago

file sss.cc:

function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*bc");

local strs = [ "a bcd", "a bc bcd", "a b bcd", ];

foreach (fff in strs) { re = regexp("a.*bc");
print(formatCapture(re.capture(fff)) + "\n"); }

returns:

[root@etchosts test_scripts]# ../bin/sq sss.cc [0,4] (4) [0,7] (7) [0,6] (6)

====================================================== file sss1.cc:

function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*bc");

local strs = [ "a bcd", "a bc bcd", "a b bcd", ];

foreach (fff in strs) { print(formatCapture(re.capture(fff)) + "\n"); }

returns: [root@etchosts test_scripts]# ../bin/sq sss1.cc [0,4] (4) [0,4] (4) [0,4] (4)

============================================================= file sss2.cc:

function formatCapture(cap) { if (cap == null) return "no match"; // If there is a match, show the start and end index, and the length else return format("[%d,%d] (%d)", cap[0].begin, cap[0].end, cap[0].end - cap[0].begin); } local re = regexp("a.*?bc");

local strs = [ "a bcd", "a bc bcd", "a b bcd", ];

foreach (fff in strs) { re = regexp("a.*?bc");
print(formatCapture(re.capture(fff)) + "\n"); }

returns: [root@etchosts test_scripts]# ../bin/sq sss2.cc [0,4] (4) [0,4] (4) [0,6] (6)

reload re= regexp("a.*?bc"); is ok. It seem bug exist in cache?

renshijun commented 4 years ago

regular expression object should be reinitialized, that will be ok! the new sqstdrex.cpp located in https://github.com/renshijun/modified_files.

renshijun commented 4 years ago

The problem have been solved. refer to https://github.com/mingodad/squilu/issues/3