Closed davisjam closed 5 years ago
I believe this is due to the recursive nature of the parsing.
regexp.bnf includes:
Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
capturingGroupsCount++
// Create Node
}
Because we resolve the language recursively (I assume?), we update capturingGroupsCount
and create the Node
for inner groups before outer groups.
Given that inner nodes resolve before outer ones, I don't see a good way to address this issue in the parser. I think the cleanest solution might be to make a second pass over the AST after it is built and do capture group re-numbering at that point.
Thus I would suggest a change in either:
regexpTree.parse
regexpTreeParser.parser
One advantage of doing it in src/regex-tree.js
is that you have access to traverse
at that point.
But since parser
is exported by src/regex-tree.js
perhaps it should be hidden inside the parser.
Here's how I would do left-to-right numbering in src/regexp-tree.js
. This doesn't fix backreferences though, see below.
diff --git a/src/regexp-tree.js b/src/regexp-tree.js
index 15c9844..1d5fcc0 100644
--- a/src/regexp-tree.js
+++ b/src/regexp-tree.js
@@ -51,7 +51,22 @@ const regexpTree = {
* @return Object AST
*/
parse(regexp, options) {
- return parser.parse(`${regexp}`, options);
+ let ast = parser.parse(`${regexp}`, options);
+
+ let nextCaptureGroup = 1;
+ traverse.traverse(ast, {
+ 'Group': {
+ pre({node}) {
+ if (node.capturing) {
+ node.number = nextCaptureGroup;
+ nextCaptureGroup++;
+ }
+ }
+ }
+ }
+ );
+
+ return ast;
This behavior also introduces bugs in the interpretation of backreferences.
Consider:
((a)\2)
In regexp-tree
this backreference would be interpreted as referring to the outer group (see regexp.bnf excerpt below), which would not yet have been defined. So it will be rendered as a Char
instead.
/**
* Parses patterns like \1, \2, etc. either as a backreference
* to a group, or a deciaml char code.
*/
function GroupRefOrDecChar(text, textLoc) {
const reference = Number(text.slice(1));
if (reference > 0 && reference <= capturingGroupsCount) {
return Node({
type: 'Backreference',
kind: 'number',
number: reference,
reference,
}, textLoc);
}
return Char(text, 'decimal', textLoc);
}
Incorrectly interpreting \2
as a not-yet-defined backreference:
(15:41:16) jamie@woody ~/Desktop/floss/regexp-tree $ ./bin/regexp-tree --expression '/((a)\2)/'
{
"type": "RegExp",
"body": {
"type": "Group",
"capturing": true,
"number": 2,
"expression": {
"type": "Alternative",
"expressions": [
{
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Char",
"value": "a",
"kind": "simple",
"symbol": "a",
"codePoint": 97
}
},
{
"type": "Char",
"value": "\\2",
"kind": "decimal",
"symbol": "\u0002",
"codePoint": 2
}
]
}
},
"flags": ""
}
And incorrectly identifying the inner group as backreference 1:
$ ./bin/regexp-tree --expression '/((a)\1)/'
{
"type": "RegExp",
"body": {
"type": "Group",
"capturing": true,
"number": 2,
"expression": {
"type": "Alternative",
"expressions": [
{
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Char",
"value": "a",
"kind": "simple",
"symbol": "a",
"codePoint": 97
}
},
{
"type": "Backreference",
"kind": "number",
"number": 1,
"reference": 1
}
]
}
},
"flags": ""
}
I don't see a good fix for the "misinterpret \2 as a Char group" unless we just label backreferences as "possible backreference" and fix them during my proposed post-processing pass.
@davisjam, good catches! I'll try to take a look into this soon. Ideally we'll want to avoid double-passes over the AST at parsing level. Hopefully, it'll be possible; if not, yes, we'll need to have the second pass.
Yeah, we'll need to do a post-parse processing, but instead of doing the full pass of the whole AST, we'll just need to patch the capturing Group
.
For this, during the building the AST, we can collector the Group
nodes to some set, and then just patch each node.
Something like this:
const groupNodesToPatch = new Set();
...
Capturing Group: ... | L_PAREN Disjunction R_PAREN
{
capturingGroupsCount++
// 1. Create Node
// 2. Save node to update its index in post-processing
if (node.capturing) {
groupNodesToPatch.add(node);
}
}
And later on:
for (const group of groupNodesToPatch) {
node.number = capturingGroupsCount - groupNodesToPatch.number;
}
I'll try to reach it later, and will appreciate a PR in case you get earlier to it.
I took an approach along these lines in #165. I can't quite see how to make it work, because complete group number inversion is not correct either. Let's move the discussion on a fix to #165.
The regex
/(((a)b)c)/
contains 3 groups. Nested groups are numbered left-to-right by left parentheses. So backreference\3
matches "a", backreference\2
matches "ab", and backreference\1
matches "abc".For example:
I am running regexp-tree v0.0.85.
It looks to me like nested groups are numbered in reverse, with the outermost given the largest number and the innermost given the smallest.