declaresub / abnf

Python parser generator for ABNF grammars
MIT License
41 stars 9 forks source link

`RFC3986` Path rule issue? #24

Open st235 opened 3 months ago

st235 commented 3 months ago

It seems that path should also be reordered, right?

path-abempty always matches any string as it accepts 0 iterations.

This is a snippet I am referring to:

        "path = path-abempty / path-absolute / path-noscheme / path-rootless / path-empty",
        'path-abempty = *( "/" segment )',
        'path-absolute = "/" [ segment-nz *( "/" segment ) ]',
        'path-noscheme = segment-nz-nc *( "/" segment )',
        'path-rootless = segment-nz *( "/" segment )',
        "path-empty = 0pchar",
st235 commented 3 months ago

path-empty also looks similar to path-abempty. Do you think it would be better to reorder the rules in the following order:

path-absolute / path-noscheme / path-rootless / path-abempty and remove path-empty?

declaresub commented 3 months ago

Did you encounter a problem because of this? I think the current version of abnf has enough backtracking to handle this case. Probably we need a test or two to ensure that this isn't a problem.

st235 commented 3 months ago

Did you encounter a problem because of this? I think the current version of abnf has enough backtracking to handle this case.

Hey, thank you for a swift reply!

Actually, nope, don't have any problems so far, but I have not tested for this specific scenario. It just looked peculiar when I read the ABNF notation, so I may be utterly wrong here.

Sorry if I confused you!

declaresub commented 3 months ago

You're not wrong to make this observation; there are several rules in the various RFCs akin to this one. I should add a test for this rule to ensure that the parser generated for RFC3886 does what one expects.

st235 commented 3 months ago

I saw that you reordered dec-octet and that is what I had in mind when I was googling for problems with dec-octet. This is actually how I found your repo in the first place 😅

I am talking about this part of the code

# alternatives reordered for correct matching.
'dec-octet = "25" %x30-35 / "2" %x30-34 DIGIT / "1" 2DIGIT / %x31-39 DIGIT / DIGIT',

(On a side note, really impressive project!)

Another problem within RFC3986 I am feeling is happening is path rule, and I was wondering if you had any issues with this rule or noticed something similar?

At least it seems that path-abempty and path-empty are similar when matching for an empty string. Given both of them match for it, path-abempty should always win. (not a big problem though that one of the rules is not used).

At the same time, it seems that one can match path-abempty with literally any string as it may converge to matching an empty substring, and given that the query and fragment are not the children of path rule, I am not 100% sure how it is resolved.

Take this with a grain of salt, as I may be wrong, though would be happy if it turns out to be helpful.

st235 commented 3 months ago

@dekkers I actually figured it out, sorry for the hustle. It seems that path rule itself is not used anywhere in the grammar.

Path children expressions are used in heir-part and relative-part

...
'hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty',
...
'relative-part = "//" authority path-abempty / path-absolute / path-noscheme / path-empty',

Though in these rules path-abempty is protected by terminal // symbols that help to select between alterations.

Really confusing definitions! So URI-rule should be safe and sound.

Though the issues hold for path-rule.

The snippet below (with empty string) always produces AST with one child: path-abempty, so it does not reach path-empty.

src = ''
node, start = rfc3986.Rule('path').parse(src, 0)
print(str(node))
Node(name=path, children=[Node(name=path-abempty, children=[])])