J-F-Liu / pom

PEG parser combinators using operator overloading without macros.
MIT License
496 stars 30 forks source link

Improper handling of `Incomplete` #14

Closed jsgf closed 6 years ago

jsgf commented 7 years ago

Pom does not handle propagation of incomplete input properly. If the input is too short to completely parse an input, it must immediately return an Incomplete error, and not attempt to continue parsing.

For example, the BitOr operator does not pay attention to why the left side failed to parse - it always goes on to try the right. If the left failed to parse because it returned Incomplete, then it should immediately return the same without trying to parse the right. Otherwise a parser of the form: seq(b"foobar") | seq(b"blitblat") will fail on the input b"foo", complaining that it expected a 'b' but got an 'f', because it improperly failed over to the right side when the left-hand parse was indeterminate.

Similarly, repeat() has:

            loop {
//...
                if let Ok(item) = self.parse(input) {
                    items.push(item)
                } else {
                    break;
                }
            }
            if let Included(&start) = range.start() {
                if items.len() < start {
                    input.jump_to(start_pos);
                    return Err(Error::Mismatch {
                        message: format!("expect repeat at least {} times, found {} times", start, items.len()),
                        position: start_pos,
                    });
                }
            }

which means it ends up reporting Error::Mismatch even if underlying parser returns Incomplete.

This problem seems common throughout the codebase; I think essentially every place that handles errors needs to special case Incomplete. Perhaps it shouldn't be represented as an error at all.

Without this, pom is not useful for processing input that comes in incrementally in buffers that may or may not contain a complete parsable phrase.

J-F-Liu commented 7 years ago

If so, seq(b"foobar") | seq(b"foo") will fail on the input b"foo" returns Incomplete. Processing input that comes in incrementally in buffers should have an additional flag indicating whether it is complete or not, should not leave the judgement to syntax parsing.

jsgf commented 7 years ago

If so, seq(b"foobar") | seq(b"foo") will fail on the input b"foo" returns Incomplete. Yes, I think that's fine - it's ambiguous in the presence of partial input. You need some additional "end of input" token to make it unambiguous. Without that the Incomplete error is not very useful.

Processing input that comes in incrementally in buffers should have an additional flag indicating whether it is complete or not, should not leave the judgement to syntax parsing.

Are you proposing something like an alternative set of parser combinators with a return type of Option<V> which successfully parse incomplete input as None, and complete input as Some<V>?

Is there some way to make an alt commit to a certain branch? If I have something like:

//...
        | seq(b"clonebundles\n") * params(0).map(|_| Request::Clonebundles)
        | seq(b"capabilities\n") * params(0).map(|_| Request::Capabilities)
        | seq(b"changegroup\n") * params(1)
//...

then if I see a complete keyword ("clonebundles"), then there's no point in trying any other alternative, because no others will match - any parse error is an error in the rest of this particular arm. That would at least help localize the errors for truncated input, rather than always reporting the error in the last arm of the alternatives.

J-F-Liu commented 7 years ago

It is practical not to try other alternatives if one arm already partially matched, but we need a formal definition of this new behavior at PEG level.

J-F-Liu commented 7 years ago

I thought out an algorithm: calculate common prefix tree of all arms, match input against prefix tree, then will able to produce more accurate error messages. For example, for

Rule = abcd
     | abde
     | acef

The prefix tree is:

    c
   /
  b
 / \
a   d
 \
  c

The difficult part is handling non-terminals in the rule.

Lehona commented 6 years ago

Is there a way to manually define "expectation points" (I think that's what they're called in Boost::Spirit)? E.g. when my parser sucessfully parses "fn" I expect the following parsers to succeed - if not, stop parsing. Although automatic detection of expectation points is surely nice, I think a manual way to do it would already help a lot... I'm currently using v1, so if there is such a thing in the beta: Nice! I couldn't find any documentation for it so far, so I can't use it right now :/

Edit : I've defined my own, crude, expectation-point trait:

pub trait ExpectParser<'a, I, O> where I: 'static, O: 'static {
    fn expect(self, msg: &'static str) -> Parser<'a, I, O>;
}

impl<'a, I: 'static, O: 'static> ExpectParser<'a, I, O> for Parser <'a, I, O> {
    fn expect(self, msg: &'static str) -> Parser<'a, I, O>
        where I: 'static
    {
        Parser::new(move |input: &mut Input<I>| {
            Ok(self.parse(input).expect(msg))
        })
    }
}

It panics on an unsuccessful parse, but I don't think there's a way to gracefully fail parsing without heavily modifying error propagation in pom.

J-F-Liu commented 6 years ago

@Lehona Can you give some use cases of expect()?

Lehona commented 6 years ago
pub fn declaration() -> Parser<'static, u8, Identifier) {
      seq(b"var")  * (identifier() - sym(b';')).expect("failed to parse var")
    | seq(b"func") * (identifier() - sym(b';')).expect("failed to parse func")
}

I will edit my post later to give a better example from my own code (I'm currently travelling), but it should get the point across. identifier() is some identifier parser - e.g. only alphanumeric, starting with a letter.

Imagine a language where you can either declare variables or functions in the global scope. Now consider this input:

var foo bar;

It won't parse successfully (because there are two identifiers instead of one) but without the .expect() the error message will say that it failed to parse correctly because it expected an 'f' and got a 'v'. Not very helpful, especially because that part was well-formed and not actually a syntax error.

This gets much worse with larger grammars, it will say it failed to parse (e.g.) a function when it was actually just a malformed number or something similar.

J-F-Liu commented 6 years ago

So the purpose is to abort early in ordered choice. I'll implement in this weekend.

Lehona commented 6 years ago

Yes, exactly! Thanks a lot. I felt really bad about abusing exceptions/panics as some form of control flow.

J-F-Liu commented 6 years ago

See commit ff6766e.