badicsalex / peginator

PEG parser generator for creating ASTs in Rust
MIT License
34 stars 3 forks

How to use custom char check? #3

Closed. oovm closed this issue 2 years ago

oovm commented 2 years ago

My identifiers are defined as follows:

@string
@no_skip_ws
Ident = (XID_START | '_') (XID_CONTINUE)*

where XID_START represents an external function UnicodeXID::is_xid_start.

How should I capture my Ident token?

badicsalex commented 2 years ago

Unfortunately this can only be done with quite a bit of hacking currently. Maybe I should implement proper "external function" support.

But if you really want to do it right now, there is a way.

You have to manually create parse_XID_START and parse_XID_CONTINUE functions in the same namespace as the compiled grammar (this should be easy if you use the macro, a bit harder if you use a buildscript).

Something like the following:

// Brings is_xid_start / is_xid_continue into scope for char.
use unicode_xid::UnicodeXID;

peginate!("
@export
Idents = {idents:Ident};

@string
@no_skip_ws
Ident = (XID_START | '_') {XID_CONTINUE};
");

pub fn parse_XID_START<'a, _CT>(
    state: ParseState<'a>,
    _tracer: impl ParseTracer,
    _cache: &_CT,
) -> ParseResult<'a, char> {
    // Boilerplate
    let result = state.s().chars().next().ok_or_else(|| {
        state
            .clone()
            .report_error(ParseErrorSpecifics::Other)
    })?;

    // Actual business logic
    if !result.is_xid_start() {
        return Err(state.report_error(ParseErrorSpecifics::Other));
    }

    // More boilerplate
    // We are skipping a full character, so we should be OK.
    let state = unsafe { state.advance(result.len_utf8()) };
    Ok(ParseOk {
        result,
        state,
        farthest_error: None,
    })
}

pub fn parse_XID_CONTINUE<'a, _CT>(
    state: ParseState<'a>,
    _tracer: impl ParseTracer,
    _cache: &_CT,
) -> ParseResult<'a, char> {
    // Boilerplate
    let result = state.s().chars().next().ok_or_else(|| {
        state
            .clone()
            .report_error(ParseErrorSpecifics::Other)
    })?;

    // Actual business logic
    if !result.is_xid_continue() {
        return Err(state.report_error(ParseErrorSpecifics::Other));
    }

    // More boilerplate
    // We are skipping a full character, so we should be OK.
    let state = unsafe { state.advance(result.len_utf8()) };
    Ok(ParseOk {
        result,
        state,
        farthest_error: None,
    })
}

#[test]
fn test_macro() {
    let s = Idents::parse("xyz áé8").unwrap();
    assert_eq!(s.idents, vec!["xyz", "áé8"]);
}
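The boilerplate in both functions follows one pattern: peek at the next character, test it with a predicate, and advance by that character's UTF-8 length. A minimal std-only sketch of that pattern, independent of peginator's types, might look like the following. The names `match_char` and `match_ident` are hypothetical, and `char::is_alphabetic` / `char::is_alphanumeric` are only stand-ins for `is_xid_start` / `is_xid_continue`, so the accepted character sets differ somewhat from the real Unicode XID classes.

```rust
/// Match one character satisfying `pred` at the start of `input`.
/// Returns the character and the number of bytes it occupies.
fn match_char(input: &str, pred: impl Fn(char) -> bool) -> Option<(char, usize)> {
    let c = input.chars().next()?;
    if pred(c) {
        Some((c, c.len_utf8()))
    } else {
        None
    }
}

/// Hand-written equivalent of `Ident = (XID_START | '_') {XID_CONTINUE}`.
/// Returns the matched identifier and the remaining input.
fn match_ident(input: &str) -> Option<(&str, &str)> {
    // Stand-in for `is_xid_start`; an assumption made for this sketch.
    let (_, mut end) = match_char(input, |c| c.is_alphabetic() || c == '_')?;
    // Stand-in for `is_xid_continue`; also an assumption.
    while let Some((_, len)) =
        match_char(&input[end..], |c| c.is_alphanumeric() || c == '_')
    {
        end += len;
    }
    Some(input.split_at(end))
}
```

Because `end` only ever grows by `len_utf8()`, it always lands on a character boundary, which is the same property the `unsafe { state.advance(...) }` calls above rely on.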

badicsalex commented 2 years ago

I understand the above is not convenient. What if I implemented a syntax like this:

@custom_char(crate::some_module::check_xid)
XID_START;

And then in some_module.rs, you could have a function like this:

use unicode_xid::UnicodeXID;

fn check_xid(c: char) -> bool {
    c.is_xid_start()
}

Maybe even use the unicode_xid directly:

@custom_char(unicode_xid::UnicodeXID::is_xid_continue)
XID_CONTINUE;

Would it fit your use-case?

oovm commented 2 years ago

This hack meets my needs.

If it were stabilized as a feature, I would like it to look like this:

@custom_char(char_xid_start) // advance 1 char
XID_START = 'ANY';  // annotative description, do not use
@custom_string(keyword_where, 5) // advance 5 chars 
WHERE = 'case insensitive where'; // annotative description, do not use

@check_string(keyword_checker)
KEYWORD = Ident; // Requires successful capture of Ident and keyword_checker to return true

with the function signatures:

fn char_xid_start(char) -> bool;
fn keyword_where(&str) -> bool;
fn keyword_checker(&str) -> bool;

badicsalex commented 2 years ago

The syntax I'm currently thinking about is:

@char
@check(unicode_xid::UnicodeXID::is_xid_start)
XID_START = char; # In this case "char" is actually used

@extern(crate::keyword_where -> String)
WHERE; # no body, prefer comments

@check(crate::keyword_check)
KEYWORD = Ident;

There would be two new additions:

@check directive

The function gets whatever the rule produces (a char for @char rules, a string or struct for string or struct rules) and should return a bool. So fn char_xid_start(char) -> bool and fn keyword_checker(&str) -> bool fit here, but you could also run checks on more complex structures with multiple fields in the middle of parsing.
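As a rough illustration of the kind of function `@check` would accept, here is one possible body for the `keyword_checker` signature mentioned in this thread. The keyword list is invented for the example, and nothing in this sketch is peginator API; it is just a plain `&str -> bool` predicate.

```rust
/// Hypothetical checker: accepts an already-parsed identifier only if it is
/// one of a fixed set of keywords, compared ASCII case-insensitively.
fn keyword_checker(ident: &str) -> bool {
    // Made-up keyword list, for illustration only.
    const KEYWORDS: [&str; 3] = ["where", "select", "from"];
    KEYWORDS.iter().any(|kw| kw.eq_ignore_ascii_case(ident))
}
```

The function only sees the value the rule already produced, so it cannot change how much input is consumed; it can only accept or reject the match.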

@extern directive

It is a completely external parse function with the following signature:

fn custom_fn(&str) -> Result<(T, usize), &'static str>

If the string parses successfully, you return a tuple of the result and the number of bytes (!) the parser consumed from the input, wrapped in Ok. If it cannot be parsed according to the rule, you return a static error message string wrapped in Err.

In case of the keyword where, it would probably look something like this:

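fn keyword_where(s: &str) -> Result<(String, usize), &'static str> {
    // Look only at the first five characters; the input continues after the keyword.
    let result: String = s.chars().take(5).collect();
    if result.eq_ignore_ascii_case("where") {
        // The second tuple element is the number of bytes consumed.
        let len = result.len();
        Ok((result, len))
    } else {
        Err("Expected string 'where' (case insensitive)")
    }
}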

Or you could also return () or a named empty struct for efficiency.

It could also be used to parse numbers in place with something like fn number(&str) -> Result<(i64, usize), &'static str>
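A minimal sketch of such a `number` function, using only the standard library and the signature proposed above, could look like this. The accepted syntax (an optional `-` followed by ASCII digits) and the error messages are assumptions made for the example.

```rust
/// Parse a decimal integer with an optional leading '-' from the start of `s`.
/// Returns the value and the number of bytes consumed.
fn number(s: &str) -> Result<(i64, usize), &'static str> {
    // Optional sign: one byte if present.
    let sign_len = usize::from(s.starts_with('-'));
    // ASCII digits are one byte each, so counting bytes is safe here.
    let digit_len = s[sign_len..]
        .bytes()
        .take_while(u8::is_ascii_digit)
        .count();
    if digit_len == 0 {
        return Err("Expected a number");
    }
    let end = sign_len + digit_len;
    let value = s[..end]
        .parse::<i64>()
        .map_err(|_| "Number out of range")?;
    Ok((value, end))
}
```

Anything after the digits is left untouched; the returned byte count tells the caller where to resume.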

You could also do the requested r#"-string feature. In that case you would return the parsed string literal, but skip the starting and ending ##-s. (I really don't want to implement the stack, I think it's not a good addition to PEGs)
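To show why no stack is needed for this, here is a hedged std-only sketch of a raw-string function with the proposed extern signature. The name `raw_string` and the error messages are hypothetical; the approach is simply to count the opening `#` characters and search for a quote followed by the same number of them.

```rust
/// Parse a Rust-style raw string such as r#"..."# from the start of `s`.
/// Returns the contents between the quotes and the total bytes consumed.
fn raw_string(s: &str) -> Result<(String, usize), &'static str> {
    let rest = s.strip_prefix('r').ok_or("Expected 'r'")?;
    // Count the opening '#' characters (one byte each).
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let body = rest[hashes..]
        .strip_prefix('"')
        .ok_or("Expected opening quote")?;
    // The literal ends at the first '"' followed by the same number of '#'.
    let closing = format!("\"{}", "#".repeat(hashes));
    let end = body
        .find(closing.as_str())
        .ok_or("Unterminated raw string")?;
    // 'r' + hashes + opening quote + contents + closing delimiter
    let consumed = 1 + hashes + 1 + end + closing.len();
    Ok((body[..end].to_string(), consumed))
}
```

The returned string excludes the delimiters, while the byte count includes them, so the surrounding parser skips the whole literal.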

Any comments?

badicsalex commented 2 years ago

By the way, is the case insensitive match common?

Because I think adding a case insensitive string literal and char literal shouldn't be a big problem (the biggest problem is coming up with a good syntax for it).
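For reference, a case-insensitive literal match can be sketched in a few lines of std-only Rust. The function name `match_literal_ci` is hypothetical, and comparing per-character `to_lowercase()` results is only an approximation of full Unicode case folding.

```rust
/// Match `literal` at the start of `input`, ignoring case.
/// Returns the matched slice of `input` and the number of bytes consumed.
fn match_literal_ci<'a>(input: &'a str, literal: &str) -> Option<(&'a str, usize)> {
    let mut consumed = 0;
    let mut chars = input.chars();
    for expected in literal.chars() {
        // Fail if the input ends before the literal does.
        let actual = chars.next()?;
        // Compare the lowercase expansions of both characters.
        if !actual.to_lowercase().eq(expected.to_lowercase()) {
            return None;
        }
        consumed += actual.len_utf8();
    }
    Some((&input[..consumed], consumed))
}
```

The byte count comes from the input rather than the literal, since upper- and lowercase forms of a character can have different UTF-8 lengths.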

badicsalex commented 2 years ago

Please see if the newly added features satisfy your needs. If so, I'll close the issue.

oovm commented 2 years ago

Good, this approach is very scalable.