oovm closed this issue 2 years ago
Unfortunately this can only be done with quite a bit of hacking currently. Maybe I should implement proper "external function" support.
But if you really want to do it right now, there is a way.
You have to manually create a parse_XID_START function in the same namespace as the compiled grammar (easy if you use the macro, a bit harder if you use a build script).
Something like the following:
peginate!("
@export
Idents = {idents:Ident};
@string
@no_skip_ws
Ident = (XID_START | '_') {XID_CONTINUE};
");
// The is_xid_start / is_xid_continue methods on char come from this trait.
use unicode_xid::UnicodeXID;
pub fn parse_XID_START<'a, _CT>(
state: ParseState<'a>,
_tracer: impl ParseTracer,
_cache: &_CT,
) -> ParseResult<'a, char> {
// Boilerplate
let result = state.s().chars().next().ok_or_else(|| {
state
.clone()
.report_error(ParseErrorSpecifics::Other)
})?;
// Actual business logic
if !result.is_xid_start() {
return Err(state.report_error(ParseErrorSpecifics::Other));
}
// More boilerplate
// We are skipping a full character, so we should be OK.
let state = unsafe { state.advance(result.len_utf8()) };
Ok(ParseOk {
result,
state,
farthest_error: None,
})
}
pub fn parse_XID_CONTINUE<'a, _CT>(
state: ParseState<'a>,
_tracer: impl ParseTracer,
_cache: &_CT,
) -> ParseResult<'a, char> {
// Boilerplate
let result = state.s().chars().next().ok_or_else(|| {
state
.clone()
.report_error(ParseErrorSpecifics::Other)
})?;
// Actual business logic
if !result.is_xid_continue() {
return Err(state.report_error(ParseErrorSpecifics::Other));
}
// More boilerplate
// We are skipping a full character, so we should be OK.
let state = unsafe { state.advance(result.len_utf8()) };
Ok(ParseOk {
result,
state,
farthest_error: None,
})
}
#[test]
fn test_macro() {
let s = Idents::parse("xyz áé8").unwrap();
assert_eq!(s.idents, vec!["xyz", "áé8"]);
}
I understand the above is not convenient. What if I implemented a syntax like this:
@custom_char(crate::some_module::check_xid)
XID_START;
And then in some_module.rs, you could have a function like this:
fn check_xid(c: char) -> bool {
    // Needs `use unicode_xid::UnicodeXID;` in scope.
    c.is_xid_start()
}
Maybe even use the unicode_xid directly:
@custom_char(unicode_xid::UnicodeXID::is_xid_continue)
XID_CONTINUE;
Would it fit your use-case?
This hack meets my needs.
If this were stabilized as a feature, I would like it to look like this:
@custom_char(char_xid_start) // advance 1 char
XID_START = 'ANY'; // annotative description, do not use
@custom_string(keyword_where, 5) // advance 5 chars
WHERE = 'case insensitive where'; // annotative description, do not use
@check_string(keyword_checker)
KEYWORD = Ident; // Requires successful capture of Ident and keyword_checker to return true
with function signature
fn char_xid_start(char) -> bool;
fn keyword_where(&str) -> bool;
fn keyword_checker(&str) -> bool;
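For illustration, callbacks with these three signatures could look roughly like the sketch below. It uses only std so that it stands alone: `char::is_alphabetic` is a stand-in for `UnicodeXID::is_xid_start` (an approximation, not the real XID table), and the keyword list is hypothetical.

```rust
/// Stand-in for `UnicodeXID::is_xid_start`. std's `is_alphabetic` only
/// approximates the XID_Start property; it is used here to avoid a dependency.
fn char_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

/// Returns true when the input starts with "where", ignoring ASCII case.
/// `get(..5)` yields None if the input is too short or byte 5 splits a char.
fn keyword_where(s: &str) -> bool {
    s.get(..5)
        .map_or(false, |head| head.eq_ignore_ascii_case("where"))
}

/// Returns true when an already-captured identifier is a keyword.
/// The keyword list is hypothetical.
fn keyword_checker(ident: &str) -> bool {
    const KEYWORDS: [&str; 3] = ["where", "let", "fn"];
    KEYWORDS.iter().any(|kw| kw.eq_ignore_ascii_case(ident))
}
```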
The syntax I'm currently thinking about is:
@char
@check(unicode_xid::UnicodeXID::is_xid_start)
XID_START = char; # In this case "char" is actually used
@extern(crate::keyword_where -> String)
WHERE; # no body, prefer comments
@check(crate::keyword_check)
KEYWORD = Ident;
There would be two new additions:
@check directive
The function gets whatever the rule spits out (char in case of @char rules, strings or structs in case of string or struct rules), and should return a bool.
So fn char_xid_start(char) -> bool
and fn keyword_checker(&str) -> bool
both fit here, but you could also run checks on more complex structures with multiple fields in the middle of parsing.
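As a rough sketch of a check on a more complex structure: assume a hypothetical struct rule that produces a `Range` with two fields, and a check that rejects inverted ranges. Whether the check function would receive the struct by reference or by value is not settled above; by reference is assumed here.

```rust
/// Hypothetical result of a struct rule such as `Range = start:Number '..' end:Number;`.
#[derive(Debug, PartialEq)]
struct Range {
    start: i64,
    end: i64,
}

/// Check function: accepts the parsed struct only if the range is not inverted.
/// Returning false would make the rule fail, as with any other mismatch.
fn check_range(r: &Range) -> bool {
    r.start <= r.end
}
```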
@extern directive
It is a completely external parse function with the following signature:
fn custom_fn(&str) -> Result<(T, usize), &'static str>
If the string can be parsed successfully, you return a tuple with the result and the number of bytes (!) the parser consumed from the input, wrapped in Ok. If it cannot be parsed according to the rule, you return a static error message string wrapped in Err.
In case of the keyword where, it would probably look something like this:
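fn keyword_where(s: &str) -> Result<(String, usize), &'static str> {
    // Only look at the first five characters; the rest of the input follows the keyword.
    let result: String = s.chars().take(5).collect();
    if result.to_uppercase() == "WHERE" {
        // Take the byte length before moving `result` into the tuple.
        let len = result.len();
        Ok((result, len))
    } else {
        Err("Expected string 'where' (case insensitive)")
    }
}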
Or you could also return () or a named empty struct for efficiency.
It could also be used to parse numbers in place with something like fn number(&str) -> Result<(i64, usize), &'static str>
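A minimal sketch of such a number function, assuming the proposed signature and a simple grammar of an optional minus sign followed by ASCII digits (the grammar is an assumption for illustration):

```rust
/// Parses an optionally negative decimal integer from the start of `s`.
/// Returns the value and the number of bytes consumed.
fn number(s: &str) -> Result<(i64, usize), &'static str> {
    // Optional leading minus sign (one byte).
    let sign_len = if s.starts_with('-') { 1 } else { 0 };
    // Count the ASCII digits that follow; each one is exactly one byte.
    let digit_len = s[sign_len..]
        .bytes()
        .take_while(|b| b.is_ascii_digit())
        .count();
    if digit_len == 0 {
        return Err("Expected a number");
    }
    let consumed = sign_len + digit_len;
    s[..consumed]
        .parse::<i64>()
        .map(|value| (value, consumed))
        .map_err(|_| "Number does not fit in i64")
}
```

The function only looks at the prefix it recognizes, so trailing input is left for the rest of the grammar.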
You could also do the requested r#"-string feature. In that case you would return the parsed string literal, but skip the starting and ending ##-s. (I really don't want to implement the stack, I think it's not a good addition to PEGs)
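For the raw-string case, an external function under the proposed signature might look like the sketch below. It assumes Rust-style raw strings (`r`, any number of `#`, a quote, the content, a quote, the same number of `#`); the function name and error messages are made up for illustration.

```rust
/// Parses a Rust-style raw string literal at the start of `s`.
/// Returns the content without delimiters and the number of bytes consumed.
fn raw_string(s: &str) -> Result<(String, usize), &'static str> {
    let rest = s.strip_prefix('r').ok_or("Expected 'r'")?;
    // Count the opening hashes; '#' is a single byte.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let body = rest[hashes..]
        .strip_prefix('"')
        .ok_or("Expected '\"' after the hashes")?;
    // The closing delimiter is a quote followed by the same number of hashes.
    let closing = format!("\"{}", "#".repeat(hashes));
    let end = body.find(&closing).ok_or("Unterminated raw string")?;
    // Bytes consumed: 'r' + hashes + '"' + content + closing delimiter.
    let consumed = 1 + hashes + 1 + end + closing.len();
    Ok((body[..end].to_string(), consumed))
}
```

Counting the hashes inside the function is what removes the need for a stack in the grammar itself.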
Any comments?
By the way, is the case insensitive match common?
Because I think adding a case insensitive string literal and char literal shouldn't be a big problem (the biggest problem is coming up with a good syntax for it).
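Independent of whatever syntax is chosen, the matching itself could be as small as the following sketch. It is not the library's implementation; the helper name is hypothetical, and it assumes ASCII-only case folding.

```rust
/// Matches `literal` at the start of `input`, ignoring ASCII case.
/// Returns the number of bytes consumed on success.
fn match_ignore_case(input: &str, literal: &str) -> Option<usize> {
    // `get` returns None if the input is too short or the cut splits a char.
    let head = input.get(..literal.len())?;
    if head.eq_ignore_ascii_case(literal) {
        Some(literal.len())
    } else {
        None
    }
}
```

Full Unicode case folding can change the byte length of a string, which is one reason the ASCII-only variant is the simpler starting point.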
Please see if the newly added features satisfy your needs. If so, I'll close the issue.
Good, this approach is very scalable.
My identifiers are defined as follows:
where XID_START represents the external function UnicodeXID::is_xid_start. How should I capture my Ident token?