J-F-Liu / pom

PEG parser combinators using operator overloading without macros.
MIT License

"utf8" module supporting matching UTF-8/returning &str #59

Closed mcclure closed 1 year ago

mcclure commented 1 year ago

I find "pom" very enjoyable to use, but I am frustrated by converting inputs and match strings to/from UTF-8 &str (see #53). I think adding explicit UTF-8 support to pom would bring important advantages:

This is a draft/first attempt at a utf8 module. (The regular parser is unchanged; utf8 is opt-in.) You can see what using it is like in the example examples/utf8.rs, but it's much like normal pom. (.parse() still requires the input to be passed through as_bytes(), but seq() accepts normal Rust strings.) The basic approach is

This prototype has just enough functions to implement the examples/utf8.rs example. It implements UTF-8 aware seq() and any() combinators, has UTF-8 aware collect and convert, lets you turn a utf8 Parser into a parser::Parser with from/into, and so far passes discard, map, parse, repeat, | and * through to the underlying parser::Parser implementation.

Next steps are:

Long term additions I'd be interested in attempting are:

What I need to know from @J-F-Liu:

Thank you for this neat library! I have used it a lot this month.

J-F-Liu commented 1 year ago

It's a good idea to use a UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point; we can modify the code to return both the decoded char and the number of bytes of this char.

mcclure commented 1 year ago

> It's a good idea to use a UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point; we can modify the code to return both the decoded char and the number of bytes of this char.

I am currently using bstr::decode_utf8 for this purpose, and it seems to work very well: it returns both size and char, and it works on slices (next_code_point requires an iterator). It is also safe (although the bstr-internal implementation probably makes use of unsafe). It did mean bringing in bstr as a dependency.

I think even if we take a UTF-8 string directly as the input, it is adequate to use bytes internally (i.e., take a UTF-8 string as input and call as_bytes immediately). Although if we operated on &str throughout, it might allow removing some or all of the unsafe blocks, if that matters.

In my research it appears the number of bytes in a char is predictable because Rust rejects overlong-encoded UTF-8 characters as invalid. But I still feel safer that bstr::decode_utf8 returns a byte count.

J-F-Liu commented 1 year ago

Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

mcclure commented 1 year ago

> Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

Do you have an opinion about the return type of sym()? Also char then?


By the way, here is something I am still confused about. Let's say I run

any().repeat(1..).collect() or any().repeat(1..).discard()

Say it matches 864 characters. In either case, will repeat() wind up creating a Vec of 864 chars and returning it, only for it to be immediately thrown away? Is this a potential performance issue? Or will the compiler notice the result is thrown away on the .collect() chain and eliminate that code?

J-F-Liu commented 1 year ago

Yes, sym() also returns char. I'm not sure about compiler optimization; take(n) or skip(n) may be better in this case.

mcclure commented 1 year ago

Thanks. I will update when I have a fuller implementation.

I will not worry about the compiler optimization question further for now because this problem is also present in the parser::Parser version anyhow.

mcclure commented 1 year ago

Hm, "parser::tag" is not documented in https://crates.io/crates/pom and I'm a little unclear what its function is... Am I correct that it matches only on inputs which are slices of char arrays, i.e., I = char?

I think there is no relevant way to implement this function in the utf8 module, because it's a special function for a special case the utf8 module will not hit, and I should just skip it. Is this correct?

mcclure commented 1 year ago

I am also a little bit confused by the "shr" operation. The doc on crates.io says "Parse p and get result P, then parse q and return result of q(P).", which implies to me that parsers p and q both parse as-is, but from reading the code it looks like q is a function that returns a parser, which I guess then runs. Which of these is correct?

(If that second thing is how it works (the parser actually run is q(result_of_p), not q itself), that's very useful, because it makes it possible to do things like have "p" return a number of bytes and have that get passed to take().)

mcclure commented 1 year ago

The PR now has feature parity with parser::Parser. The only thing I think is holding this back from a potential merge is writing some test cases (I have not tested it other than the utf8 and utf8_mixed examples and a quick test with shr).

Other than that, I think the following are good ideas, but I would suggest leaving them to a followup PR:

J-F-Liu commented 1 year ago

There is a usage of shr in https://github.com/J-F-Liu/lopdf/blob/master/src/parser.rs#L131