J-F-Liu / pom

PEG parser combinators using operator overloading without macros.
MIT License

"utf8" module supporting matching UTF-8/returning &str #59

Closed mcclure closed 1 year ago

mcclure commented 1 year ago

I find "pom" very enjoyable to use, but I am frustrated by converting inputs and match strings to/from UTF-8 &str (see #53). I think adding explicit UTF-8 support to pom would bring important advantages:

This is a draft/first attempt at a utf8 module. (The regular parser is unchanged; utf8 is opt-in.) You can see what using it is like in the example examples/utf8.rs, but it's much like normal pom. (.parse() still requires the input to be passed through as_bytes(), but seq() accepts normal Rust strings.) The basic approach is

This prototype has just enough functions to implement the examples/utf8.rs example. It implements UTF-8 aware seq() and any() combinators, has UTF-8 aware collect and convert, lets you turn a utf8 Parser into a parser::Parser with from/into, and so far passes discard, map, parse, repeat, | and * through to the underlying parser::Parser implementation.

Next steps are:

Long term additions I'd be interested in attempting are:

What I need to know from @J-F-Liu:

Thank you for this neat library! I have used it a lot this month.

J-F-Liu commented 1 year ago

It's a good idea to use a UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point; we can modify the code to return both the decoded char and the number of bytes of this char.

mcclure commented 1 year ago

> It's a good idea to use a UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like core::str::next_code_point; we can modify the code to return both the decoded char and the number of bytes of this char.

I am currently using bstr::decode_utf8 for this purpose, and it seems to work very well: it returns both size and char, and it works on slices (next_code_point requires an iterator). It is also safe (although the bstr-internal implementation probably makes use of unsafe). It did mean bringing in bstr as a dependency.

I think even if we take a UTF-8 string directly as the input, it is adequate to use bytes internally (i.e., take a UTF-8 string as input and call as_bytes immediately). Although if we operated on &str throughout, it might allow removing some or all of the unsafe blocks, if that matters.

In my research it appears the number of bytes in a char is predictable because Rust rejects overlong-encoded UTF-8 characters as invalid. But I still feel safer that bstr::decode_utf8 returns a byte count.

J-F-Liu commented 1 year ago

Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

mcclure commented 1 year ago

> Well, bstr::decode_utf8 already does this. It's ok to define the type of utf8::Parser as Parser<'a, O>. The type of any should be pub fn any<'a>() -> Parser<'a, char>.

Do you have an opinion about the return type of sym()? Also char then?


By the way, here is something I am still confused about. Let's say I run

any().repeat(1..).collect() or any().repeat(1..).discard()

Say it matches 864 characters. In either case, will repeat() wind up creating a Vec of 864 chars and returning it, only for it to be immediately thrown away? Is this a potential performance issue? Or will the compiler notice the result is thrown away on the .collect() chain and eliminate that code?

J-F-Liu commented 1 year ago

Yes, sym() also returns char. I'm not sure about compiler optimization; take(n) or skip(n) may be better in this case.

mcclure commented 1 year ago

Thanks. I will update when I have a fuller implementation.

I will not worry about the compiler optimization question further for now because this problem is also present in the parser::Parser version anyhow.

mcclure commented 1 year ago

Hm, "parser::tag" is not documented in https://crates.io/crates/pom and I'm a little unclear what its function is... Am I correct that it matches only on inputs which are slices of char arrays, i.e., I = char?

I think there is no relevant way to implement this function in the utf8 module, because it's a special function for a special case the utf8 module will not hit, and I should just skip it. Is this correct?

mcclure commented 1 year ago

I am also a little bit confused by the "shr" operation. The doc on crates.io says "Parse p and get result P, then parse q and return result of q(P).", which implies to me that parsers p and q both parse as-is, but from reading the code it looks like q is a function that returns a parser, which I guess then runs. Which of these is correct?

(If that second thing is how it works (the parser actually run is q(result_of_p), not q itself), that's very useful, because it makes it possible to do things like have "p" return a number of bytes and have that get passed to take().)

mcclure commented 1 year ago

The PR now has feature parity with parser::Parser. The only thing I think is holding this back from a potential merge is writing some test cases (I have not tested it other than the utf8 and utf8_mixed examples and a quick test with shr).

Other than that, I think the following are good ideas, but I would suggest leaving them to a followup PR:

J-F-Liu commented 1 year ago

There is a usage of shr in https://github.com/J-F-Liu/lopdf/blob/master/src/parser.rs#L131