It's a good idea to use a UTF-8 string directly as the input, then advance the input position char by char. An efficient implementation is something like `core::str::next_code_point`; we can modify the code to return both the decoded char and the number of bytes of this char.
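A minimal sketch of that idea using only stable std APIs (the real `core::str::next_code_point` is an unstable internal function that works on a byte iterator; the function name below is made up for illustration):

```rust
// Sketch only: decode the leading char of the remaining input and report how
// many bytes it occupied, so a parser can advance its input position by that amount.
fn next_char_with_len(input: &str) -> Option<(char, usize)> {
    let ch = input.chars().next()?;
    Some((ch, ch.len_utf8()))
}
```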
I am currently using `bstr::decode_utf8` for this purpose and it seems to work very well: it returns both size and char, and it works on slices (`next_code_point` requires an iterator). It is also safe (although the bstr-internal implementation probably makes use of `unsafe`). It did mean bringing in `bstr`.

I think even if we take a UTF-8 string directly as the input, it is adequate to use bytes internally (i.e. take a UTF-8 string as input and call `as_bytes` immediately). Although if we operated on `&str` throughout, it might allow removing some or all of the `unsafe`s, if that matters.

In my research it appears the number of bytes in a char is predictable, because Rust rejects overlong-encoded UTF-8 characters as invalid. But I still feel safer that `bstr::decode_utf8` returns a byte count.
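For illustration, here is roughly what advancing char by char with `bstr::decode_utf8` looks like; it returns `(Option<char>, usize)` for a byte slice. This loop is a sketch, not code from the PR:

```rust
use bstr::decode_utf8;

// Walk a byte slice char by char, getting each decoded char and its byte length.
fn walk(input: &[u8]) {
    let mut pos = 0;
    while pos < input.len() {
        let (ch, size) = decode_utf8(&input[pos..]);
        match ch {
            Some(c) => println!("{:?} is {} byte(s)", c, size),
            None => {
                // Invalid UTF-8: `size` is the length of the bad prefix to skip.
                println!("invalid UTF-8 at byte {}", pos);
                break;
            }
        }
        pos += size;
    }
}
```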
Well, `bstr::decode_utf8` already does this. It's OK to define the type of `utf8::Parser` as `Parser<'a, O>`. The type of `any` should be `pub fn any<'a>() -> Parser<'a, char>`.
Do you have an opinion about the return type of `sym()`? Also `char`, then?
By the way, here is something I am still confused about. Let's say I run `any().repeat(1..).collect()` or `any().repeat(1..).discard()`, and say it matches 864 characters. In either case, will the `repeat()` wind up creating a Vec of 864 chars and then returning them, only for them to immediately be thrown away? Is this a potential performance issue? Or will the compiler notice the result is thrown away on the `.collect()` chain and eliminate that code?
Yes, `sym()` also returns `char`.

I'm not sure about compiler optimization; `take(n)` or `skip(n)` may be better in this case.
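For illustration, a hedged sketch of those byte-level combinators; `take(n)`/`skip(n)` need the count up front, so they only replace `repeat` when the length is already known. The function names here are made up:

```rust
use pom::parser::*;

// Consume exactly 4 bytes without keeping them (no Vec is built).
fn magic<'a>() -> Parser<'a, u8, ()> {
    skip(4)
}

// Keep a reference to exactly 16 bytes of the input as a slice.
fn body<'a>() -> Parser<'a, u8, &'a [u8]> {
    take(16)
}
```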
Thanks. I will update when I have a fuller implementation.
I will not worry about the compiler optimization question further for now, because this problem is also present in the `parser::Parser` version anyhow.
Hm, "parser::tag" is not documented in https://crates.io/crates/pom and I'm a little unclear what its function is... Am I correct it matches only on inputs which are slices of char arrays, IE, I=char?
I think there is no relevant way to implement this function in the utf8 module because it's a special function for a special case the utf8 function will not hit, and I should just skip it. Is this correct?
I am also a little bit confused by the "shr" operation. The doc on crates.io says "Parse p and get result P, then parse q and return result of q(P).", which implies to me that parsers p and q both parse as-is, but from reading the code it looks like q is a function that returns a parser, which I guess then runs. Which of these is correct?
(If that second thing is how it works (the parser is result_of_p(q), not q) that's very useful because it makes it possible to do things like have "p" return a number of bytes and that get passed to take().)
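For what it's worth, that second reading matches how `>>` is typically used in pom: the right-hand side is a closure that receives the left parser's output and returns the parser to run next. A hedged sketch of the "length passed to `take()`" case; the `length_prefixed` name and one-byte length are made up for illustration:

```rust
use pom::parser::*;

// One byte giving a length, then exactly that many bytes of payload:
// the closure after >> receives the parsed length and builds the next parser.
fn length_prefixed<'a>() -> Parser<'a, u8, &'a [u8]> {
    any() >> |n: u8| take(n as usize)
}
```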
The PR now has feature parity with `parser::Parser`. The only thing I think is holding this back from a potential merge is writing some test cases (I have not tested it other than the utf8 and utf8_mixed examples, and a quick test with `shr`).

Other than that, I think the following are good ideas, but I would suggest leaving them to a followup PR:
There is a usage of shr in https://github.com/J-F-Liu/lopdf/blob/master/src/parser.rs#L131
I find "pom" very enjoyable to use but I find I have frustration around converting inputs and match-strings to/from UTF-8
&str
(see #53). I think pom adding explicit support for UTF-8 would bring important advantages:any()
that matches UTF-8 chars only (yes, I know I can.convert(str::from_utf8)
and it will correctly reject invalid UTF-8, but that bails out relatively late)This is a draft/first attempt at a
This is a draft/first attempt at a `utf8` module. (The regular parser is unchanged; `utf8` is opt-in.) You can see what using it is like in the example examples/utf8.rs, but it's much like normal pom. (`.parse()` still requires the input to be `.as_bytes()`'d, but `seq()` accepts normal Rust strings.) The basic approach is:

- `use pom::utf8::*` contains functions that have the same names and usage as the ones in `pom::parser::*` (so it is mostly a drop-in replacement), but any returns or arguments that are `&[I]` in `parser::Parser` are `&str` in `utf8::Parser`.
- `pom::utf8::Parser<'a, O>` is implemented as a thin wrapper around `pom::parser::Parser<'a, u8, O>`. It is a separate type because, by keeping track of which patterns are pure UTF-8, `collect()` over a tree of `utf8::Parser`s can return a `&str` safely. But because at its core it's still just `parser::Parser<'_, u8, _>`, it can be combined into a single pattern with a non-UTF-8 `parser::Parser` (at the cost of no longer being able to do a `collect()` without re-verifying UTF-8). A rough sketch of the wrapper idea follows below.
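A minimal sketch of that wrapper idea (not the PR's actual code), assuming the invariant that the utf8 combinators only ever match whole UTF-8 characters:

```rust
use pom::parser;

// The wrapper adds no data; the type just records that the wrapped byte parser
// only ever matches complete UTF-8 characters.
pub struct Parser<'a, O>(parser::Parser<'a, u8, O>);

impl<'a, O: 'a> Parser<'a, O> {
    // collect() on the inner parser returns the matched &[u8]; the module's
    // invariant is what makes skipping re-validation sound here.
    pub fn collect(self) -> Parser<'a, &'a str> {
        Parser(self.0.collect().map(|bytes| {
            // Safety: `bytes` was matched only by UTF-8-aware combinators.
            unsafe { std::str::from_utf8_unchecked(bytes) }
        }))
    }
}
```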
This prototype has just enough functions to implement the examples/utf8.rs example. It implements UTF-8 aware `seq()` and `any()` combinators, has the UTF-8 aware `collect` and `convert`, lets you turn a utf8 Parser into a `parser::Parser` with `from`/`into`, and so far has methods passing `discard`, `map`, `parse`, `repeat`, `|` and `*` on to the underlying `parser::Parser` implementation.
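To give a feel for it, here is a hedged usage sketch in the spirit of examples/utf8.rs, written against the API described above (the `greeting` name is made up, and the exact operators passed through may differ):

```rust
use pom::utf8::*;

// Match a literal greeting, then collect the remaining chars as one &str.
fn greeting<'a>() -> Parser<'a, &'a str> {
    seq("你好, ") * any().repeat(1..).collect()
}

fn main() {
    // .parse() still takes bytes in this draft, so the input is as_bytes()'d.
    println!("{:?}", greeting().parse("你好, 世界".as_bytes()));
}
```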
Next steps are:

- The remaining `parser::` functions/methods. (I may do this with a macro? I think I would have to write the macro myself; there are some delegation macro crates, but none of them seem exactly fit to this situation.)
- `sym` needs to be special, because this is the one function where I intend to use a slightly different interface from `parser::Parser`: `pub fn sym<'a>(tag: char) -> Parser<'a, &'a str>` will return a single-char string, while `pub fn sym_char<'a>(tag: char) -> Parser<'a, char>` will return a parsed char, to make constructions like `sym_char(ch).is_a(str::is_alphabetic)` possible.
- The module currently uses `unsafe {}` because it calls `str::from_utf8_unchecked` on slices it has already confirmed contain complete UTF-8 characters. I would like to introduce a Cargo "feature" to remove use of unsafe from utf8, at the cost of a redundant `str::from_utf8` check in places (see the sketch after this list).
- A `utf8::Parser.parse_str(input: &str)` that just calls `parse(input.as_bytes())`, for convenience (?).
- `+`, `-` etc. that take one `parser::Parser` and one `utf8::Parser` and return a `parser::Parser`, to make it easy to mix them; also I want to create an examples/utf8_mixed.rs demonstrating using `parser::Parser` and `utf8::Parser` in the same pattern (e.g. a simple MsgPack parser or something).
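As referenced in the `unsafe` item above, a sketch of what such a feature-gated escape hatch could look like; the `no_unsafe` feature name and the `bytes_to_str` helper are hypothetical:

```rust
// Hypothetical "no_unsafe" Cargo feature: re-validate instead of trusting the
// combinators' invariant, keeping the crate free of unsafe.
#[cfg(feature = "no_unsafe")]
fn bytes_to_str(bytes: &[u8]) -> &str {
    std::str::from_utf8(bytes).expect("utf8 combinators should only collect valid UTF-8")
}

// Default: skip the redundant check, relying on the combinators having already
// decoded these bytes as complete UTF-8 characters.
#[cfg(not(feature = "no_unsafe"))]
fn bytes_to_str(bytes: &[u8]) -> &str {
    unsafe { std::str::from_utf8_unchecked(bytes) }
}
```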
Long term additions I'd be interested in attempting are:
What I need to know from @J-F-Liu:

- Are you OK with the new dependency (`bstr`) and the use of `unsafe`?
- `utf8::Parser` is `Parser<'a, O>`. This makes sense because by definition it can only ever work on `u8`, but it means mixing `fn`s that define `utf8::Parser`s and `parser::Parser`s in the same file would be a little confusing, because some functions would have 2 generic arguments and some would have 3. Would it make sense to put the `I` type argument back in with a `where I = u8`, and require the user to type the `u8` generic argument every time? (My vote is no, it's fine the way it is now, but I wanted to ask; a small illustration follows below.)
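To illustrate the ergonomics question with made-up function names: under the current design a UTF-8 parser function writes two generic arguments, while a byte-level one writes three.

```rust
use pom::{parser, utf8};

// Current design: the u8 input type is implied by utf8::Parser.
fn word<'a>() -> utf8::Parser<'a, &'a str> {
    utf8::any().repeat(1..).collect()
}

// Byte-level equivalent: the input type I = u8 appears explicitly.
fn raw<'a>() -> parser::Parser<'a, u8, &'a [u8]> {
    parser::any().repeat(1..).collect()
}
```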
Thank you for this neat library! I have used it a lot this month.