J-F-Liu / pom

PEG parser combinators using operator overloading without macros.
MIT License

How to use with Unicode? #53

Open mcclure opened 1 year ago

mcclure commented 1 year ago

I recently wrote a small program with pom. I found the API lovely, but I found it very hard to get string values into the library. All sample code is written with <u8, T> parsers and the literals are written like b"char". It is clear how to use this with ASCII, but not with Unicode.

If I instead write my parsers as <char, T>, then of course parse() cannot accept strings, because pom then expects an array of chars and a string is UTF-8 bytes. I can convert the string to an array of chars, but for very long strings this will be inefficient.

I see the convert() function can be used to easily (efficiently?) interpret a string as a sequence of bytes, so maybe it is okay to just use <u8, T>. However, then I have a different problem. What if I want to have Unicode literals (maybe sym('🐈'), if for some reason 🐈 is a separator) or Unicode ranges (for example codepoint U+1100 to U+11FF [ᄀ..ᇿ])?
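To make the question concrete, here is a minimal sketch of what matching a Unicode literal or a code point range directly on UTF-8 bytes involves. It uses only the Rust standard library. The helper names `match_char` and `match_range` are hypothetical and are not part of pom's API.

```rust
/// Hypothetical helper (not pom API): match the literal `c` at the start
/// of `input`, which is assumed to hold UTF-8 bytes.
/// Returns the number of bytes consumed on success.
fn match_char(input: &[u8], c: char) -> Option<usize> {
    let mut buf = [0u8; 4];
    // Encode the literal once and compare it byte-for-byte with the input.
    let encoded = c.encode_utf8(&mut buf).as_bytes();
    if input.starts_with(encoded) {
        Some(encoded.len())
    } else {
        None
    }
}

/// Hypothetical helper (not pom API): decode the first code point of
/// `input` and accept it if it lies in `lo..=hi`.
/// Returns the code point and the number of bytes it occupies.
fn match_range(input: &[u8], lo: char, hi: char) -> Option<(char, usize)> {
    // A UTF-8 code point is at most 4 bytes long, so a 4-byte window suffices.
    let end = input.len().min(4);
    let prefix = match std::str::from_utf8(&input[..end]) {
        Ok(s) => s,
        // The window may cut a later code point in half; keep the valid part.
        Err(e) => std::str::from_utf8(&input[..e.valid_up_to()]).ok()?,
    };
    let c = prefix.chars().next()?;
    if lo <= c && c <= hi {
        Some((c, c.len_utf8()))
    } else {
        None
    }
}
```

With these helpers, `match_char("🐈x".as_bytes(), '🐈')` consumes 4 bytes, since 🐈 (U+1F408) takes four bytes in UTF-8, and `match_range` accepts ᄀ (U+1100) for the range U+1100 to U+11FF while consuming 3 bytes.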

J-F-Liu commented 1 year ago

Use Parser<'a, char, O>. The parsing process involves a lot of trial and error and needs to go back to previous positions, so an iterator is not enough.
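The point about going back to a previous position can be illustrated with a small std-only sketch. It is not pom's implementation; `literal` and `choice` are hypothetical names. The input is a slice of `char` (for example built with `text.chars().collect::<Vec<char>>()`, at a cost of 4 bytes per character), so any position can be revisited by index.

```rust
/// Hypothetical helper (not pom API): match `lit` starting at `pos`.
/// Returns the position just past the match on success.
fn literal(input: &[char], pos: usize, lit: &str) -> Option<usize> {
    let mut cur = pos;
    for expected in lit.chars() {
        // Indexing by position works because `input` is a slice, not an iterator.
        if input.get(cur) != Some(&expected) {
            return None;
        }
        cur += 1;
    }
    Some(cur)
}

/// Hypothetical helper (not pom API): ordered choice. Every alternative
/// is tried from the same starting position, so a failed attempt that
/// already looked at several characters costs nothing to undo.
fn choice(input: &[char], pos: usize, alternatives: &[&str]) -> Option<usize> {
    alternatives.iter().find_map(|lit| literal(input, pos, lit))
}
```

For the input `"🐈→x"` collected into a `Vec<char>`, `choice(&input, 0, &["🐈←", "🐈→"])` first fails on the second character of `"🐈←"`, then retries from position 0 and succeeds with `"🐈→"`. A one-pass iterator would already have consumed 🐈 by the time the first alternative failed.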