com-lihaoyi / fastparse

Writing Fast Parsers Fast in Scala
https://com-lihaoyi.github.io/fastparse
MIT License

Support unicode escapes #2

Closed lihaoyi closed 9 years ago

lihaoyi commented 9 years ago

They're dumb but we probably have to support them

@sirthias is there any way to override the parsing over every single character or string to check for these silly \u0123 thingies? Maybe by modifying my current wspStr and wspChar thingies? I suppose I'd need to get rid of anyOf or noneOf because those don't support the stupid unicode escapes either.

I don't want to do a pre-processing stage if I can reasonably avoid it. Preprocessing will destroy all the source locations and require elaborate gymnastics to get them back.
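For reference, recognizing a `\uXXXX` escape inline at the lexer level (rather than in a separate pass) amounts to something like the sketch below. This is plain Scala, not tied to parboiled2, and `decodeUnicodeEscape` is a hypothetical helper name:

```scala
// Hypothetical sketch: try to decode a \uXXXX escape starting at `start`,
// returning the decoded character and the escape's length in source chars.
// Scala (like Java) also allows repeated 'u's, e.g. \uu0041.
def decodeUnicodeEscape(s: String, start: Int): Option[(Char, Int)] = {
  if (start + 1 >= s.length || s.charAt(start) != '\\' || s.charAt(start + 1) != 'u') None
  else {
    var i = start + 1
    while (i < s.length && s.charAt(i) == 'u') i += 1 // skip one or more 'u's
    if (i + 4 > s.length) None
    else {
      val hex = s.substring(i, i + 4)
      if (hex.forall(c => Character.digit(c, 16) >= 0))
        Some((Integer.parseInt(hex, 16).toChar, i + 4 - start))
      else None
    }
  }
}
```

A rule like this would have to be consulted everywhere a raw character is consumed, which is exactly the pain point being discussed.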

sirthias commented 9 years ago

I agree that we should try to get away w/o preprocessing if at all possible.

As to anyOf and noneOf: If these only contain 7-bit ASCII chars you should replace them with CharPredicate instances defined on the companion object. Performance will be much better. See the CharacterClasses definition in the akka-http header parser for inspiration.
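To illustrate why this is fast (this is the idea behind `CharPredicate`, not parboiled2's actual implementation; `AsciiPredicate` is a hypothetical name): membership in a 7-bit ASCII set can be precomputed into two 64-bit masks, so each test is a shift and a mask rather than a scan or a `Set` lookup.

```scala
// Sketch: a 128-bit membership mask over the 7-bit ASCII range.
final class AsciiPredicate private (lowMask: Long, highMask: Long) {
  def apply(c: Char): Boolean =
    c < 128 && {
      if (c < 64) ((lowMask >>> c) & 1L) != 0L
      else ((highMask >>> (c - 64)) & 1L) != 0L
    }
}

object AsciiPredicate {
  def apply(chars: String): AsciiPredicate = {
    var lo = 0L; var hi = 0L
    chars.foreach { c =>
      require(c < 128, s"not 7-bit ASCII: $c")
      if (c < 64) lo |= (1L << c) else hi |= (1L << (c - 64))
    }
    new AsciiPredicate(lo, hi)
  }
}
```

Usage would look like `val HexDigit = AsciiPredicate("0123456789abcdefABCDEF")`, with `HexDigit(c)` as the per-character test.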

is there any way to override the parsing over every single character or string

Yes, you can override the implicit conversion from String and/or Char. See the Handling Whitespace section of the README for an example.

propensive commented 9 years ago

This might be difficult to get right given that Unicode escaping is a pre-process step in the compiler. How would you parse this, for example?

"\u005c"

Scalac equates it to a single backslash before parsing, which escapes the "closing" double-quote, and thus it fails to parse.

Though """\u005c""" parses just fine, and is equal to "\\". But I can only say that because Unicode escapes are the only kind of escape that works inside triple-quoted strings...
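For concreteness, the compiler's pre-lexing translation pass can be sketched as below (simplified: it ignores Scala/Java's even-number-of-preceding-backslashes rule; `preprocessUnicode` is a hypothetical helper). Applied to the eight source characters of "\u005c", it yields the three characters quote, backslash, quote, i.e. an unterminated string literal:

```scala
// Sketch of the pre-lexing pass: replace each \uXXXX escape with the
// character it denotes, everywhere in the source text.
def preprocessUnicode(src: String): String = {
  val out = new StringBuilder
  var i = 0
  while (i < src.length) {
    val c = src.charAt(i)
    if (c == '\\' && i + 1 < src.length && src.charAt(i + 1) == 'u') {
      var j = i + 1
      while (j < src.length && src.charAt(j) == 'u') j += 1 // allow \uu...
      if (j + 4 <= src.length && src.substring(j, j + 4).forall(h => Character.digit(h, 16) >= 0)) {
        out.append(Integer.parseInt(src.substring(j, j + 4), 16).toChar)
        i = j + 4
      } else { out.append(c); i += 1 }
    } else { out.append(c); i += 1 }
  }
  out.toString
}
```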

Incidentally, Erik Osheim did some work removing Unicode escaping as a pre-processor step in the compiler, and moving it into the single-quoted string parser to try to remove most of the unintuitive corner cases. This should make it into the Typelevel fork at some point...

Cheers, Jon


lihaoyi commented 9 years ago

Maybe the right thing to do is to preprocess unicode escapes, and purposely leave the source positions all wrong.

sirthias commented 9 years ago

If you implement that simple pre-processing at the ParserInput level, building a simple translation map, you might be able to have your cake and eat it too.
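A minimal sketch of that suggestion (`preprocessWithMap` is a hypothetical helper, not part of the parboiled2 `ParserInput` API; it handles only the plain `\uXXXX` form): expand the escapes up front, but record, for every index in the expanded text, the index it came from in the raw source, so error positions can be mapped back.

```scala
// Sketch: expand unicode escapes while building an index translation map.
// map(k) is the raw-source offset that produced expanded-text offset k.
def preprocessWithMap(src: String): (String, Array[Int]) = {
  val out = new StringBuilder
  val map = scala.collection.mutable.ArrayBuffer.empty[Int]
  var i = 0
  while (i < src.length) {
    if (src.charAt(i) == '\\' && i + 5 < src.length && src.charAt(i + 1) == 'u' &&
        src.substring(i + 2, i + 6).forall(h => Character.digit(h, 16) >= 0)) {
      out.append(Integer.parseInt(src.substring(i + 2, i + 6), 16).toChar)
      map += i // the expanded char maps back to the escape's start
      i += 6
    } else {
      out.append(src.charAt(i))
      map += i
      i += 1
    }
  }
  (out.toString, map.toArray)
}
```

A parse error at offset k in the expanded text would then be reported at `map(k)` in the original source.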

paulp commented 9 years ago

My approach was to honor unicode escapes where I must (that is, it won't parse if I don't) and ignore them otherwise (in strings and comments). That's much closer to the behavior I think is sane, and I am doubtful that the power of unicode escapes to open and close strings and comments is something which requires support.

Not claiming this is especially performant or anything, just a point of reference. https://github.com/paulp/scala-parser/blob/0a1e476c712d2ba/parser/src/main/scala/Basic.scala#L24

lihaoyi commented 9 years ago

Unicode escapes are now supported in strings. I'm going to just punt on this in general as a #wontfix. In all the dozen projects I parsed, I think I found exactly 4 unicode escapes that don't fall in a string, all of which are in Scalac test files. Not worth my time to support this ^_^

propensive commented 9 years ago

Are they supported in characters, e.g. '\u0000'? That, I think, would clear up all of the other useful cases.