com-lihaoyi / fastparse

Writing Fast Parsers Fast in Scala
https://com-lihaoyi.github.io/fastparse
MIT License

How to detect invalid utf-8 sequences in scala fastparse? #295

Closed winitzki closed 10 months ago

winitzki commented 1 year ago

When the input comes from a byte array that contains an invalid UTF-8 sequence, the parser seems to silently convert that sequence into a different Char value. I would like to be able to detect that the input contains invalid UTF-8 sequences. Where can I specify the input encoding? I could not find any fastparse option for that.

Here is a working test:


    import fastparse._, NoWhitespace._

    val input = Array(0x20.toByte, 0xED.toByte, 0xA0.toByte, 0x80.toByte)

    def grammar[$: P] = P(SingleChar.rep)

    val result = parse(input, grammar(_))
    assert(result.get.value == Seq(32.toChar, 65533.toChar))

The byte sequence 0xED 0xA0 0x80 is not valid UTF-8: it would encode the surrogate code point U+D800, which UTF-8 does not allow. How can I detect that?

I did not expect to obtain the Char value 65533 (U+FFFD) here. It seems that the decoder inserts this character, the Unicode "replacement" character, to signal an invalid byte sequence. Is there any way for me to override this behavior?
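
For illustration, one possible workaround is to decode the bytes strictly before they reach the parser. The sketch below uses only the JDK's `java.nio.charset` API and does not rely on any fastparse option; the object name `StrictUtf8` and the helper `decodeStrict` are hypothetical names introduced here. It assumes the whole input fits in memory as a byte array.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper, not part of fastparse.
object StrictUtf8 {
  // Decode bytes as UTF-8, reporting malformed input as an error
  // instead of substituting the replacement character U+FFFD.
  def decodeStrict(bytes: Array[Byte]): Either[CharacterCodingException, String] = {
    val decoder = StandardCharsets.UTF_8
      .newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case e: CharacterCodingException => Left(e) }
  }
}
```

With the byte array from the test above, `StrictUtf8.decodeStrict(input)` should yield a `Left`, so the invalid sequence is detected before parsing. For valid input, the resulting `String` can then be passed to `parse` in place of the byte array, so that fastparse never performs the byte-to-char decoding itself.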