When the input comes from a byte array that contains an invalid utf-8 sequence, the parser seems to silently convert that sequence into a different Char value. I would like to be able to detect that the input contains invalid utf-8 sequences. Where can I specify the input encoding? I could not find any options for fastparse about that.
Here is a working test:
import fastparse._, NoWhitespace._
val input = val input = Array(0x20.toByte, 0xED.toByte, 0xA0.toByte, 0x80.toByte)
def grammar[$: P] = P(SingleChar.rep)
val result = parse(input, grammar(_))
assert(result.get.value == Seq(32.toChar, 65533.toChar))
The byte sequence 0xEDA080 is not a valid UTF-8 character. How can I detect that?
I did not expect to obtain the Char value 65533 here. It seems that the decoder inserts this character to signify an invalid code sequence (this is a "replacement" character). Is there any way for me to override this behavior?
When the input comes from a byte array that contains an invalid utf-8 sequence, the parser seems to silently convert that sequence into a different
Char
value. I would like to be able to detect that the input contains invalid utf-8 sequences. Where can I specify the input encoding? I could not find any options forfastparse
about that.Here is a working test:
The byte sequence
0xEDA080
is not a valid UTF-8 character. How can I detect that?I did not expect to obtain the
Char
value65533
here. It seems that the decoder inserts this character to signify an invalid code sequence (this is a "replacement" character). Is there any way for me to override this behavior?