Open ST-DDT opened 6 years ago
When constructing a String out of broken UTF-8 content, what happens? I am guessing each invalid byte gets decoded as the replacement character U+FFFD (usually rendered as a question mark):
https://www.fileformat.info/info/unicode/char/0fffd/index.htm
which will then add garbage to the attribute value.
I don't think this is something Woodstox should really be doing. Although I understand it may be inconvenient, I think handling of broken content is something that the application needs to configure somehow.
I have to consume a message from a message broker with (sometimes) broken encoding in one of its attributes. (It's from legacy software that nobody wants/dares to touch.)
Currently, when trying to parse the messages, I get the following Exception:
If I use the same bytes in a String directly it works perfectly fine.
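To illustrate what "works fine" means here, the following is a minimal sketch using only the JDK, not Woodstox or Jackson. The class name `BrokenUtf8Demo`, its method names, and the sample bytes are made up for illustration; they are not the bytes from my actual message. It shows that `new String(bytes, UTF_8)` silently substitutes U+FFFD for a malformed sequence, while a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and sample data are illustrative only.
final class BrokenUtf8Demo {

    // Lenient: the String constructor replaces malformed input with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: a decoder set to REPORT throws on malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 'a', then a lone 0xE4 (an ISO-8859-1 character, malformed as UTF-8 here), then 'b'.
        byte[] broken = {'a', (byte) 0xE4, 'b'};

        String lenient = decodeLenient(broken);
        System.out.println("contains U+FFFD: " + lenient.contains("\uFFFD"));

        try {
            decodeStrict(broken);
            System.out.println("strict decoding succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict decoding failed: " + e);
        }
    }
}
```

The lenient path is what I get when I build the String myself; the strict path is comparable in spirit to a parser that rejects the input outright.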
It would be nice if I could use an option to allow broken encodings in my Strings instead of Exceptions. (After parsing the input, I usually have enough context to know which messages I have to fix and how)
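As a rough sketch of the behavior I am asking for, the bytes can be decoded leniently first and the resulting String handed to the parser, so the parser never sees the malformed sequence. This example uses the JDK's built-in StAX API (`javax.xml.stream`) rather than any Woodstox-specific setting, and the class name `LenientXmlAttributeDemo`, the method `readAttribute`, and the sample document are assumptions made up for illustration. It is not necessarily the workaround I use in my project.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; names and sample data are illustrative only.
final class LenientXmlAttributeDemo {

    // Decodes the bytes leniently (malformed input becomes U+FFFD), then returns
    // the named attribute of the first element, or null if it is absent.
    static String readAttribute(byte[] xmlBytes, String attributeName) throws XMLStreamException {
        String xml = new String(xmlBytes, StandardCharsets.UTF_8);
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // A null namespace URI means the namespace is not checked.
                    return reader.getAttributeValue(null, attributeName);
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Encoding as ISO-8859-1 produces a lone 0xE4 byte, which is malformed as UTF-8 here.
        byte[] broken = "<message text=\"a\u00E4b\"/>".getBytes(StandardCharsets.ISO_8859_1);
        String value = readAttribute(broken, "text");
        System.out.println("contains U+FFFD: " + value.contains("\uFFFD"));
    }
}
```

The drawback is that the whole message has to be decoded up front and the original bytes are lost, which is why a parser-level option would be preferable.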
I use jackson-dataformat-xml 2.9.6 + woodstox 5.0.3/5.1 to parse the message.
Currently I use the following workaround to bypass the issue:
As an alternative, I considered a plain byte solution, but unfortunately the parser still reads the input as a String so that it can base64-decode it, and I didn't find a way to tell the parser to just give me the bytes without base64-decoding them first.
Code to reproduce
Data class:
Test method:
Output: