FasterXML / woodstox

The gold standard Stax XML API implementation. Now at Github.
Apache License 2.0
220 stars 81 forks source link

Add support to optionally allow surrogate pair entities (#165) #174

Closed Magmaruss closed 5 months ago

Magmaruss commented 1 year ago

This feature is an answer for the issue #165. It's extending the XML reader functionality with a new configuration option (default disabled) to support reading XML with unicode characters written using surrogate pair (UTF-16 encoding). This feature is very important to correctly read XML from the legacy services which produces responses this way and there is no way to change the behavior of them.

Example: The external service produces a response with a unicode character written using UCS2:

<response>
    <value>Merry Christmas &#55356;&#57221;</value>
</response>

Default behavior of XMLStreamReader is throwing an exception with message:

Illegal character entity: expansion character (code 0xd83c

After setting the P_RESOLVE_ENTITY_SURROGATE_PAIRS parameter to true it will be readed with no exception converting the surrogate pair entities to the appropriate unicode (🎅)

cowtowncoder commented 1 year ago

First of all: thank you for contributing this feature.

Second: could you add a description that explains why this is needed, what is the use case (ideally with a simple example) and so on?

Magmaruss commented 1 year ago

Description added.

cowtowncoder commented 8 months ago

First of all: apologies for ultra-slow follow up here. But I am ready to get this merged.

One minor legal thing first: if we haven't yet gotten a CLA, we'd need one from here (it's under Jackson project, that's fine):

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and it's easiest to print, fill & sign, scan/photo, email to info at fasterxml dot com. (for project you can add Woodstox or FasterXML/Woodstox).

This is only needed once for all contributions to Woodstox, and once I get it, I can proceed with merge. I added some notes already; just need to re-read main logic and should be good to merge.

Thank you again for this contribution!

EDIT: I gave wrong email address -- CLA actually needs to be sent to info at fasterxml dot com instead; cla email alias did not yet exist unlike I thought.

cowtowncoder commented 7 months ago

@Magmaruss quick note: I gave a wrong email address to send CLA to -- CLA actually needs to be sent to info at fasterxml dot com instead; cla email alias did not yet exist unlike I thought. So just in case you already sent one, could you please re-send it. Apologies for mix-up.

cowtowncoder commented 7 months ago

NOTE: CLA received; can now focus on finalizing PR.

cowtowncoder commented 7 months ago

@Magmaruss I hope you don't mind my applying changes I suggested here, wrt naming and adding @since tags. I am open to different naming too if you strongly prefer yet different naming.

Beyond this, I think it'd be good to refactor decoding loop, possibly the way I suggested (but not necessarily, there may well be other ways to improve readability).

Also thank you for getting CLA; I am hoping we can get this PR merged soon!

cowtowncoder commented 7 months ago

@Magmaruss Ok, I did some more tweaks and I think things look good.

About the only thing I think that is needed would be a test or two for invalid cases; one for catching surrogate pairs without enabling their use; and another one for just invalid pairings with handling enabled. I hope to get this merged soon, and then soon after release 6.6.0.

Thank you again for contributing this!

Magmaruss commented 6 months ago

I can write simple test for disabled option, but for now when disabled it's throwing "Illegal character entity..." - default behavior before introducing the feature. Should I replace this exception with something special when first valid high surrogate detected or leave the default behavior and add test only?

cowtowncoder commented 6 months ago

I can write simple test for disabled option, but for now when disabled it's throwing "Illegal character entity..." - default behavior before introducing the feature. Should I replace this exception with something special when first valid high surrogate detected or leave the default behavior and add test only?

Either way works for me: obviously improved error messages are always a plus, but if you want to limit the scope of changes & extra work, verifying default behavior is fine also.

cowtowncoder commented 5 months ago

Hi @Magmaruss! I was wondering if you might be able to add basic unit test? No rush if you don't have time -- I just hope to be able to merge this in, get 6.6.0 released, and then be able to do #134 (increase min JDK baseline to 8 for Woodstox).

Magmaruss commented 5 months ago

Yes, sorry for that, but i am having a very difficult time. I will try to deliver it tonight or tomorrow at the latest

cowtowncoder commented 5 months ago

@Magmaruss Sorry did not mean to add pressure at all. I appreciate your help here and please take your time -- I don't want this to be another stress point.

Magmaruss commented 5 months ago

Test case for disabled option added. Please tell if you need something more.