Add mechanism for processing invalid XML names (transforming to valid ones)

mensinda commented 2 years ago

This commit introduces the PROCESS_ESCAPED_MALFORMED_TAGS and ESCAPE_MALFORMED_TAGS features that control whether invalid tag names will be escaped with an attribute.

fixes #523 fixes #524

cowtowncoder commented 2 years ago

First of all, thank you for contributing this PR.

But as to proposed changes.. Hmmh. I really don't like the approach of magic names like this. While I understand it could be used to retain information it results in kind of special-purpose XML only processable by Jackson. I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

mensinda commented 2 years ago

While I understand it could be used to retain information

Retaining information is an absolute must-have in my use case (serializing and deserializing data).

I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

So, basically something like my base32 proposal from #523? Or do you mean something that isn't reversible like replacing those chars with _?

mensinda commented 2 years ago

If you also don't like something like base32 tags, would it be OK to have the default being replacing invalid chars with _ and then having an option that either does the current logic from this PR or base32? I really do need 100% reversibility and would like to have an upstream solution for this...

mensinda commented 2 years ago

I have refactored the PR a bit, and now there is support for multiple strategies for escaping tags:

Key	Description	Why
`NONE`	No escaping - can produce invalid XML	Backwards compatibility
`REPLACE`	Replaces invalid characters with `_`	No jackson specific magic in the output
`ATTRIBUTE_ESCAPE`	The real tag name in an attribute	Valid XML and the tags are human readable
`BASE64`	base64url encoded tag with a `base64_tag_` prefix.	Valid XML, 100% reversible, and no magic attributes.

Each strategy has its pros and cons. So instead of us choosing what tradeoffs to make, why not let the user decide?

mensinda commented 2 years ago

Anything that else that is still needed? Is it OK to include multiple strategies for escaping tag names or do you only want one that can be toggled with a flag?

cowtowncoder commented 2 years ago

I have nothing against configuration, but I am not going to accept structure-changing modifications (adding wrapping element, additional attributes and so on). These will be unlikely to work reliably and will end up maintenance nightmares. I will also not consider 100% information retaining a goal in the sense that one could do transformations only using XML content -- XML module is built to assume there is certain amount of property metadata that comes from Java types (POJOs). This is why name-mangling is acceptable: binding to/from POJO properties can be made even if name transformation itself was lossy. All that is needed is that there is a one-way transformation from logical property name (which may not be valid XML name) into valid XML name; reverse transform is not needed. This is how XML module is designed to work: it may not be what you would prefer but it is the design that I will follow.

Having said all of that I am not against one-way transformations that also happen to be reversible. This could include base64 (and similar) encoding with prefix. SInce this is not a structural transformation and can work with existing handlers it is acceptable to me.

mensinda commented 2 years ago

So, for this PR specifically, do you want me to just drop the ATTRIBUTE_ESCAPE in this case?

cowtowncoder commented 2 years ago

Sigh. I am not sure I will want to go with this approach at all. My main concern is complexity it adds to already fragile processing. If I was to go with this approach, yes, dropping ATTRIBUTE_ESCAPE would be needed.

I guess backtracking a bit, the way I would see transformations would not operate at streaming level at all, but at property handling (databind). That's where it is possible (and necessary) to use one-way transform from logical property name to physical to match. This leaves streaming parser/generator as-is without knowing anything about changes. One challenging case there would be that of wrappers, I think, but other than that modifying property names with Bean[De]SerializerModifier would be relatively straight-forward.

Translating things at lower level would have some benefits but my worry is that handling is already quite convoluted. I see why reversible transformation would be necessary there. But doing it there also adds significant overhead, per-element/attribute processing that is not needed if bindings are defined at databind level, where (de)serializers are constructed ones and name translation similarly.

So: I don't think I will accept approach as a whole as defined here.

But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible. This may even include out-of-the-box implementations; but also needs to allow custom implementation of converter(s). So instead of Enum selection, there'd have to be something like XmlNameConverter to be implemented by user (but possibly also with one or more standard implementations). It would be ok to have default "no-op" implementation invoked.

... but not sure if this can be done without adding overhead of keeping QName and so on. That is another part I would want to avoid.

mensinda commented 2 years ago

So: I don't think I will accept approach as a whole as defined here.

Fine by me, bit (I know we have different priorities here) in the end I need something where I can put in an arbitrary map convert it to XML and get the same map back out. Ideally, I would also like to avoid having to maintain a proprietary fork :)

But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible.

Would that be something dataformat XML specific or would be a generic data-bind solution preferred?

... but not sure if this can be done without adding overhead of keeping QName and so on. That is another part I would want to avoid.

I can refactor this PR into a generic extension point (I am also fine with dropping support for extra magic attributes), but could you then specify more clearly what would be acceptable and what not? Do you just want me to just map on Strings or do you want me to convert Strings to QName and vice versa?

cowtowncoder commented 2 years ago

@mensinda So: extension I was thinking of would be XML-specific, and registered similar to what PR does but instead of pre-defined enum, with actual (stateless) handler.

One tricky part is that I do not want additional conversions to/from QName in case handler is not registered (or default no-op one is used). So ideally default case without new handler would not have additional overhead. I am fine having some sort of opaque state/storage if need be, for handler to provide/take, if that is needed. Just not additional processing when custom handler not needed.

Does this make more sense?

As to databind-level approach: that could be pursued separately and would probably just allow replacing all invalid-in-QName characters with underscore (or maybe some other configurable character). I think that is not something that would work for your use case.

mensinda commented 2 years ago

@cowtowncoder I have updated the PR to use a new extension point (XmlTagProcessor) as requested.

mensinda commented 2 years ago

ping :)

cowtowncoder commented 2 years ago

Hi there! Sorry, haven't had any time to look into this. It's on my list, hoping to get back to it in near future.

mensinda commented 1 year ago

Hi, because of the timeline "The plan is to get the first Release Candidate (2.14.0-rc1) out during August 2022" from https://cowtowncoder.medium.com/jackson-2-14-sneak-peek-79859babaa4, I was wandering how likely it would be that this PR could go into the 1.14 release?

cowtowncoder commented 1 year ago

@mensinda I have my long (but shortening slowly) list of things to work through prior to release; this is an entry. So timing of RC1 may well move but I will have a look here before that, or at least final release (possible to have multiple RCs, even with new features).

cowtowncoder commented 1 year ago

@mensinda Hi there! Apologies for this taking so long but I FINALLY had a chance to go back and read the PR. I like the approach and hope we can get it in 2.14.

I will be adding some smaller notes as comments but I have only one bigger thing I'd like to change: avoiding construction of XmlTag container for every element. I think there are 3 possibilities for this:

Since this is sort of "internal" value, could just change it to be mutable value class, passed by stream to processor and processor can modify local name and/or namespace URI as it sees fit.
Or avoid helper class completely: either by only allowing change of local name (or is namespace URI needed as marker in some cases?), or by having separate calls.
Or make tag processor optional so instead of "no-op" default instance, leave it as null and only call mutation methods if non-null one defined

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String. But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I know this may sound like over-optimizing but I hope this can be done since most use cases will probably not configure mutation so allocations are unnecessary overhead.

cowtowncoder commented 1 year ago

@mensinda I think I could just merge this and make change I want (wrt mutability of tag info to avoid construction of instances). But I realized there is one practical thing to do first if we haven't done it yet (apologies if we did and I just forgot): I need the CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(there is also alternate Corporate CLA if individual one linked above doesn't work)

and it's a one-time thing (good for all future contributions). The easiest way is usually to print it, fill, sign & scan/photo, then email to info at fasterxml dot com. There are other possibilities (if you can't scan, modifying PDF with info + name as signature) too.

I would really like to get this in the first 2.14.0-rc1 if possible!

mensinda commented 1 year ago

Thanks for the review, but I am currently on vacation. I can send you the CLA this Friday if that is enough...

On 6 September 2022 02:04:31 CEST, Tatu Saloranta @.> wrote: @. I think I could just merge this and make change I want (wrt mutability of tag info to avoid construction of instances). But I realized there is one practical thing to do first if we haven't done it yet (apologies if we did and I just forgot): I need the CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(there is also alternate Corporate CLA if individual one linked above doesn't work)

and it's a one-time thing (good for all future contributions). The easiest way is usually to print it, fill, sign & scan/photo, then email to info at fasterxml dot com. There are other possibilities (if you can't scan, modifying PDF with info + name as signature) too.

I would really like to get this in the first 2.14.0-rc1 if possible!

-- Reply to this email directly or view it on GitHub: https://github.com/FasterXML/jackson-dataformat-xml/pull/531#issuecomment-1237535579 You are receiving this because you were mentioned.

Message ID: @.***> -- Sent from my Android device with K-9 Mail. Please excuse my brevity.

cowtowncoder commented 1 year ago

@mensinda No problem, that's fine & thank you for the quick reply. It's OSS so availability expected to be variable.

cowtowncoder commented 1 year ago

@mensinda Ah. If only single copy created per-mapper yeah no need to optimize, ignore that suggestion.

As to other suggestions I guess my thinking was that if new methods were to be added, it'd be easier to have a "base" implementation (empty). But with just 2 methods maybe that's overthinking things. It does not look like this interface was likely to need expansion; and if it does, can use default method implementations for backwards compatibility.

cowtowncoder commented 1 year ago

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String. But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I, don't see any reason for or against processing the URI. I personally don't have a use case for this (and can think of one, to be honest), but it was easy to include and someone might have a use case for it, so I included it.

Would you be OK with just dropping URI processing for now?

I think that actually my preferred choice is to make XmlTag mutable, simple value class. So caller sets up namespace and local name; calls processor; uses values it finds. Just working around the lack of return Tuples in Java. I do think it plausible that someone might want to base (part of) processing on namespace URI in future.

cowtowncoder commented 1 year ago

@mensinda I think that I can easily make the minor change wrt mutability after merging. So given that I think you sent CLA (will check that), I think we are good now.

Thank you for this contribution!

cowtowncoder commented 1 year ago

Merged, will change XmlTagName handling.

I also realized that ReplaceTagProcessor Javadoc needs some changing: it does not really replace only invalid characters but rather more -- all non-ASCII character, for example. That is fine for many users but not for international users. This is fine as long as explanation is accurate.

cowtowncoder commented 1 year ago

... and I don't think this PR handles attribute names either (should have spent bit more time reading details). I'll add that.

FasterXML / jackson-dataformat-xml

Add mechanism for processing invalid XML names (transforming to valid ones) #531