FasterXML / jackson-dataformat-xml

Extension for Jackson JSON processor that adds support for serializing POJOs as XML (and deserializing from XML) as an alternative to JSON
Apache License 2.0

If there are HTML tags within XML tags, @JacksonXmlText will assign incorrect values to the content. #623

Open Suny95 opened 6 months ago

Suny95 commented 6 months ago

If there are HTML tags within XML tags, the Jackson XML parser will assign incorrect values to the content.

    @Data
    public static class Abstract {

        @JacksonXmlElementWrapper(useWrapping = false)
        @JacksonXmlProperty(localName = "AbstractText")
        private List<AbstractText> abstractTextList;

    }

    @Data
    public static class AbstractText {

        @JacksonXmlProperty(isAttribute = true)
        private String Label;

        @JacksonXmlProperty(isAttribute = true, localName = "NlmCategory")
        private String category;

        @JacksonXmlText
        private String value;

    }

Suny95 commented 6 months ago

Version: 'com.fasterxml.jackson.dataformat:jackson-dataformat-xml:2.15.0'

cowtowncoder commented 5 months ago

Although a textual description can be helpful, what is needed is a full (but ideally minimal) reproduction that shows the exact problem.

ronnoceel commented 2 months ago

I have a working example.

xml:

<Abstract>
   <AbstractText><i>Objective</i>. Holographic mixed reality (HMR) allows for the superimposition of computer-generated virtual objects onto the operator's view of the world. Innovative solutions can be developed to enable the use of this technology during surgery. The authors developed and iteratively optimized a pipeline to construct, visualize, and register intraoperative holographic models of patient landmarks during spinal fusion surgery. <i>Methods.</i> The study was carried out in two phases. In phase 1, the custom intraoperative pipeline to generate patient-specific holographic models was developed over 7 patients. In phase 2, registration accuracy was optimized iteratively for 6 patients in a real-time operative setting. <i>Results.</i> In phase 1, an intraoperative pipeline was successfully employed to generate and deploy patient-specific holographic models. In phase 2, the registration error with the native hand-gesture registration was 20.2 &#xb1; 10.8&#xa0;mm (n = 7 test points). Custom controller-based registration significantly reduced the mean registration error to 4.18 &#xb1; 2.83&#xa0;mm (n = 24 test points, <i>P</i> &lt; .01). Accuracy improved over time (B = -.69, <i>P</i> &lt; .0001) with the final patient achieving a registration error of 2.30 &#xb1; .58&#xa0;mm. Across both phases, the average model generation time was 18.0 &#xb1; 6.1&#xa0;minutes (n = 6) for isolated spinal hardware and 33.8 &#xb1; 8.6&#xa0;minutes (n = 6) for spinal anatomy. <i>Conclusions.</i> A custom pipeline is described for the generation of intraoperative 3D holographic models during spine surgery. Registration accuracy dramatically improved with iterative optimization of the pipeline and technique. While significant improvements and advancements need to be made to enable clinical utility, HMR demonstrates significant potential as the next frontier of intraoperative visualization.</AbstractText>
</Abstract>

Java:

Abstract.java

import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlElementWrapper;

import java.util.List;

public class Abstract {
    @JacksonXmlElementWrapper(useWrapping = false)
    public List<AbstractText> getAbstractText() {
        return this.AbstractText;
    }

    public void setAbstractText(List<AbstractText> AbstractText) {
        this.AbstractText = AbstractText;
    }

    List<AbstractText> AbstractText;

    public String getCopyrightInformation() {
        return this.CopyrightInformation;
    }

    public void setCopyrightInformation(String CopyrightInformation) {
        this.CopyrightInformation = CopyrightInformation;
    }

    String CopyrightInformation;
}

AbstractText.java

package articlemetadata.pubmed.efetch;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonRawValue;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlText;

@JsonIgnoreProperties(value = { "i" , "b", "sup", "sub", "u"})
public class AbstractText {

    @JacksonXmlProperty(isAttribute = true)
    public String getLabel() {
        return this.Label;
    }

    public void setLabel(String Label) {
        this.Label = Label;
    }

    String Label;

    @JacksonXmlProperty(isAttribute = true)
    public String getNlmCategory() {
        return this.NlmCategory;
    }

    public void setNlmCategory(String NlmCategory) {
        this.NlmCategory = NlmCategory;
    }

    String NlmCategory;

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    @JacksonXmlText
    @JsonRawValue
    String text;
}

driver

    final XmlMapper xmlMapper = XmlMapper.xmlBuilder()
            .propertyNamingStrategy(PropertyNamingStrategies.UPPER_CAMEL_CASE)
            .build();

    String input = "                <Abstract>\n" +
            "                    <AbstractText><i>Objective</i>. Holographic mixed reality (HMR) allows for the superimposition of computer-generated virtual objects onto the operator's view of the world. Innovative solutions can be developed to enable the use of this technology during surgery. The authors developed and iteratively optimized a pipeline to construct, visualize, and register intraoperative holographic models of patient landmarks during spinal fusion surgery. <i>Methods.</i> The study was carried out in two phases. In phase 1, the custom intraoperative pipeline to generate patient-specific holographic models was developed over 7 patients. In phase 2, registration accuracy was optimized iteratively for 6 patients in a real-time operative setting. <i>Results.</i> In phase 1, an intraoperative pipeline was successfully employed to generate and deploy patient-specific holographic models. In phase 2, the registration error with the native hand-gesture registration was 20.2 &#xb1; 10.8&#xa0;mm (n = 7 test points). Custom controller-based registration significantly reduced the mean registration error to 4.18 &#xb1; 2.83&#xa0;mm (n = 24 test points, <i>P</i> &lt; .01). Accuracy improved over time (B = -.69, <i>P</i> &lt; .0001) with the final patient achieving a registration error of 2.30 &#xb1; .58&#xa0;mm. Across both phases, the average model generation time was 18.0 &#xb1; 6.1&#xa0;minutes (n = 6) for isolated spinal hardware and 33.8 &#xb1; 8.6&#xa0;minutes (n = 6) for spinal anatomy. <i>Conclusions.</i> A custom pipeline is described for the generation of intraoperative 3D holographic models during spine surgery. Registration accuracy dramatically improved with iterative optimization of the pipeline and technique. While significant improvements and advancements need to be made to enable clinical utility, HMR demonstrates significant potential as the next frontier of intraoperative visualization.</AbstractText>\n" +
            "                </Abstract>\n";

    Abstract abs = xmlMapper.readValue(input, Abstract.class);
    String totalAbstract = abs.getAbstractText().get(0).getText();
    System.out.println(totalAbstract);

This only prints the value of the abstract text AFTER the "Conclusions" italics. Removing the <i> tags produces the entire string.

cowtowncoder commented 2 months ago

This does not seem like valid usage, for a couple of reasons:

  1. You are marking "i" (etc.) as properties to ignore, so all text inside <i> is skipped, as requested. Ignoring does not mean that only the XML tag is dropped; it applies to the property implied by the tag, contents included.
  2. List<AbstractText> won't work the way you perhaps expect: there is only one <AbstractText> element, and that one element receives all of the text segments.

In general, this kind of mixed content is very difficult to make work with data binding. You may be able to work around some issues by using a setter for the text content that combines the segments, like so:

    private String text = "";

    @JacksonXmlText
    public void setText(String text) {
        this.text = this.text + text;
    }

but you would probably also need to have something like:

    @JsonProperty("i")
    @JsonAlias({ "other", "tags", "here" })
    public void setTextFromTags(String text) {
        // Append text found inside inline tags to the same accumulated value
        this.text = this.text + text;
    }
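The accumulation idea behind the two setters above can be illustrated outside Jackson. The following is a minimal sketch using only the JDK's StAX reader, not Jackson data binding: it appends every text segment it encounters, whether the segment sits directly inside <AbstractText> or inside an inline tag such as <i>. The class and method names (MixedContentSegments, collectText) are hypothetical and chosen only for this illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustration only: plain JDK StAX, not Jackson data binding. */
final class MixedContentSegments {

    /** Concatenates all text segments in the document, including text nested in inline tags. */
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not process DTDs; the fragments used here do not need them.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                // Mixed content arrives as several separate text events; append each one.
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    text.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<AbstractText><i>Methods.</i> Two phases, <i>P</i> &lt; .01</AbstractText>";
        System.out.println(collectText(xml));
    }
}
```

A setter that assigns instead of appending would keep only the last segment it was given, which matches the symptom reported earlier in the thread.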

ronnoceel commented 2 months ago

Thank you for your answer.

I am reading this data in from the NCBI PubMed EFetch API. <AbstractText> can sometimes be a list, but it is often a list of one (as in my example).

I see why data binding might not work in this scenario. If that changes in the future I would be happy to know, but for my use case I am using the following (lossy) workaround, which I will record here for posterity:

HttpResponse<String> fetchResponse = httpClient.send(fetchRequest, HttpResponse.BodyHandlers.ofString());
String body = Optional.ofNullable(fetchResponse.body())
              .map(i -> i.replaceAll("<i>", ""))
              .map(i -> i.replaceAll("</i>", ""))
              .map(i -> i.replaceAll("<b>", ""))
              .map(i -> i.replaceAll("</b>", ""))
              .map(i -> i.replaceAll("<sup>", ""))
              .map(i -> i.replaceAll("</sup>", ""))
              .map(i -> i.replaceAll("<sub>", ""))
              .map(i -> i.replaceAll("</sub>", ""))
              .map(i -> i.replaceAll("<u>", ""))
              .map(i -> i.replaceAll("</u>", ""))
              .orElse("");

return xmlMapper.readValue(body, clazz);

There is perhaps a more elegant way of doing this but it is working for me for now. I hope this helps anyone in the future who stumbles upon this.
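One possibly tidier variant of the same lossy workaround is a single precompiled regular expression that covers the opening and closing forms of the five tags in one pass. This is only a sketch: the class and method names (InlineTagStripper, stripInlineTags) are hypothetical, and like the chain above it matches only attribute-free tags.

```java
import java.util.regex.Pattern;

/** Illustration only: strips a fixed set of inline tags before the XML is handed to a mapper. */
final class InlineTagStripper {

    // Matches <i>, </i>, <b>, </b>, <sup>, </sup>, <sub>, </sub>, <u>, </u> (no attributes).
    private static final Pattern INLINE_TAGS = Pattern.compile("</?(?:i|b|sup|sub|u)>");

    /** Returns the input with the inline tags removed, or an empty string for null input. */
    static String stripInlineTags(String xml) {
        if (xml == null) {
            return "";
        }
        return INLINE_TAGS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String body = "<AbstractText><i>P</i> &lt; .01, H<sub>2</sub>O</AbstractText>";
        System.out.println(stripInlineTags(body));
    }
}
```

The cleaned string would then be passed to xmlMapper.readValue(...) exactly as in the workaround above. The formatting information is still discarded, so the approach remains lossy.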

ronnoceel commented 2 months ago

In my ideal world, I would like to be able to specify something like @MixedContent, which would let the mapper know that any content in that text field should be interpreted as a string in its entirety. This is what I thought @JsonRawValue might do, but I was mistaken.

cowtowncoder commented 2 months ago

@JsonRawValue is sort of the opposite: on serialization it allows injecting pre-formatted content (it should also work for XML, although I am not 100% sure it does), but it does nothing on deserialization (reading). Since XML parsers (and JSON parsers, for that matter) rarely have any way to return un-parsed/un-decoded content, there is no reliable way to get the "original" content anyway, so I don't think this would ever be supported. I am also not sure it would be a good idea for most usage even if it were possible.

But one idea I have had for a while (with no solid plan to implement it) is the possibility of something like XmlNode, a subtype of JsonNode, as a binding target. Or, alternatively, supporting DOM Node as a binding target? In both cases there would be a target type into which an XML-native representation could be bound, and custom code could then process it in whatever way makes sense. The challenging parts include the separation of concerns between databinding (where JsonNode and (de)serializers are implemented), which is format-agnostic for the most part (by design), and the streaming level (JsonParser, FromXmlParser, etc.), where format-specific differences are implemented.
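Until something like that exists, the "custom code processes an XML-native tree" part of the idea can be approximated by bypassing XmlMapper for the mixed-content elements and using the JDK's DOM parser directly. The sketch below is not a Jackson feature; the class and method names (DomAbstractReader, abstractTexts) are hypothetical. It relies on the standard Node.getTextContent(), which concatenates the text of an element and all of its descendants.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustration only: reads mixed content with the JDK DOM API instead of data binding. */
final class DomAbstractReader {

    /** Returns the full text of every <AbstractText> element, inline tags flattened. */
    static List<String> abstractTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("AbstractText");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() joins the text of the element and all nested elements.
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Abstract><AbstractText><i>Results.</i> Error was 4.18 &#xb1; 2.83 mm</AbstractText></Abstract>";
        abstractTexts(xml).forEach(System.out::println);
    }
}
```

Unlike the tag-stripping workaround, this needs no list of known inline tags, and custom code could instead walk the child nodes if the markup itself has to be preserved.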