databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

Text fields with embedded tags #614

Closed kornel-at-swoop closed 1 year ago

kornel-at-swoop commented 1 year ago

I have been able to parse out a very deeply nested complex XML file successfully with this library but ran into a problem when there were tags embedded in a text field.

Our project parses public data provided as is in XML format, so we have no control over the schema. One of the text fields has embedded HTML tags for italic and bold.

I used Spark's ability to infer schema to pretty much accurately create a defined schema, then I saved that schema and used it to subsequently read the documents in production. But I noticed that Spark wasn't able to infer the schema when text was mixed with embedded tags.

  1. Is there any way to tell spark-xml not to try to parse a subtree under a certain XML tag, but simply give me that flattened out text from that subtree of tags? (BTW, this feature would be very useful in other situations as well)

  2. If not, would it be possible to supply the XML as a string instead of a file, then I could maybe just do a simple search and replace to remove the unwanted tags before handing them to spark-xml?

If this is not possible I will have to front each transformation with an XSLT style preprocessing step just to accomplish this which would be extremely inefficient since we are dealing with many thousands of very large XML documents. All in all we are parsing tens of millions of articles.

Any suggestions for what the best practices are to use the tool in this fashion would be greatly appreciated.

Thanks. Kornel

srowen commented 1 year ago

Are the embedded tags actually tags? or escaped? This shouldn't be a problem:

<myTag>
  &lt;html&gt;...
</myTag>

This would be a "problem" if the tags aren't really to be understood as tags, even though they are:

<myTag>
  <html>...
</myTag>

I'm assuming the latter, and yeah that's kinda 'invalid' semantically according to the intent, but hey.

Just set the type of myTag to 'string' in the schema? you have to specify the schema to do this (or perhaps modify the inferred schema). Then the body of myTag is just treated as text no matter what.

If you can extract the XML of each row upfront, rather than let the library do it, and can form a Dataset or RDD of Strings (each containing XML for one row), then yes you can parse that with methods in XmlReader, like xmlDataset. You could transform the XML ahead of time that way if desired, sure.

kornel-at-swoop commented 1 year ago

Thank you for responding!

I changed it to StringType but this doesn't work either. It stops parsing once it hits the <I> tag, and everything afterwords is set to null.

    StructField(
      "AbstractText",
      StringType,
      true
    )

Even though this does not seem to work, this seems like how it should ideally work because it would be very clean. Basically, if I could just tell the parser to treat a subtree as text and skip parsing it, it would be very elegant. I don't know how feasible it would be to make this work.

If the above cannot be made work, I wouldn't mind doing it the way you suggest in your last paragraph: Read the xml into a data frame one XML string at a time, rip out all <i> and </i> tags, then pass it in to the parser as XML strings, one line at a time.

I conceptually understand what you are proposing and I like it but I cannot quite wrap my brain around how to do this cleanly and concisely. It sounds like the kind of thing that could be done with a single line of Scala code if I know what I'm doing. Would you be able to point me to an example where an XML file is read into a data frame and then passed onto the parser in two distinct steps. Then perhaps between those two distinct steps I could inject my search and replace to get rid of <i> tags.

Is there somewhere more deeper documentation where I could find examples of doing this kind of thing?

Thanks. Kornel

srowen commented 1 year ago

Are you sure? I just tested and treating the col as a string in the schema seems to work. Like if I parse book from this, specifying that it has one string col text:

<books>
    <book>
        <text>
            <html>
                Some <i>text</i>.
            </html>
        </text>
    </book>
</books>

I get

+----------------------------------------+
|text                                    |
+----------------------------------------+
|<html>\n                Some <i>text</i>|
+----------------------------------------+
srowen commented 1 year ago

As for removing tags, just use a regex. After you parse the XML rows out (the hard part) as strings in a Dataset, just map those strings with something like:

val removeTagRegex = "<[^>]+>".r
...
.... .map { text => removeTagRegex.replaceAllIn(text, "") }
kornel-at-swoop commented 1 year ago

I was able to get your example to work pretty fast but mine just did not work. It took me a while to figure out this was because I needed to bracket the <I></I> tags in <HTML></HTML>. Otherwise, the parsing will fail.

The problem is that when this kind of situation happens (random HTML tags in text), the source is often not under the control of the engineer who is parsing the XML. In this case, we're parsing a public database and they have no <HTML>, only <I>.

Is there ay way we could make this work even when we don't have the random tags bracketed with <HTML>?

kornel-at-swoop commented 1 year ago

It's not the HTML tag. Even with that it stops parsing after seeing a couple of italic tags. So it seems indeterministic. I came up with a concrete example that does not parse:

<books>
    <book>
        <text>
            Lorem ipsum dolor sit amet. Ut <i>voluptas</i> distinctio et impedit deserunt aut quam fugit et quaerat odit et nesciunt earum non dolores culpa et sunt nobis. Aut accusamus iste sed odio debitis et quasi amet rem quam sequi et voluptatem placeat aut voluptates iste? Vel nisi rerum sit eligendi excepturi et galisum animi et ipsa nihil vel consequatur velit eos velit nesciunt.
            Quo voluptatibus sint ab officiis aperiam non obcaecati rerum eos veniam iste eum ipsam modi. <i>Non</i> voluptatem illum qui molestiae magni qui maxime commodi et accusantium similique qui necessitatibus <i>minus</i>?
            At quod rerum et porro nisi ut tempore error et enim optio cum Quis voluptatibus qui dolores sapiente cum cupiditate quia. Ut incidunt neque aut provident quaerat qui quia <i>illum</i>. Ab esse commodi ad earum molestias non internos atque non <i>consequatur</i> inventore 33 galisum nobis hic distinctio impedit! Est dicta iusto est <i>numquam</i> incidunt cum autem temporibus.
        </text>
    </book>
</books> 

My schema for this was: StructType(StructField(book,StructType(StructField(text,StringType,true)),true))

This produced the following output (stopped at first <I> tag): {"text": "\n Lorem ipsum dolor sit amet. Ut "}

kornel-at-swoop commented 1 year ago

Adding the HTML tag made it go further but then it only seems to parse the italic text. It omits everything else:

The XML with the HTML tag added:

<books>
    <book>
        <text>
            <html>Lorem ipsum dolor sit amet. Ut <i>voluptas</i> distinctio et impedit deserunt aut quam fugit et quaerat odit et nesciunt earum non dolores culpa et sunt nobis. Aut accusamus iste sed odio debitis et quasi amet rem quam sequi et voluptatem placeat aut voluptates iste? Vel nisi rerum sit eligendi excepturi et galisum animi et ipsa nihil vel consequatur velit eos velit nesciunt.
            Quo voluptatibus sint ab officiis aperiam non obcaecati rerum eos veniam iste eum ipsam modi. <i>Non</i> voluptatem illum qui molestiae magni qui maxime commodi et accusantium similique qui necessitatibus <i>minus</i>?
            At quod rerum et porro nisi ut tempore error et enim optio cum Quis voluptatibus qui dolores sapiente cum cupiditate quia. Ut incidunt neque aut provident quaerat qui quia <i>illum</i>. Ab esse commodi ad earum molestias non internos atque non <i>consequatur</i> inventore 33 galisum nobis hic distinctio impedit! Est dicta iusto est <i>numquam</i> incidunt cum autem temporibus.</html>
        </text>
    </book>
</books>

The output:

object
text: "<html>Lorem ipsum dolor sit amet. Ut <i>voluptas</i><i>Non</i><i>minus</i><i>illum</i><i>consequatur</i><i>numquam</i>"
srowen commented 1 year ago

OK I see what's going on. PR incoming to make both of these scenarios work.

kornel-at-swoop commented 1 year ago

Thank you!