databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

ref attribute in XSDToSchema #617

Closed shuch3ng closed 1 year ago

shuch3ng commented 1 year ago

I tried to parse Example 3 from https://www.w3schools.com/xml/el_element.asp with XSDToSchema:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="to"/>
      <xs:element ref="from"/>
      <xs:element ref="heading"/>
      <xs:element ref="body"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>

</xs:schema>

and got the following exception:

Unsupported schema element type: null
java.lang.IllegalArgumentException: Unsupported schema element type: null
    at com.databricks.spark.xml.util.XSDToSchema$.getStructField(XSDToSchema.scala:216)
    at com.databricks.spark.xml.util.XSDToSchema$.$anonfun$getStructField$4(XSDToSchema.scala:182)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructField(XSDToSchema.scala:173)
    at com.databricks.spark.xml.util.XSDToSchema$.$anonfun$getStructType$1(XSDToSchema.scala:227)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructType(XSDToSchema.scala:221)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:49)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:61)

The cause is that the elements inside the complexType are declared with ref and have no schemaType of their own, so null is passed to getStructField at line 182 of XSDToSchema.scala:

val baseType = getStructField(xmlSchema, e.getSchemaType).dataType

shuch3ng commented 1 year ago

I tried changing line 182 to:

val refQName = e.getRef.getTargetQName
val baseType = 
  if (refQName != null)
    getStructField(xmlSchema, xmlSchema.getParent.getElementByQName(refQName).getSchemaType).dataType
  else getStructField(xmlSchema, e.getSchemaType).dataType

and the extracted schema now looks like this:

StructType(StructField(note,StructType(StructField(null,StringType,false), StructField(null,StringType,false), StructField(null,StringType,false), StructField(null,StringType,false)),false), StructField(heading,StringType,false), StructField(from,StringType,false), StructField(to,StringType,false), StructField(body,StringType,false))

I don't have a corresponding XML file to test the schema against, but the null names in the StructFields inside note don't look right to me.
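The null names suggest that an element declared with ref carries neither a name nor a type of its own, so both would have to come from the global declaration it points to. Below is a minimal sketch of that lookup, not the XSDToSchema implementation: it uses the JDK's DOM parser instead of the schema-model objects that XSDToSchema works with, it assumes a schema with no target namespace (so ref values are plain names), and RefNameSketch, childElements, and noteFields are hypothetical names chosen for illustration.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper object; not part of spark-xml.
object RefNameSketch {
  private val XsNs = "http://www.w3.org/2001/XMLSchema"

  // The note XSD from the first comment.
  val xsd: String =
    """<?xml version="1.0"?>
      |<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="note">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element ref="to"/>
      |        <xs:element ref="from"/>
      |        <xs:element ref="heading"/>
      |        <xs:element ref="body"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |  <xs:element name="to" type="xs:string"/>
      |  <xs:element name="from" type="xs:string"/>
      |  <xs:element name="heading" type="xs:string"/>
      |  <xs:element name="body" type="xs:string"/>
      |</xs:schema>""".stripMargin

  // Direct children of `parent` that are XSD elements with the given local name.
  private def childElements(parent: Element, localName: String): Seq[Element] = {
    val nodes = parent.getChildNodes
    (0 until nodes.getLength).map(i => nodes.item(i)).collect {
      case e: Element if e.getNamespaceURI == XsNs && e.getLocalName == localName => e
    }
  }

  // (name, type) pairs for the members of `note`, with each ref resolved.
  def noteFields(): Seq[(String, String)] = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    val root = factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xsd.getBytes("UTF-8")))
      .getDocumentElement

    // Global element declarations, keyed by name.
    val globals: Map[String, Element] =
      childElements(root, "element").map(e => e.getAttribute("name") -> e).toMap

    val complexType = childElements(globals("note"), "complexType").head
    val sequence = childElements(complexType, "sequence").head
    childElements(sequence, "element").map { e =>
      // A ref element has no name or type itself; read both from its target.
      val decl = if (e.hasAttribute("ref")) globals(e.getAttribute("ref")) else e
      decl.getAttribute("name") -> decl.getAttribute("type")
    }
  }
}
```

Calling RefNameSketch.noteFields() yields the pairs (to, xs:string), (from, xs:string), (heading, xs:string), (body, xs:string), which is the shape the fields inside note would be expected to have once the name is taken from the referenced declaration.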

srowen commented 1 year ago

Right, that isn't supported. Your change looks to be heading in the right direction by following the 'ref', but it seems to need a further change to be correct.

However, it's reading fields like "to" both as members of the struct and as top-level elements. Is that the intent? That's what the schema seems to say too.

shuch3ng commented 1 year ago

Yes, that's the intent, because an XSD can have multiple top-level elements. In this example, "to", "from", "heading" and "body" are all globally defined, so they can be referenced elsewhere in the schema and also be used as root elements.

I managed to get the correct field names from ref. I'll add a test and create a PR.

srowen commented 1 year ago

No, that's not what I mean. The 'global' definitions are part of the schema too. Is that what you intend? That is, does "body" really appear twice in the schema?

shuch3ng commented 1 year ago

Yes, it appears twice, and yes, that's what I'm trying to achieve. A better example is probably the one below, which is a modified version of an XSD I encountered at work.

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string" form="qualified"/>
        <xsd:element name="author" type="xsd:string" form="qualified"/>
        <xsd:element name="isbn" type="xsd:string" form="qualified"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="bookList" type="BookList"/>
  <xsd:complexType name="BookList">
    <xsd:sequence>
      <xsd:element ref="book" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

This XSD contains two top-level elements, book and bookList. Depending on what is needed, either bookList or book should be extracted from the following two XML documents using the XSD.

<bookList>
    <book>
        <name>Functional Programming in Scala</name>
        <author>Michael Pilquist, Runar Bjarnason, Paul Chiusano</author>
        <isbn>9781617299582</isbn>
    </book>
    <book>
        <name>Spark : The Definitive Guide</name>
        <author>Bill Chambers, Matei Zaharia</author>
        <isbn>9781491912218</isbn>
    </book>
</bookList>

<book>
    <name>Spark : The Definitive Guide</name>
    <author>Bill Chambers, Matei Zaharia</author>
    <isbn>9781491912218</isbn>
</book>

With the current XSDToSchema, this XSD cannot be parsed at all: it cannot handle the ref attribute and throws an exception, so even the book schema cannot be retrieved.
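To make the expected result concrete, here is a rough sketch of how the two root elements could be described once ref and named complex types are followed. It is an illustration under assumptions, not what XSDToSchema does: it walks the XSD with the JDK's DOM parser, handles only xsd:sequence content, treats maxOccurs="unbounded" as an array, assumes no target namespace, and prints a simple type string instead of building a Spark StructType. BookSchemaSketch and its methods are hypothetical names.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper object; not part of spark-xml.
object BookSchemaSketch {
  private val XsdNs = "http://www.w3.org/2001/XMLSchema"

  // The book/bookList XSD from the comment above.
  val xsd: String =
    """<?xml version="1.0" encoding="UTF-8"?>
      |<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      |  <xsd:element name="book">
      |    <xsd:complexType>
      |      <xsd:sequence>
      |        <xsd:element name="name" type="xsd:string" form="qualified"/>
      |        <xsd:element name="author" type="xsd:string" form="qualified"/>
      |        <xsd:element name="isbn" type="xsd:string" form="qualified"/>
      |      </xsd:sequence>
      |    </xsd:complexType>
      |  </xsd:element>
      |  <xsd:element name="bookList" type="BookList"/>
      |  <xsd:complexType name="BookList">
      |    <xsd:sequence>
      |      <xsd:element ref="book" minOccurs="0" maxOccurs="unbounded"/>
      |    </xsd:sequence>
      |  </xsd:complexType>
      |</xsd:schema>""".stripMargin

  // Direct children of `parent` that are XSD elements with the given local name.
  private def children(parent: Element, localName: String): Seq[Element] = {
    val nodes = parent.getChildNodes
    (0 until nodes.getLength).map(i => nodes.item(i)).collect {
      case e: Element if e.getNamespaceURI == XsdNs && e.getLocalName == localName => e
    }
  }

  private lazy val root: Element = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xsd.getBytes("UTF-8")))
      .getDocumentElement
  }

  // Global declarations keyed by unprefixed name (assumes no target namespace).
  private lazy val globalElements: Map[String, Element] =
    children(root, "element").map(e => e.getAttribute("name") -> e).toMap
  private lazy val globalTypes: Map[String, Element] =
    children(root, "complexType").map(t => t.getAttribute("name") -> t).toMap

  // Describe a complexType as struct<...> built from its xsd:sequence members.
  private def describeComplexType(ct: Element): String =
    children(ct, "sequence").flatMap(children(_, "element"))
      .map(describeField).mkString("struct<", ",", ">")

  // Describe one element declaration or reference as "name:type".
  private def describeField(e: Element): String = {
    val decl = if (e.hasAttribute("ref")) globalElements(e.getAttribute("ref")) else e
    val base =
      if (decl.hasAttribute("type")) {
        val t = decl.getAttribute("type")
        // A named complex type is expanded; anything else is treated as built-in.
        globalTypes.get(t).map(describeComplexType).getOrElse(t.split(":").last)
      } else describeComplexType(children(decl, "complexType").head)
    // Occurrence constraints sit on the referencing element, not on its target.
    val dataType = if (e.getAttribute("maxOccurs") == "unbounded") s"array<${base}>" else base
    s"${decl.getAttribute("name")}:${dataType}"
  }

  def describe(rootElementName: String): String =
    describeField(globalElements(rootElementName))
}
```

With this sketch, BookSchemaSketch.describe("book") returns book:struct<name:string,author:string,isbn:string>, and BookSchemaSketch.describe("bookList") returns bookList:struct<book:array<struct<name:string,author:string,isbn:string>>>. Either root can be picked on demand, which is the behaviour described above.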

srowen commented 1 year ago

OK. I don't think that's going to work here without more significant change, but you're welcome to try it. You can of course just write out the desired schema, or infer it from actual data.

shuch3ng commented 1 year ago

> OK. I don't think that's going to work here without more significant change, but you're welcome to try it. You can of course just write out the desired schema, or infer it from actual data.

I did consider both options, but each has a problem:

  1. There are more than 200 fields in this nested structure, so writing out the schema by hand is really a pain. And there are more schemas for other data to come...
  2. Some fields are optional and the raw XML doesn't contain them, so they cannot be inferred from the data.