databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

XSDToSchema fails on choice of sequence #675

Closed by iWantToKeepAnon 6 months ago

iWantToKeepAnon commented 7 months ago

https://github.com/databricks/spark-xml/blob/b2611bd20e917a75b7e96f5eb5cbc78f5ab21740/src/main/scala/com/databricks/spark/xml/util/XSDToSchema.scala#L223

It looks like an xs:choice can only contain an element or xs:any, per line 223 of v0.17.0. Below is a snippet of an industry-standard flight identification XSD. If I remove either the choice or the sequence, it works, but it fails with both. The XSD says the key has to be either (FlightNumber and iataCode) or just (FlightNumber), that is, a choice between a sequence and an element. The failure stack trace is below the XSD snippet.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
    targetNamespace="http://aeec.aviation-ia.net/633" version="4"
    xmlns="http://aeec.aviation-ia.net/633"
    xmlns:altova="http://www.altova.com/xml-schema-extensions"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:complexType name="FlightIdentificationType">
        <xs:choice>
            <xs:sequence>
                <xs:element form="qualified" minOccurs="0" name="FlightNumber" nillable="false" type="xs:string">
                    <xs:annotation>
                        <xs:documentation>commercial flight number</xs:documentation>
                    </xs:annotation>
                </xs:element>
                <xs:element form="qualified" minOccurs="0" name="iataCode" nillable="false" type="xs:string">
                    <xs:annotation>
                        <xs:documentation>airport code</xs:documentation>
                    </xs:annotation>
                </xs:element>
            </xs:sequence>
            <xs:element form="qualified" name="FlightNumber" nillable="false" type="xs:string">
                <xs:annotation>
                    <xs:documentation>commercial flight number</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:choice>
    </xs:complexType>

    <xs:element form="qualified" name="FlightIdentification" nillable="false" type="FlightIdentificationType">
        <xs:annotation>
            <xs:documentation>flight identifier or commercial flight number</xs:documentation>
        </xs:annotation>
    </xs:element>
</xs:schema>
Py4JJavaError: An error occurred while calling z:com.databricks.spark.xml.util.XSDToSchema.read.
: scala.MatchError: org.apache.ws.commons.schema.XmlSchemaSequence@1324c6d7 (of class org.apache.ws.commons.schema.XmlSchemaSequence)
    at com.databricks.spark.xml.util.XSDToSchema$.$anonfun$getStructFieldsFromParticle$2(XSDToSchema.scala:224)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructFieldsFromParticle(XSDToSchema.scala:224)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructField(XSDToSchema.scala:173)
    at com.databricks.spark.xml.util.XSDToSchema$.$anonfun$getStructType$1(XSDToSchema.scala:200)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructType(XSDToSchema.scala:195)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:74)
    at com.databricks.spark.xml.util.XSDToSchema.read(XSDToSchema.scala)
    at jdk.internal.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)
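
To find other places in a larger XSD that would hit the same MatchError, a small standalone check can help. The following is a minimal sketch using only the Python standard library. It assumes, based on the reading of line 223 above and not on any library documentation, that xs:element and xs:any are the only supported children of xs:choice. The function name and the trimmed SAMPLE schema are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumption from the description above: the match at line 223 only
# accepts these two kinds of children inside xs:choice.
SUPPORTED_IN_CHOICE = {XS + "element", XS + "any"}
# Children that carry no structure and are ignored by this check.
IGNORED = {XS + "annotation"}


def unsupported_choice_children(xsd_text):
    """List every direct child of an xs:choice that is not element/any."""
    root = ET.fromstring(xsd_text)
    findings = []
    for choice in root.iter(XS + "choice"):
        for child in choice:
            if child.tag not in SUPPORTED_IN_CHOICE and child.tag not in IGNORED:
                findings.append(child.tag.replace(XS, "xs:"))
    return findings


# Trimmed version of the schema above (annotations and namespaces removed).
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="FlightIdentificationType">
    <xs:choice>
      <xs:sequence>
        <xs:element minOccurs="0" name="FlightNumber" type="xs:string"/>
        <xs:element minOccurs="0" name="iataCode" type="xs:string"/>
      </xs:sequence>
      <xs:element name="FlightNumber" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
</xs:schema>
"""

print(unsupported_choice_children(SAMPLE))  # → ['xs:sequence']
```

This only flags the one pattern reported here; it does not predict whether the rest of a schema is supported.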
srowen commented 7 months ago

Yeah, this definitely does not parse all or even most XSDs, just simple ones. But yes, this match looks like it could try to handle a sequence inside a choice. Maybe it can just make a recursive call? Several parts of this could probably handle more complex structure just by calling back into the same method.
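
The recursive idea can be illustrated outside the library. The following is a rough sketch in Python over the raw XSD, not the Scala code in XSDToSchema and not a proposed patch. The function names, the (name, type, nullable) tuples, and the rule that everything under a choice becomes nullable are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def merge_by_name(fields):
    """Keep one field per name; nullable if any occurrence was nullable.
    If the declared types differ, the last one wins in this sketch."""
    merged = {}
    for name, xsd_type, nullable in fields:
        previous = merged.get(name)
        merged[name] = (name, xsd_type,
                        nullable or (previous is not None and previous[2]))
    return list(merged.values())


def fields_from_particle(particle, force_nullable=False):
    """Recursively collect (name, type, nullable) from an XSD particle."""
    fields = []
    for child in particle:
        if child.tag == XS + "element":
            optional = child.get("minOccurs", "1") == "0"
            fields.append((child.get("name"), child.get("type"),
                           force_nullable or optional))
        elif child.tag == XS + "sequence":
            # Nested group: call back into the same function.
            fields.extend(fields_from_particle(child, force_nullable))
        elif child.tag == XS + "choice":
            # Only one branch of a choice appears, so mark all as nullable.
            fields.extend(fields_from_particle(child, force_nullable=True))
        # xs:annotation and any other construct is skipped here.
    return merge_by_name(fields)


# Trimmed version of the reported schema (annotations removed).
SAMPLE = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="FlightIdentificationType">
    <xs:choice>
      <xs:sequence>
        <xs:element minOccurs="0" name="FlightNumber" type="xs:string"/>
        <xs:element minOccurs="0" name="iataCode" type="xs:string"/>
      </xs:sequence>
      <xs:element name="FlightNumber" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
</xs:schema>
"""

complex_type = ET.fromstring(SAMPLE).find(XS + "complexType")
print(fields_from_particle(complex_type))
# → [('FlightNumber', 'xs:string', True), ('iataCode', 'xs:string', True)]
```

Flattening loses the either/or constraint of the choice, which seems acceptable when the goal is only a DataFrame schema rather than validation.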

I don't work on this and it's not maintained now that it's copied into Spark 4, but if you have a straightforward change that makes your case work, I can get it in here and copy to Spark.

iWantToKeepAnon commented 6 months ago

Thanks for the reply. I don't have Scala experience or tooling, and our schema uses groups and other unsupported syntax, so I can't take the time to pursue creating a PR. This is a great tool; I wish it fit our project.