databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Adding complexContent Support for XsdToSchema #554

Closed exandi closed 1 year ago

exandi commented 3 years ago

Hey community, is there anything in the pipeline to support the XSD element complexContent?

Currently I am planning to convert an XSD file to a Spark DataFrame schema, but if the XSD contains a complexContent element the conversion fails.

srowen commented 3 years ago

I don't plan to work on this, but you are welcome to propose a PR if it's clean and adds support for this. But do you mean XSD validation, or XSD to schema? complexContent probably won't translate to a Spark schema. For XSD validation it probably already works? because it's just applying a standard XML parser with the XSD to it.

exandi commented 3 years ago

I tried to convert an XSD file to a schema and will provide the details tomorrow; if I comment the complexContent out it works fine, otherwise it crashes. I will also provide the stack trace tomorrow.

exandi commented 3 years ago

Hey, sorry for the delay.

I use a kerberized Hadoop cluster (HDP 3.1.4) with Spark 2.3.2.3.1.4.0-315. To get a Spark context I use a Jupyter notebook with the following commands:

import os, sys
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop_spark/conf"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYSPARK_PYTHON"] = os.environ["HOME"] + "/nfs/bin/python"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /jars/spark-xml_2.11-0.12.0.jar,/jars/xmlschema-core-2.2.5.jar pyspark-shell"

I need to add the xmlschema-core jar as well; otherwise I get a ClassNotFoundException.

After that, I create a SparkSession via:

spark = SparkSession \
    .builder \
    .appName("XMLTest") \
    .enableHiveSupport() \
    .config("spark.driver.allowMultipleContexts", "true") \
    .getOrCreate()

I am able to call the XSDToSchema read method via print(str(spark._jvm.com.databricks.spark.xml.util.XSDToSchema.read(xsdasstring))), where xsdasstring is the contents of the XSD file.

The XSD file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

If the element employee is of type fullpersoninfo, the code crashes with the following exception. If it is of type personinfo, it works and the schema is printed out.


Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>
      1 with open("test-fail.xsd", "r") as file:
      2     xsdasstring = file.read()
----> 3 print(str(spark._jvm.com.databricks.spark.xml.util.XSDToSchema.read(xsdasstring)))

/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326             raise Py4JJavaError(
    327                 "An error occurred while calling {0}{1}{2}.\n".
--> 328                 format(target_id, ".", name), value)
    329         else:
    330             raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.databricks.spark.xml.util.XSDToSchema.read.
: scala.MatchError: org.apache.ws.commons.schema.XmlSchemaComplexContent@5f025710 (of class org.apache.ws.commons.schema.XmlSchemaComplexContent)
    at com.databricks.spark.xml.util.XSDToSchema$.com$databricks$spark$xml$util$XSDToSchema$$getStructField(XSDToSchema.scala:123)
    at com.databricks.spark.xml.util.XSDToSchema$$anonfun$getStructType$1.apply(XSDToSchema.scala:213)
    at com.databricks.spark.xml.util.XSDToSchema$$anonfun$getStructType$1.apply(XSDToSchema.scala:208)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructType(XSDToSchema.scala:208)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:79)
    at com.databricks.spark.xml.util.XSDToSchema.read(XSDToSchema.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

The schema that is shown with personinfo:

StructType(StructField(employee,StructType(StructField(firstname,StringType,false), StructField(lastname,StringType,false)),false))

I hope the error is reproducible in Scala. Tell me if you need more information. Maybe I will find some deeper insights in the next days.

srowen commented 3 years ago

That is expected. This XSD element is not supported, as we said. It may not easily translate to a tabular schema.

srowen commented 2 years ago

Hm, I just made a fix for a similar issue, which I'll release soon in 0.13.0. I'm not sure it's the same issue, but it's worth trying again after 0.13.0 is out. https://github.com/databricks/spark-xml/pull/559

shuch3ng commented 1 year ago

I was trying to parse Example 2 in https://www.w3schools.com/xml/el_extension.asp:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="employee" type="fullpersoninfo"/>

<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="fullpersoninfo">
  <xs:complexContent>
    <xs:extension base="personinfo">
      <xs:sequence>
        <xs:element name="address" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="country" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

</xs:schema>

and got the following exception:

Unsupported content model: org.apache.ws.commons.schema.XmlSchemaComplexContent@5fe94a96
java.lang.IllegalArgumentException: Unsupported content model: org.apache.ws.commons.schema.XmlSchemaComplexContent@5fe94a96
    at com.databricks.spark.xml.util.XSDToSchema$.getStructField(XSDToSchema.scala:222)
    at com.databricks.spark.xml.util.XSDToSchema$.$anonfun$getStructType$1(XSDToSchema.scala:235)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at com.databricks.spark.xml.util.XSDToSchema$.getStructType(XSDToSchema.scala:230)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:49)
    at com.databricks.spark.xml.util.XSDToSchema$.read(XSDToSchema.scala:61)
    at com.databricks.spark.xml.util.XSDToSchemaSuite.$anonfun$new$9(XSDToSchemaSuite.scala:129)
    at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
    at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
    at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
    at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
    at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
    at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
    at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
    at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
    at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
    at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
    at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
    at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
    at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
    at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
    at org.scalatest.Suite.run(Suite.scala:1114)
    at org.scalatest.Suite.run$(Suite.scala:1096)
    at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
    at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
    at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
    at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
    at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
    at org.scalatest.funsuite.AnyFunSuite.run(AnyFunSuite.scala:1564)
    at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
    at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
    at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
    at org.scalatest.tools.Runner$.run(Runner.scala:798)
    at org.scalatest.tools.Runner.run(Runner.scala)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)

It's caused by the pattern match on complexType.getContentModel handling only XmlSchemaSimpleContent and null, but not XmlSchemaComplexContent (from line 122 in XSDToSchema.scala).
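The shape of that dispatch, and the missing branch, can be sketched in Python (the class names below are illustrative stand-ins for the org.apache.ws.commons.schema content-model classes, not the real API, and the field merging is a simplification of what the library does):

```python
from dataclasses import dataclass

# Illustrative stand-ins for the XmlSchema content-model classes.
@dataclass
class SimpleContent:
    base_type: str

@dataclass
class ComplexContent:
    base_type: str
    extra_fields: list  # (name, type) pairs declared in the xs:extension

def get_struct_fields(content_model, resolve_base):
    """Sketch of the match on complexType.getContentModel."""
    if content_model is None:
        return []  # plain xs:sequence, handled before the fix
    if isinstance(content_model, SimpleContent):
        return [("_VALUE", content_model.base_type)]  # handled before the fix
    if isinstance(content_model, ComplexContent):
        # The previously missing case: resolve the base type's fields
        # and append the extension's own fields.
        return resolve_base(content_model.base_type) + content_model.extra_fields
    raise ValueError(f"Unsupported content model: {content_model!r}")

base_types = {"personinfo": [("firstname", "string"), ("lastname", "string")]}
model = ComplexContent("personinfo",
                       [("address", "string"), ("city", "string"), ("country", "string")])
print(get_struct_fields(model, base_types.__getitem__))
```

Without the ComplexContent branch, the fullpersoninfo case falls through to the "Unsupported content model" error seen above; with it, the base and extension fields merge into one flat field list.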

I have added the code to parse this extension element within complexContent and created the pull request https://github.com/databricks/spark-xml/pull/631.
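For readers without a Spark/JVM setup at hand, the flattening that extension resolution produces can be approximated over the XSD above with the Python standard library (a rough sketch of the idea, not the library's actual algorithm; it only handles the xs:sequence and xs:extension shapes in this example):

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

xsd = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="employee" type="fullpersoninfo"/>
  <xs:complexType name="personinfo">
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="fullpersoninfo">
    <xs:complexContent>
      <xs:extension base="personinfo">
        <xs:sequence>
          <xs:element name="address" type="xs:string"/>
          <xs:element name="city" type="xs:string"/>
          <xs:element name="country" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>"""

schema = ET.fromstring(xsd)
types = {t.get("name"): t for t in schema.findall(f"{XS}complexType")}

def fields(ctype):
    """Collect (name, type) pairs, resolving xs:extension by prepending base fields."""
    ext = ctype.find(f"{XS}complexContent/{XS}extension")
    if ext is not None:
        base = fields(types[ext.get("base")])
        own = [(e.get("name"), e.get("type")) for e in ext.findall(f"{XS}sequence/{XS}element")]
        return base + own
    seq = ctype.find(f"{XS}sequence")
    return [(e.get("name"), e.get("type")) for e in seq.findall(f"{XS}element")]

print(fields(types["fullpersoninfo"]))
# -> firstname, lastname, address, city, country, each of type xs:string
```

This mirrors the expected outcome for the employee element once complexContent is supported: the two personinfo fields followed by the three fields added in the extension.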