databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

org.xml.sax.SAXParseException: Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000. #630

Closed aditi-kumari-singh closed 1 year ago

aditi-kumari-singh commented 1 year ago

Unable to parse nested XML using PySpark with XSDs; the following error is returned: org.xml.sax.SAXParseException: Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000.

In the Java APIs the fix is to set jdk.xml.maxOccurLimit=0. Where can we do this in Databricks?
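One plausible route (a sketch, not confirmed in this thread) is Spark's standard extraJavaOptions mechanism, which passes JVM system properties to the driver and executor JVMs; in Databricks this would go in the cluster's Spark config. The property name jdk.xml.maxOccurLimit is from the JDK's documented XML processing limits, where a value of 0 means no limit:

```
spark.driver.extraJavaOptions -Djdk.xml.maxOccurLimit=0
spark.executor.extraJavaOptions -Djdk.xml.maxOccurLimit=0
```

Setting it on both driver and executors matters because the XSD may be compiled on whichever JVM runs the parsing task.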

srowen commented 1 year ago

It doesn't sound like this question is about spark-xml itself, right? You can configure your SAX parser by setting attributes on the parser factory object.
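For a standalone JVM program, a minimal sketch of lifting the limit before the schema is compiled (the system property and its 0-means-unlimited semantics are from the JDK's XML processing-limits documentation; the schema path is hypothetical):

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

public class MaxOccursDemo {
    public static void main(String[] args) throws Exception {
        // Lift the JDK-wide maxOccurs limit (0 = no limit). This must be set
        // before the XSD is compiled, i.e. before SchemaFactory.newSchema runs.
        System.setProperty("jdk.xml.maxOccurLimit", "0");

        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // factory.newSchema(new java.io.File("your-schema.xsd")); // hypothetical path
        System.out.println(System.getProperty("jdk.xml.maxOccurLimit"));
    }
}
```

Equivalently, -Djdk.xml.maxOccurLimit=0 on the java command line sets the same property without code changes.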

aditi-kumari-singh commented 1 year ago

Actually, the error comes when the spark-xml library is used:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.0.1.4 executor driver): java.util.concurrent.ExecutionException: org.xml.sax.SAXParseException; systemId: file:/local_disk0/spark-0e00459b-fab1-47ed-bf54-658a2466adc3/userFiles-f8eafb2e-843f-481a-9cc5-d74a7934083c/auth.079.001.02_xxxxx_1.1.0.xsd; lineNumber: 5846; columnNumber: 99; Current configuration of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000.

srowen commented 1 year ago

Can you say more about how this arises? I ask because you mention XSDs. Also, is there more to the stack trace in the logs? What is maxOccurs set to in your XSD?