databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
Spark-xml not running on Databricks Runtime 11.0 #599

Closed TheDataDexter closed 2 years ago

TheDataDexter commented 2 years ago

I am trying to update my Databricks runtime to the newest version (DBR 11.0). However, the spark-xml package is not being installed properly. On the older Databricks runtimes the package is installed with no problems.

Maven coordinates: com.databricks:spark-xml_2.12:0.14.0
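For readers unfamiliar with the coordinate format, the string above splits into a group ID, an artifact ID carrying a Scala binary version suffix (`_2.12`), and a library version. The suffix has to match the Scala version of the cluster. A minimal sketch of that breakdown follows; `MavenCoordinate` and `parse_maven_coordinate` are hypothetical helper names used only for illustration, not part of spark-xml or Databricks.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    scala_version: str  # Scala binary version suffix, e.g. "2.12"
    version: str


def parse_maven_coordinate(coordinate: str) -> MavenCoordinate:
    """Split 'group:artifact_scalaVersion:version' into its parts.

    Hypothetical helper: assumes the three-part form used in this issue.
    """
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    group_id, artifact, version = parts
    # The Scala binary version follows the last underscore in the artifact name.
    artifact_id, sep, scala_version = artifact.rpartition("_")
    if not sep:
        raise ValueError(f"artifact {artifact!r} has no Scala version suffix")
    return MavenCoordinate(group_id, artifact_id, scala_version, version)


coord = parse_maven_coordinate("com.databricks:spark-xml_2.12:0.14.0")
print(coord)
```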

DBR 11.0 configurations:

DBR 10.5 configurations:

Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Library resolution failed because problem during retrieve of com.databricks#dbc-parent: java.lang.RuntimeException: Multiple artifacts of the module net.sourceforge.f2j#arpack_combined_all;0.1 are retrieved to the same file! Update the retrieve pattern to fix this error.
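When triaging a message like this, it can help to extract the module that the resolver is actually complaining about, since it may be a different dependency from the library being installed. The sketch below pulls the `org#name;revision` triple out of the message text. The function name `conflicting_module` is hypothetical, and the pattern only assumes the wording seen in the error above.

```python
import re
from typing import Optional, Tuple

# Matches "module <org>#<name>;<revision>" as it appears in the error text.
IVY_MODULE = re.compile(r"module (?P<org>[^#\s]+)#(?P<name>[^;\s]+);(?P<rev>\S+)")


def conflicting_module(message: str) -> Optional[Tuple[str, str, str]]:
    """Return (organisation, module name, revision), or None if not found."""
    match = IVY_MODULE.search(message)
    if match is None:
        return None
    return match["org"], match["name"], match["rev"]


error_message = (
    "Library resolution failed because problem during retrieve of "
    "com.databricks#dbc-parent: java.lang.RuntimeException: Multiple artifacts "
    "of the module net.sourceforge.f2j#arpack_combined_all;0.1 are retrieved "
    "to the same file! Update the retrieve pattern to fix this error."
)

module = conflicting_module(error_message)
print(module)
```

Here the extracted module is the arpack artifact rather than spark-xml itself.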

srowen commented 2 years ago

I think you're reporting this against DBR 11, not Spark 3.3.0 per se. Spark 3.3.0 runs fine, or at least the test suites do. 0.14.0 is also not the most recent version; 0.15.0 is, though that shouldn't make much difference.

I can't reproduce this on DBR 11 though. Cluster startup is fine after the library is installed too.

The error does not look directly related to spark-xml, which has nothing to do with arpack. Are you sure it's not some other library you are installing?

TheDataDexter commented 2 years ago

Thank you for your feedback. I was able to solve the bug by upgrading to version 0.15.0. This article helped me understand how Databricks manages Maven libraries.
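For anyone installing the library through the Databricks Libraries REST API instead of the cluster UI, the upgrade amounts to changing the version in the Maven coordinates. The following is a sketch of the request body for `POST /api/2.0/libraries/install`; `maven_install_payload` is a hypothetical helper name and `<cluster-id>` is a placeholder, not a real cluster ID.

```python
import json


def maven_install_payload(cluster_id: str, coordinates: str) -> dict:
    """Build the JSON body for installing one Maven library on a cluster."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }


payload = maven_install_payload(
    "<cluster-id>",  # placeholder, replace with the target cluster's ID
    "com.databricks:spark-xml_2.12:0.15.0",
)
print(json.dumps(payload, indent=2))
```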

srowen commented 2 years ago

I don't think that's related. The article also just describes how libraries are handled in Maven generally, nothing specific to Databricks. I don't believe this issue is related to spark-xml.