databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Parsing Nested XML :: Cannot use in Databricks/Scala environment #495

Closed: big-analytics closed this issue 4 years ago

big-analytics commented 4 years ago

I am using Azure Databricks on a single-node cluster with Spark 3.0.0, Scala 2.12 with spark-xml library installed: com.databricks:spark-xml_2.12:0.10.0

I am able to parse XML files directly, but I would like to parse XML strings stored in a DataFrame column, so the Nested XML approach seemed like the best solution.

I am doing this:

val df = spark.read.json("jsonfile.log")
val df_body = df.select("message.body") //xmlfield
val payloadSchema = schema_of_xml(df_body.select("body").as[String])

And getting this:

command-2027225130763768:4: error: not found: value schema_of_xml
val payloadSchema = schema_of_xml(df_body.select("body").as[String])

I would appreciate your help.

srowen commented 4 years ago

You probably didn't import com.databricks.spark.xml.schema_of_xml?
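For reference, a minimal sketch of the intended flow once the import is in place (assuming the `df` / `message.body` layout from the question above; `schema_of_xml` lives in the `com.databricks.spark.xml` package object, and `from_xml` in `com.databricks.spark.xml.functions`):

```scala
// Sketch assuming a Databricks notebook where `spark` is already defined
// and the JSON log has a message.body column holding XML strings.
import com.databricks.spark.xml.schema_of_xml
import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = spark.read.json("jsonfile.log")
val dfBody = df.select(col("message.body").as("body"))

// Infer a schema from the XML strings, then parse each string into a struct column.
val payloadSchema = schema_of_xml(dfBody.select("body").as[String])
val parsed = dfBody.withColumn("payload", from_xml(col("body"), payloadSchema))
```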

big-analytics commented 4 years ago

I forgot to mention that importing does not work either.

[screenshot: import error]

srowen commented 4 years ago

It sounds like you do not have the library actually installed on your cluster / with your app at all then. How are you adding it in Databricks? (try reattaching to the cluster after you install)

big-analytics commented 4 years ago

I think I do have it; I can read XML files.

[screenshot: reading an XML file successfully]

And the library is installed as far as I see.

[screenshot: installed libraries list]

srowen commented 4 years ago

That's very strange. I just tried attaching the same library to a cluster and it worked, imports and all. Try ... restarting the cluster? Not sure what could be the issue.

big-analytics commented 4 years ago

YES! That did the trick! So it's possible that the cluster needs a restart after the library is installed. Thanks a lot!

srowen commented 4 years ago

For Scala, it usually at least needs the notebook reattached after a JVM library is installed. If that doesn't work, then yes, restart. I didn't seem to need that myself, though, FWIW.