databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Flattening Nested XMLs to DataFrame #91

Closed logisticDigressionSplitter closed 5 years ago

logisticDigressionSplitter commented 8 years ago

Thanks for the very helpful module. I have the following XML structure that gets converted to a Row of PROP with the sequence inside. Is there any way to map the NAME attribute and the PVAL value to columns in a DataFrame?

<RECORDS>
 <RECORD>
    <PROP NAME="A">
      <PVAL>ValueA</PVAL>
    </PROP>
    <PROP NAME="B">
      <PVAL>ValueB</PVAL>
    </PROP>
 </RECORD>
</RECORDS>

This is what I end up with

root
 |-- PROP: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @NAME: string (nullable = true)
 |    |    |-- PVAL: string (nullable = true)
HyukjinKwon commented 8 years ago

@athelticoSometimes So you want to have a schema like the one below?

root
 |-- PROP: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- NAME: string (nullable = true)
 |    |    |-- PVAL: string (nullable = true)

Setting attributePrefix to an empty string will produce the schema above, as below:

val path = "path-for-your-file"
sqlContext.xmlFile(path, rowTag = "RECORD", attributePrefix = "").printSchema()

If you meant to produce each PROP as a row, then set rowTag to PROP.

Bertrandbenj commented 8 years ago

I think @athelticoSometimes's question is about how to flatten this, assuming you are working at the RECORD level.

When it's a simple struct you can do something like sqlContext.sql("SELECT RECORD.PROP.PVAL FROM ...."), but since PROP is an array of structs you can't write a SQL query past RECORD.PROP.
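
A minimal sketch of one workaround at the RECORD level, assuming the schema above and the Spark 1.x DataFrame API ("path-for-your-file" is a placeholder): explode turns each element of the PROP array into its own row, after which the struct fields are reachable with dot notation.

import org.apache.spark.sql.functions.explode

// Load with attributePrefix = "" so the attribute becomes a plain NAME field.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "RECORD")
  .option("attributePrefix", "")
  .load("path-for-your-file")

// One output row per PROP element, then select the struct fields.
val flat = df
  .select(explode(df("PROP")).as("prop"))
  .select("prop.NAME", "prop.PVAL")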

premsagarreddy commented 8 years ago

I am also facing issues when trying to read nested XMLs into DataFrames.

bbary commented 8 years ago

I have the same issue. I tried to define a nested custom schema, but I don't think it's possible. Does someone have an idea how we can solve this?
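
For what it's worth, a nested user-supplied schema is possible via the reader's .schema(...); here is a sketch matching the RECORD example at the top of the thread (field names assume attributePrefix = ""):

import org.apache.spark.sql.types._

// Hand-built schema for <RECORD><PROP NAME="..."><PVAL>...</PVAL></PROP></RECORD>
val customSchema = StructType(Seq(
  StructField("PROP", ArrayType(StructType(Seq(
    StructField("NAME", StringType, nullable = true),
    StructField("PVAL", StringType, nullable = true)
  ))), nullable = true)
))

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "RECORD")
  .option("attributePrefix", "")
  .schema(customSchema)
  .load("path-for-your-file")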

HyukjinKwon commented 8 years ago

Hm.. Let me try to think about this further as well. Thanks all for pointing this out.

Bertrandbenj commented 8 years ago

Hi,

This is just to contribute my findings; there is one use case for flattening that is easy:

Given the following schema of the input DF:

root
 |-- title: string (nullable = true)
 |-- author: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- initial: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- lastName: string (nullable = true)

show()

+--------------------+--------------------+
|               title|              author|
+--------------------+--------------------+
|Proper Motions of...|[[WrappedArray(J,...|
|Catalogue of 2055...|[[WrappedArray(J,...|
|                null|                null|
|Katalog von 3356 ...|[[WrappedArray(J)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Results of observ...|[[WrappedArray(H,...|
|      AGK3 Catalogue|[[WrappedArray(W)...|
|Perth 70: A Catal...|[[WrappedArray(E)...|

Let's say I want every title/author combination; I can achieve it with explode:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// explode() emits one row per element of the "author" array; the struct
// fields of each element are then addressable with dot notation.
DataFrame exploded = src.select(src.col("title"), functions.explode(src.col("author")).as("auth"))
                        .select("title", "auth.initial", "auth.lastName");
exploded = exploded.select(exploded.col("initial"),
                           exploded.col("title").as("title"),
                           exploded.col("lastName"));

There is an equivalent in Scala as a DataFrame function, but in Java I had to go through org.apache.spark.sql.functions.

root
 |-- initial: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)
 |-- lastName: string (nullable = true)

Now this is a relatively simple transform that expands the current row into as many rows as there are items in the array. It comes in handy, but there are other use cases out there which deserve proper documenting. My question now is: how can I build a simple string column "J H" from the array column initial "[J, H]"? The API functions available for the Column class don't seem to let me do that (see the sketch after the table below). I guess it's a learning curve issue.

+-------+--------------------+-------------+
|initial|               title|     lastName|
+-------+--------------------+-------------+
| [J, H]|Proper Motions of...|      Spencer|
|    [J]|Proper Motions of...|      Jackson|
| [J, H]|Catalogue of 2055...|      Spencer|

Thanks for the progress @HyukjinKwon, I can't wait to see the 0.3.3 package available, as I cannot build the project at work (restricted environment).

Here is the test file I'm toying with: nasa_subset.xml.txt

Hope it helps.

kkarthik21 commented 8 years ago

Hi @Bertrandbenj, is it possible to use the explode function in a flatMap? Can you please tell me how I can flatten the below array structure?

root
 |-- chunks: struct (nullable = true)
 |    |-- chunk: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- activity: string (nullable = true)
 |    |    |    |-- end: struct (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: long (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: long (nullable = true)
 |    |    |    |-- start: struct (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: long (nullable = true)
 |    |-- type: string (nullable = true)

@HyukjinKwon is this fixed?

rpslive commented 8 years ago

Hi @kkarthik21, you can explode the df on chunk; that expands the whole df into one row per entry of the chunk array, and then you can select each column you want from the resultant df, thus flattening the whole df.
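
A minimal sketch of that approach against the chunks schema above (df is the loaded DataFrame; the aliases are illustrative):

import org.apache.spark.sql.functions.{col, explode}

// One row per element of chunks.chunk, then pull the nested fields out by name.
val flat = df
  .select(explode(col("chunks.chunk")).as("c"), col("chunks.type").as("chunksType"))
  .select(
    col("c.activity"),
    col("c.id"),
    col("c.start.value").as("start"),
    col("c.end.value").as("end"),
    col("c.position.value").as("position"),
    col("chunksType"))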

HyukjinKwon commented 8 years ago

It seems there isn't one single, clean way to do this. I got some answers about this from the local Spark community.

I hope these links are helpful.

https://community.hortonworks.com/questions/43787/can-sparksql-write-a-flattened-json-table-to-a-fil.html
http://stackoverflow.com/questions/36149608/spark-sql-generate-array-of-arrays-from-the-sql-function
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element

I will bring another answer if I find a better way to do this.

HyukjinKwon commented 8 years ago

cc @srikanthjella FYI.

rajabhathor commented 8 years ago

Hello enthusiasts, I am happy to note good activity here... :) as I am myself getting into ingesting a large amount of XML Credit Reports at my work. One issue I face is that when compiling the code below in Maven/Eclipse, it fails saying "read" and hiveContext are deprecated. Here's the code that fails:

val hiveSchema = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", rootNode)
  .option("attributePrefix", "CF_")
  .option("valueTag", "VALUE")
  .load(fileWithPathName)

The error I get is: "value read is not a member of org.apache.spark.sql.hive.HiveContext". Any help appreciated! Thanks, Raj

kkarthik21 commented 8 years ago

@rajabhathor what's your spark version?

rajabhathor commented 8 years ago

Spark 2.10, Karthik....

someonehere15 commented 8 years ago

I see a big discussion here for a similar problem I have in #203. I iterated through the path and exploded everything to eliminate the arrays, but it is not the best way; besides, I get an OOM error.

Has anyone found a way to solve the nested array issue?

dtungbmw commented 7 years ago

Hi all, I used explode in Spark 1.6 and it works, but I have a performance issue: it can't handle a faster data feed. Is there a performance improvement in Spark 2.x?

Best

vnktsh commented 7 years ago

Try sqlContext.sql("SELECT inline(PROP) FROM ...")
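
For context: inline is the generator function that explodes an array of structs directly into columns, so NAME and PVAL come out without a further select. A sketch, assuming the RECORD-level DataFrame is registered as a temp table named records (a hypothetical name; inline is available natively from Spark 2.0, or via HiveContext before that):

df.registerTempTable("records") // createOrReplaceTempView in Spark 2.x
sqlContext.sql("SELECT inline(PROP) FROM records").show()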

samthebest commented 7 years ago

@Bertrandbenj or anyone: do you know of a clever way to automagically explode/flatten when the schema is not known up front?
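
A minimal recursive sketch of such an automagic flatten (not part of spark-xml): it repeatedly explodes the first array column and unpacks the first struct column until neither remains. Note that explode drops rows whose array is null or empty, and the name-prefixing is a simplistic way to avoid column collisions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenAll(df: DataFrame): DataFrame =
  df.schema.fields.collectFirst {
    case f if f.dataType.isInstanceOf[ArrayType] =>
      // Replace the array column with one row per element, then recurse.
      flattenAll(df.withColumn(f.name, explode(col(f.name))))
    case f if f.dataType.isInstanceOf[StructType] =>
      // Unpack struct fields into prefixed top-level columns, then recurse.
      val children = f.dataType.asInstanceOf[StructType].fieldNames
        .map(c => col(f.name + "." + c).as(f.name + "_" + c))
      val others = df.schema.fieldNames.filter(_ != f.name).map(col)
      flattenAll(df.select(others ++ children: _*))
  }.getOrElse(df)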

BioQwer commented 5 years ago

@samthebest use this comment

this lib automagically creates the DataFrame schema

the second variant is to create a case class and work with a Dataset
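
A sketch of that case-class variant (Spark 2.x), assuming the RECORD schema from the top of the thread; the field names must match the inferred schema, and spark and the path are placeholders:

case class Prop(NAME: String, PVAL: String)
case class Record(PROP: Seq[Prop])

import spark.implicits._

val ds = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "RECORD")
  .option("attributePrefix", "")
  .load("path-for-your-file")
  .as[Record]

// Once in the typed API, flattening is plain Scala collection code.
val flat = ds.flatMap(r => r.PROP.map(p => (p.NAME, p.PVAL))).toDF("NAME", "PVAL")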

HyukjinKwon commented 5 years ago

Let me leave this closed.