@athelticoSometimes So, you want to have a schema like the one below?
root
|-- PROP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- NAME: string (nullable = true)
| | |-- PVAL: string (nullable = true)
Setting attributePrefix to an empty string will produce the schema above, as below:
val path = "path-for-your-file"
sqlContext.xmlFile(path, rowTag = "RECORD", attributePrefix = "").printSchema()
If you meant to produce each PROP as a row, then set rowTag to PROP.
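For example, the same call as above with only the rowTag switched (path is the same placeholder as before):
sqlContext.xmlFile(path, rowTag = "PROP", attributePrefix = "").printSchema()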
I think athelticoSometimes's question is about how you flatten this, assuming you are working at the RECORD level.
When it's a simple struct you can do something like sqlContext.sql("SELECT RECORD.PROP.PVAL from ....") But since PROP is an array of structs, you can't SQL-query past RECORD.PROP.
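For reference, a minimal workaround sketch, assuming the rowTag = "RECORD" schema shown earlier (so PROP is a top-level array<struct> column): exploding PROP first makes the struct fields reachable.
import com.databricks.spark.xml._
import org.apache.spark.sql.functions.explode

// Each RECORD becomes a row; PROP is an array<struct<NAME, PVAL>>.
val df = sqlContext.xmlFile(path, rowTag = "RECORD", attributePrefix = "")
// explode() produces one row per PROP element, after which NAME and PVAL are plain columns.
df.select(explode(df("PROP")).as("p"))
  .select("p.NAME", "p.PVAL")
  .show()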
I am also facing issues when trying to read nested XML into DataFrames.
I have the same issue. I tried to define a nested custom schema, but I don't think it's possible. Does anyone have an idea how we can solve this?
Hm.. Let me try to think about this further as well. Thanks all for pointing this out.
Hi,
This is just to contribute my findings; there is one use case for flattening that is easy.
Given the following schema of the input DataFrame:
root
|-- title: string (nullable = true)
|-- author: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- initial: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- lastName: string (nullable = true)
show()
+--------------------+--------------------+
| title| author|
+--------------------+--------------------+
|Proper Motions of...|[[WrappedArray(J,...|
|Catalogue of 2055...|[[WrappedArray(J,...|
| null| null|
|Katalog von 3356 ...|[[WrappedArray(J)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Results of observ...|[[WrappedArray(H,...|
| AGK3 Catalogue|[[WrappedArray(W)...|
|Perth 70: A Catal...|[[WrappedArray(E)...|
Let's say I want every title/author combination; I can achieve it with "explode":
import org.apache.spark.sql.functions;

// Explode the author array into one row per author, then pull out the struct fields.
DataFrame exploded = src.select(src.col("title"), functions.explode(src.col("author")).as("auth"))
                        .select("title", "auth.initial", "auth.lastName");
exploded = exploded.select(exploded.col("initial"),
                           exploded.col("title").as("title"),
                           exploded.col("lastName"));
There is an equivalent in Scala as a DataFrame function, but in Java I had to go through import org.apache.spark.sql.functions;.
root
|-- initial: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- lastName: string (nullable = true)
Now, this is a relatively simple transform that expands the current row into as many rows as there are items in the array. It comes in handy, but there are other use cases out there which deserve proper documenting. My question now is how I can build a simple string column "J H" from the array column initial "[J, H]"; the API functions available on the Column class don't let me do that. I guess it's a learning-curve issue. (A possible approach with concat_ws is sketched after the table below.)
+-------+--------------------+-------------+
|initial| title| lastName|
+-------+--------------------+-------------+
| [J, H]|Proper Motions of...| Spencer|
| [J]|Proper Motions of...| Jackson|
| [J, H]|Catalogue of 2055...| Spencer|
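On the question above about building "J H" from the initial array column: a minimal sketch, assuming Spark 1.5+ where concat_ws in org.apache.spark.sql.functions accepts an array<string> column (shown in Scala, applied to the exploded DataFrame produced above):
import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws joins the elements of an array<string> column with the given separator,
// so [J, H] becomes "J H".
val withInitials = exploded.withColumn("initialStr", concat_ws(" ", col("initial")))
withInitials.select("initialStr", "title", "lastName").show()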
Thanks for the progress HyukjinKwon, I can't wait to see the 0.3.3 package available, as I cannot build the project at work (restricted environment).
Here is the test file I'm toying with: nasa_subset.xml.txt
Hope it helps.
Hi @Bertrandbenj, is it possible to use the explode function in a flatMap? Can you please tell me how I can flatten the array structure below?
root
|-- chunks: struct (nullable = true)
| |-- chunk: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- activity: string (nullable = true)
| | | |-- end: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- value: long (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- position: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- value: long (nullable = true)
| | | |-- start: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- value: long (nullable = true)
| |-- type: string (nullable = true)
@HyukjinKwon is this fixed?
Hi @kkarthik21, you can explode the df on chunk; it will expand the whole df into one row per entry of the chunk array, and then you can use the resulting df to select each column you want, thus flattening the whole df.
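A minimal sketch of what that could look like, assuming a DataFrame df with the chunks schema shown above (all column names are taken from that schema):
import org.apache.spark.sql.functions.explode

// chunks.chunk is an array<struct>, so explode it into one row per chunk element,
// then select the nested fields to get a flat DataFrame.
val exploded = df.select(explode(df("chunks.chunk")).as("c"), df("chunks.type").as("chunksType"))
val flat = exploded.select(
  exploded("c.activity"),
  exploded("c.id"),
  exploded("c.start.value").as("start"),
  exploded("c.end.value").as("end"),
  exploded("c.position.value").as("position"),
  exploded("chunksType"))
flat.printSchema()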
It seems there isn't one single, clean way to do this. I could get some answers about this from the local Spark community.
I hope these links are helpful.
https://community.hortonworks.com/questions/43787/can-sparksql-write-a-flattened-json-table-to-a-fil.html http://stackoverflow.com/questions/36149608/spark-sql-generate-array-of-arrays-from-the-sql-function http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element
I will bring another answer if I find there is a better way to do this.
cc @srikanthjella FYI.
Hello enthusiasts, I am happy to note the good activity here :) as I am myself getting into ingesting a large number of XML credit reports at my work. One issue I face is that, when compiling the code below in Maven/Eclipse, it fails saying "read" and hiveContext are deprecated. Here's the code that fails:
val hiveSchema = hiveContext.read.format("com.databricks.spark.xml").option("rowTag", rootNode).option("attributePrefix", "CF_").option("valueTag", "VALUE").load(fileWithPathName)
The error I get is as follows: "value read is not a member of org.apache.spark.sql.hive.HiveContext"
Any help appreciated! Thanks, Raj
@rajabhathor what's your spark version?
spark 2.10. Karthik....
I see a big discussion here for a similar problem I have in #203. I iterated through the path and exploded everything to eliminate arrays, but it is not the best way. Besides, I get an OOM error.
Anyone found any way to solve the nested array issue?
Hi all, I used explode in Spark 1.6 and it works, but I have a performance issue; it can't handle a faster data feed. Do we have performance improvements in Spark 2.x?
Best
Try sqlContext.sql("SELECT inline(PROP) from
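For anyone trying the inline() route, a minimal sketch, assuming Spark 2.0+ (where inline is a built-in generator function) and a DataFrame df with the PROP: array<struct<NAME, PVAL>> schema from earlier; the table name records is made up for the example:
// inline() expands an array<struct> column into one row per element,
// with one output column per struct field (here NAME and PVAL).
df.createOrReplaceTempView("records")
spark.sql("SELECT inline(PROP) FROM records").show()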
@Bertrandbenj, or anyone else, do you know of a clever way to automagically explode/flatten when the schema is not known up front?
@samthebest use this comment: this lib automagically creates the DataFrame schema; the second variant is to create a case class and work with a Dataset (see the sketch below).
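A hedged sketch of that second, case-class-based variant, assuming Spark 2.x and the RECORD/PROP/NAME/PVAL schema from earlier in the thread (the class names and the flattening step are illustrative only):
import spark.implicits._

// Case classes mirroring the rowTag = "RECORD" schema: PROP is an array of NAME/PVAL structs.
case class Prop(NAME: String, PVAL: String)
case class Record(PROP: Seq[Prop])

val ds = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "RECORD")
  .option("attributePrefix", "")
  .load(path)
  .as[Record]

// Flatten with a typed flatMap: one (NAME, PVAL) row per PROP element.
val flat = ds.flatMap(r => Option(r.PROP).getOrElse(Seq.empty).map(p => (p.NAME, p.PVAL)))
             .toDF("NAME", "PVAL")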
Let me leave this closed.
Thanks for the very helpful module. I have the following XML structure that gets converted to a Row of POP with the sequence inside. Is there any way to map the NAME attribute and the PVAL value to columns in the DataFrame?
This is what I end up with