Closed by giuseppeceravolo 2 years ago
There's no way to do it right now, but yeah, I think that's a relatively simple feature request -- if the idea is to have one new name for all array items, not a name per type or something. I could probably add that now.
As of now I do not have any other array column in the output XML file so just one name would be enough, thank you! I am working on Databricks where my cluster has version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) and "com.databricks:spark-xml_2.12:0.14.0" installed. Please let me know how I can get your changes from there.
By the way, do you believe it has something to do with the fact that "records" has 2 nested "element" fields? If there is a way to rename "element" to "record", could I fix it by myself?
For the sake of completeness, I am adding the code I use to write the XML file.
df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option('rootTag', 'inventory xmlns="http://www.demandware.com/xml/impex/inventory/2007-05-31"') \
    .option('rowTag', 'inventory-list') \
    .mode('overwrite') \
    .save('/mnt/container/folder/file.xml')
To get the changes, I'd have to make the changes and release a new version. That might take some time. But then you just install a new version as usual.
"element" isn't really part of the schema with respect to how you access it in Spark, so no, that's not related. You can't rename it, and it doesn't matter here.
Sorry to bother you, just so I know, when is version 0.16 going to be released? Because this feature is requested for one of my current projects. Thank you so much! 😁
I hadn't planned to make a new release for a while. Can you just build the library from source and use it right now?
I see. Is it possible to do so on Databricks? If so, could you please be so kind as to point out the best way to do it? Thank you for your support.
Sure, in Databricks you can just attach a JAR file to a cluster. You just need to build a JAR file -- one including all dependencies -- from the project. Check out the code and run sbt assembly and you should find the JAR in target/scala-2.12/spark-xml-assembly-0.16.0.jar. When it's released you'd also be able to just add it by Maven coordinates rather than build it.
Hi 😃 it's me again! Instead of having one name for all array items, now I need a way to specify the element name separately for each array... Do you believe such an enhancement would be possible? Thank you in advance!
Something like the following:
<?xml version="1.0" encoding="UTF-8"?>
<Inventory>
<Item>
...
<ItemAttribute Name="ATTRIBUTE1">
<AttributeCodeValue>
<AttributeCode>1</AttributeCode>
<AttributeValue>Value1</AttributeValue>
</AttributeCodeValue>
</ItemAttribute>
<ItemAttribute Name="ATTRIBUTE2">
<AttributeCodeValue>
<AttributeCode>2</AttributeCode>
<AttributeValue>Value2</AttributeValue>
</AttributeCodeValue>
</ItemAttribute>
<ItemAttribute Name="ATTRIBUTE3">
<AttributeCodeValue>
<AttributeCode>3</AttributeCode>
<AttributeValue>Value3</AttributeValue>
</AttributeCodeValue>
</ItemAttribute>
....
<ItemFranchisees>
<ItemFranchisee action="ADD" franchiseeId="F1" franchiseeName="F1"/>
<ItemFranchisee action="ADD" franchiseeId="F2" franchiseeName="F2"/>
<ItemFranchisee action="ADD" franchiseeId="F3" franchiseeName="F3"/>
</ItemFranchisees>
</Item>
</Inventory>
With the code below, as of 0.16.0, I get the elements inside "ItemFranchisees" named as "AttributeCodeValue", but I would like them to be named as "ItemFranchisee" (see example above).
df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option('rootTag', 'Inventory') \
    .option('rowTag', 'Item') \
    .option('arrayElementName', 'AttributeCodeValue') \
    .mode('overwrite') \
    .save('/mnt/container/folder/file.xml')
I don't think that's possible to support here easily. You can further transform the XML file with a library.
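For anyone landing here later, a minimal sketch of that kind of post-processing, using Python's standard xml.etree.ElementTree rather than anything from spark-xml. The helper name rename_children and the inline sample are made up for illustration, and the sample assumes no XML namespace on the tags:

```python
import xml.etree.ElementTree as ET


def rename_children(xml_text, parent_tag, new_child_tag):
    """Rename every direct child of each <parent_tag> element to new_child_tag.

    Hypothetical helper for illustration; not part of spark-xml.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so every matching parent is handled
    for parent in root.iter(parent_tag):
        for child in parent:
            child.tag = new_child_tag
    return ET.tostring(root, encoding="unicode")


# Small stand-in for the written file: the array items inside
# ItemFranchisees carry the single name set by arrayElementName.
sample = (
    "<Inventory><Item>"
    "<ItemFranchisees>"
    '<AttributeCodeValue action="ADD" franchiseeId="F1" franchiseeName="F1"/>'
    '<AttributeCodeValue action="ADD" franchiseeId="F2" franchiseeName="F2"/>'
    "</ItemFranchisees>"
    "</Item></Inventory>"
)

result = rename_children(sample, "ItemFranchisees", "ItemFranchisee")
print(result)
```

On a real output you would load the written part file with ET.parse(...) and save it back with tree.write(..., encoding="UTF-8", xml_declaration=True). This parses the whole document in memory, so it only suits files that fit on one machine, and tags under a default namespace would need the namespace-qualified form of the tag names.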
I see. Thank you anyway for your prompt reply.
I am writing the XML file below and would like to know how I can rename "item" to "record" (since those items are within the "records" tag). Perhaps there is a way to change the value "item" in here.
Here is the schema of my dataframe: