databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Can "item" of ArrayType be renamed via an option when writing an XML file? #602

Closed: giuseppeceravolo closed this issue 2 years ago

giuseppeceravolo commented 2 years ago

I am writing the XML file below and would like to know how I can rename "item" to "record" (since those items sit inside the "records" tag). Perhaps there is a way to change the value "item" in here.

<?xml version="1.0" encoding="UTF-8"?>
<inventory xmlns="http://www.domain.com/xml/">
    <inventory-list>
        <header list-id="myShop">
            <default>false</default>
        </header>
        <records>
            <item product-id="xxxxxx-yyy1">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy2">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy3">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy4">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy5">
                <qty>0</qty>
            </item>
        </records>
    </inventory-list>
</inventory>

Here is the schema of my dataframe:

root
 |-- header: struct (nullable = true)
 |    |-- _list-id: string (nullable = true)
 |    |-- default: boolean (nullable = true)
 |-- records: array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- _product-id: string (nullable = true)
 |    |    |    |-- qty: integer (nullable = true)

srowen commented 2 years ago

There's no way to do it right now, but I think that's a relatively simple feature request, provided the idea is to have one new name for all array items rather than one per type or something. I could probably add that now.

giuseppeceravolo commented 2 years ago

As of now I do not have any other array column in the output XML file so just one name would be enough, thank you! I am working on Databricks where my cluster has version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) and "com.databricks:spark-xml_2.12:0.14.0" installed. Please let me know how I can get your changes from there.

By the way, could this be related to the fact that "records" has 2 nested "element" fields? If there were a way to rename "element" to "record", could I fix it myself?

giuseppeceravolo commented 2 years ago

For the sake of completeness, I am adding the code I use to write the XML file.

df \
  .coalesce(1) \
  .write \
  .format('com.databricks.spark.xml') \
  .option('declaration', 'version="1.0" encoding="UTF-8"') \
  .option('rootTag', 'inventory xmlns="http://www.demandware.com/xml/impex/inventory/2007-05-31"') \
  .option('rowTag', 'inventory-list') \
  .mode('overwrite') \
  .save('/mnt/container/folder/file.xml')

srowen commented 2 years ago

To get the changes, I'd have to make the changes and release a new version. That might take some time. But then you just install a new version as usual.

No, that's not related. "element" isn't really part of the schema w.r.t. how you access it in Spark; it is just how the array's element type is displayed. You can't rename it, and it doesn't matter.

srowen commented 2 years ago

https://github.com/databricks/spark-xml/pull/603
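
For readers following along: later in this thread the new writer option shows up as arrayElementName (used with 0.16.0). Below is a minimal sketch of how the original write call might look with it. The helper names xml_write_options and write_xml are hypothetical, invented here for illustration, and the snippet assumes a cluster where a spark-xml build containing the option is attached.

```python
def xml_write_options(root_tag, row_tag, array_element_name=None):
    """Collect spark-xml writer options in a plain dict (hypothetical helper)."""
    options = {
        "declaration": 'version="1.0" encoding="UTF-8"',
        "rootTag": root_tag,
        "rowTag": row_tag,
    }
    # Only set arrayElementName when asked to; otherwise the writer keeps
    # its default element name ("item", as seen in the XML above).
    if array_element_name is not None:
        options["arrayElementName"] = array_element_name
    return options


def write_xml(df, path, options):
    """Write a DataFrame as XML; df is assumed to be a pyspark.sql.DataFrame."""
    (df.coalesce(1)
       .write
       .format("com.databricks.spark.xml")
       .options(**options)
       .mode("overwrite")
       .save(path))


opts = xml_write_options("inventory", "inventory-list", array_element_name="record")
# write_xml(df, "/mnt/container/folder/file.xml", opts)  # needs Spark with spark-xml attached
```

Note that the option applies one name to every array in the output, which matches the "one new name for all array items" scope discussed above.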

giuseppeceravolo commented 2 years ago

Sorry to bother you, but just so I know, when is version 0.16 going to be released? I ask because this feature is required for one of my current projects. Thank you so much! 😁

srowen commented 2 years ago

I hadn't planned to make a new release for a while. Can you just build the library from source and use it right now?

giuseppeceravolo commented 2 years ago

I see. Is it possible to do so on Databricks? If so, could you please point out the best way to do it? Thank you for your support.

srowen commented 2 years ago

Sure, in Databricks you can just attach a JAR file to a cluster. You just need to build a JAR file, one including all dependencies, from the project. Check out the code and run sbt assembly, and you should find the JAR at target/scala-2.12/spark-xml-assembly-0.16.0.jar. Once it's released, you'll also be able to add it by Maven coordinates rather than building it.

giuseppeceravolo commented 1 year ago

Hi 😃 it's me again! Instead of having one name for all array items, I now need a way to specify the element name for each array... Do you think such an enhancement would be possible? Thank you in advance!

Something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<Inventory>
    <Item>
        ...
        <ItemAttribute Name="ATTRIBUTE1">
            <AttributeCodeValue>
                <AttributeCode>1</AttributeCode>
                <AttributeValue>Value1</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        <ItemAttribute Name="ATTRIBUTE2">
            <AttributeCodeValue>
                <AttributeCode>2</AttributeCode>
                <AttributeValue>Value2</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        <ItemAttribute Name="ATTRIBUTE3">
            <AttributeCodeValue>
                <AttributeCode>3</AttributeCode>
                <AttributeValue>Value3</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        ....
        <ItemFranchisees>
            <ItemFranchisee action="ADD" franchiseeId="F1" franchiseeName="F1"/>
            <ItemFranchisee action="ADD" franchiseeId="F2" franchiseeName="F2"/>
            <ItemFranchisee action="ADD" franchiseeId="F3" franchiseeName="F3"/>
        </ItemFranchisees>
    </Item>
</Inventory>

With the code below, as of 0.16.0, the elements inside "ItemFranchisees" are named "AttributeCodeValue", but I would like them to be named "ItemFranchisee" (see the example above).

df \
  .coalesce(1) \
  .write \
  .format('com.databricks.spark.xml') \
  .option('declaration', 'version="1.0" encoding="UTF-8"') \
  .option('rootTag', 'Inventory') \
  .option('rowTag', 'Item') \
  .option('arrayElementName', 'AttributeCodeValue') \
  .mode('overwrite') \
  .save('/mnt/container/folder/file.xml')

srowen commented 1 year ago

I don't think that's easy to support here. You can further transform the XML file with another library after it is written.
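
As one possible illustration of that kind of post-processing, here is a minimal sketch using Python's standard xml.etree.ElementTree. The function name rename_children and the inline sample document are assumptions made for this example; it is not part of spark-xml. It renames only the direct children of a given parent tag, so "AttributeCodeValue" elements under "ItemAttribute" are left alone.

```python
import xml.etree.ElementTree as ET


def rename_children(root, parent_tag, old_tag, new_tag):
    """Rename direct children called old_tag under every parent_tag element.

    Returns the number of elements renamed. Tags are compared literally, so
    namespaced documents would need the '{namespace}tag' form.
    """
    renamed = 0
    for parent in root.iter(parent_tag):
        for child in parent:
            if child.tag == old_tag:
                child.tag = new_tag
                renamed += 1
    return renamed


# Small stand-in for the file written with arrayElementName='AttributeCodeValue'.
sample = """<Inventory>
  <Item>
    <ItemAttribute Name="ATTRIBUTE1">
      <AttributeCodeValue><AttributeCode>1</AttributeCode></AttributeCodeValue>
    </ItemAttribute>
    <ItemFranchisees>
      <AttributeCodeValue action="ADD" franchiseeId="F1" franchiseeName="F1"/>
      <AttributeCodeValue action="ADD" franchiseeId="F2" franchiseeName="F2"/>
    </ItemFranchisees>
  </Item>
</Inventory>"""

root = ET.fromstring(sample)
count = rename_children(root, "ItemFranchisees", "AttributeCodeValue", "ItemFranchisee")
result = ET.tostring(root, encoding="unicode")
```

For a real file, ET.parse(path) followed by tree.write(path, encoding="UTF-8", xml_declaration=True) would replace the in-memory string. The path to the written XML is something you would need to adapt, since Spark usually writes its output as part files inside the target directory.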

giuseppeceravolo commented 1 year ago

I see. Thank you anyway for your prompt reply