databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

strange tag while writing xml with nullValue #652

Closed groneveld closed 1 year ago

groneveld commented 1 year ago

While writing XML with:

df.repartition(1).write.format("xml")\
    .option("rootTag", "DataContainer")\
    .option("rowTag", "Свед")\
    .mode("overwrite")\
    .save(data_path)

I get a strange tag in the resulting XML for elements that were passed with null values. For example:

<БазСвед>
    <Фамилия>some text</Фамилия>
     ....
</БазСвед>
<СведЕСИА/>

Check the last row; note the slash and its position in the tag.

Using: spark-xml_2.12-0.14.0.jar, txw2-2.3.4.jar, Scala 2.12.15, Spark 3.3.1

srowen commented 1 year ago

What is strange about this?

groneveld commented 1 year ago

@srowen this is neither an opening nor a closing tag; it stands alone. If we had data inside, the tags would look like <БазСвед> (open) ... </БазСвед> (close),

and neither should appear if the value is null.

groneveld commented 1 year ago

@srowen it happens each time we have a nested structure whose nested elements are all null. That is how the parent element is written; I think it should not be written at all.

srowen commented 1 year ago

All the tags you show are closed, I'm not sure what you're referring to?

groneveld commented 1 year ago

@srowen

<БазСвед>
    <Фамилия>some text</Фамилия>
     ....
</БазСвед>
<СведЕСИА/>

<СведЕСИА/> - this is the one from my example. It stands alone.

srowen commented 1 year ago

No, that is a closed tag. It's empty. <foo/> is the same as <foo></foo> in XML.
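A quick way to see this equivalence is with Python's standard-library `xml.etree.ElementTree` (a minimal illustration, not part of spark-xml): both spellings parse to the same empty element and serialize identically.

```python
import xml.etree.ElementTree as ET

# <foo/> and <foo></foo> parse to identical trees: an empty element
# with no text and no children.
a = ET.fromstring("<СведЕСИА/>")
b = ET.fromstring("<СведЕСИА></СведЕСИА>")

print(a.tag == b.tag, a.text == b.text, len(a) == len(b))  # → True True True

# Both round-trip to the same serialized form.
print(ET.tostring(a, encoding="unicode") == ET.tostring(b, encoding="unicode"))  # → True
```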

groneveld commented 1 year ago

@srowen OK, but is there any way to avoid writing it at all, since it is empty? None of the nested tags are written, which is what we expect, so why is the empty parent tag still written?

srowen commented 1 year ago

I don't think it's possible, no. You can post-process the XML how you like though.
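For the post-processing route, here is a minimal sketch (not part of spark-xml; the function name `drop_empty_elements` and the assumption that each empty element appears as a standalone self-closing tag on its own line are mine):

```python
import re

def drop_empty_elements(xml_text: str, tag: str) -> str:
    """Remove lines containing only an empty self-closing <tag/> element.

    Assumes the writer emits each empty element on its own line, as in
    the output shown above. A slash preceded by optional whitespace
    ("<tag/>" or "<tag />") is matched.
    """
    pattern = rf"^\s*<{re.escape(tag)}\s*/>\s*\n"
    return re.sub(pattern, "", xml_text, flags=re.MULTILINE)

sample = """<DataContainer>
    <Свед>
        <БазСвед>
            <Фамилия>some text</Фамилия>
        </БазСвед>
        <СведЕСИА/>
    </Свед>
</DataContainer>
"""

# The empty <СведЕСИА/> line is dropped; everything else is untouched.
print(drop_empty_elements(sample, "СведЕСИА"))
```

A regex is fragile in general XML (it would miss `<tag></tag>` spellings or attributes), so for anything beyond this exact output shape a real parser such as `lxml` would be a safer choice.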