databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Arrays with null values are written as empty tags on the XML file #692

Open Matew92 opened 1 week ago

Matew92 commented 1 week ago

I'm using the library on a nested DataFrame, for example:

This is my schema:

StructField("A", ArrayType(StructType([
    StructField("B", StructType([
        StructField("C", StringType(), True),
        StructField("D", ArrayType(StructType([
            StructField("E", StringType(), True),
            StructField("F", StringType(), True)
        ])), True)
    ]))
])))

This is my data:

  "A": [{
            "B": {
                "C": "somthing",
                "D": [{
                    "E": None,
                    "F": None
                }]
            }
        }]

What I would expect is something like:

<A>
    <B>
       <C>somthing</C>
   </B>
</A>

But I get:

<A>
    <B>
       <C>somthing</C>
       <D/>
   </B>
</A>

Has anyone run into the same issue? Is there a way to get the behaviour I want? I tried .option("ignoreNullFields", "true"), but I get the same output as described above.

srowen commented 1 week ago

I don't know if one is right-er than the other. They are slightly different situations: a child with nothing in it, vs a parent with no children. That said, I don't think the current behavior is strongly motivated; it's just how it happened.

I would probably not change behavior at this point unless it's demonstrably problematic.

Matew92 commented 1 week ago

Hi Srowen,

Thanks for your fast reply. I already get that behaviour with plain fields (if a field in the DataFrame is null, it is not written to the XML file), so I was expecting the same for an empty array (or at least an option for it?).
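
As a possible workaround in the meantime, the all-null arrays could be turned into real nulls before the DataFrame is written. The sketch below is plain Python and assumes the records are still available as dicts, as in the data above. `nullify_empty` is a hypothetical helper, not part of spark-xml, and whether the writer then omits the field relies on the null-field behaviour described in this comment rather than on anything verified here.

```python
def nullify_empty(value):
    """Replace structs and arrays that carry no data with None.

    - dict: becomes None when every value is None after cleaning
    - list: None items are dropped; becomes None when nothing is left
    - anything else is returned unchanged
    """
    if isinstance(value, dict):
        cleaned = {key: nullify_empty(item) for key, item in value.items()}
        return cleaned if any(v is not None for v in cleaned.values()) else None
    if isinstance(value, list):
        cleaned = [c for c in (nullify_empty(item) for item in value) if c is not None]
        return cleaned or None
    return value


# The record from the report: D holds one struct whose fields are all None.
record = {"A": [{"B": {"C": "somthing", "D": [{"E": None, "F": None}]}}]}
cleaned_record = nullify_empty(record)
# D is now None instead of a list with an all-null struct.
```

The keys are kept (with a None value) rather than dropped, so the cleaned record still lines up with the schema above.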

srowen commented 1 week ago

I think there's a difference between [] and None which is sort of mirrored here - that's not a missing array, it's an empty array. I think you could argue behavior either way, neither is that much more reasonable. But I would not change behavior that's stood for so long unless it was clearly wrong.
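
To make the two situations concrete, here is a toy writer in plain Python. It is not spark-xml's code, only a model of the behaviour reported in this issue: a None value produces no element, while an array holding one struct whose fields are all None produces an empty tag. `append_field` and `render` are made-up names.

```python
import xml.etree.ElementTree as ET


def append_field(parent, name, value):
    """Toy model of the reported behaviour, not the library's implementation."""
    if value is None:
        return  # missing value: nothing is written
    if isinstance(value, list):
        for item in value:  # one element per array item
            append_field(parent, name, item)
        return
    child = ET.SubElement(parent, name)
    if isinstance(value, dict):
        for key, item in value.items():
            append_field(child, key, item)
    else:
        child.text = str(value)


def render(tag, fields):
    root = ET.Element(tag)
    for key, item in fields.items():
        append_field(root, key, item)
    return ET.tostring(root, encoding="unicode")


# D is missing entirely: no D element appears.
missing_array = render("B", {"C": "somthing", "D": None})

# D has one item with nothing in it: an empty D element appears.
empty_item = render("B", {"C": "somthing", "D": [{"E": None, "F": None}]})
```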

Matew92 commented 1 week ago

Yes, I agree with you that an empty array is different from a None (so indeed, I would not change the default behavior). However, for big data purposes, having an option to print or not print empty nested arrays would be really helpful because it optimizes the size of the XML file.

For example, in my case, I get 2-3 level nested data frames, and the results are all these empty tags for the arrays in a 100GB file.

The result is something like this for each row:

  <a>
      <b>
          <c/>
          <d/>
      </b>
      <e/>
      <f>
          <g/>
          <h/>
          <i>
              <m/>
              <n/>
          </i>
          <o/>
      </f>
  </a>
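
If the empty tags cannot be avoided at write time, one option is to strip them afterwards. Below is a minimal sketch with Python's standard `xml.etree.ElementTree`, assuming a single row fits in memory (a 100GB file would need a streaming variant, for example one built on `ET.iterparse`). `strip_empty` is a hypothetical helper, not something spark-xml provides.

```python
import xml.etree.ElementTree as ET


def strip_empty(elem):
    """Remove descendants that have no text, no attributes and no children."""
    for child in list(elem):  # copy the child list, since we remove while iterating
        strip_empty(child)    # clean deeper levels first so emptied parents are caught too
        has_text = bool(child.text and child.text.strip())
        if len(child) == 0 and not has_text and not child.attrib:
            elem.remove(child)
    return elem


# The row from the example above, written without whitespace.
row = "<a><b><c/><d/></b><e/><f><g/><h/><i><m/><n/></i><o/></f></a>"
root = strip_empty(ET.fromstring(row))
remaining_tags = [child.tag for child in root]

# Elements that carry a value are kept.
mixed = strip_empty(ET.fromstring("<B><C>somthing</C><D/></B>"))
mixed_xml = ET.tostring(mixed, encoding="unicode")
```

Note that this also removes tags that are intentionally empty, so it only fits cases where an empty element carries no meaning for the consumer.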