Rows without 100% identical children are dropped from the DataFrame

E-HO commented 2 years ago

Hi,

I have an issue where some rows are completely missing from my final DataFrame (this can be up to 50% of the original content). To build it, I read the XML file and then loop over the Struct fields and explode the values; everything seems fine up to that point.

However, it only works when every node in the XML has the same child nodes. If one node has a "missing" child, so that its children are not 100% the same as the others', the whole row is lost and the final result is incorrect.

I was happy to see that such a bug was documented in issue #513, but the problem still seems to occur: I tested on DB Runtime >= 10.3 - .5, Scala 2.12, Maven packages 0.12 / 0.14 / 0.15. Is it possible that there is another, related issue?

Regards,

srowen commented 2 years ago

Are you saying it's the same issue as #513 or a different one? Can you give an example? I don't quite understand it. Always use the latest package.

E-HO commented 2 years ago

It seems to be the same: once my DataFrame is flattened, XML nodes that don't have exactly the same children as their siblings are removed / lost.

Example :

<parent>
     <node id="1"><child_1 id="c1_1"/><child_2 id="c2_1"/></node>
     <node id="2"><child_1 id="c1_2"/><child_2 id="c2_2"/></node>
     <node id="3"><child_1 id="c1_3"/><child_2 id="c2_3"/><some_extra_node value="foo"/></node>
</parent>

In my case, the nodes with id 1 and 2 don't have the "some_extra_node" item, so they are removed: my DF only contains the node with id 3. If I drop("some_extra_node"), then my DF contains all 3 rows.
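To make the expected result concrete, here is a minimal sketch in plain Python (standard library only, not spark-xml and not the code from the notebook) that flattens the example above. The function name `flatten` and the `node_id` key are made up for illustration. The point is only that a node lacking `some_extra_node` should yield a row with an empty value for that column rather than disappear.

```python
import xml.etree.ElementTree as ET

# The example from this thread, unchanged.
XML = """<parent>
     <node id="1"><child_1 id="c1_1"/><child_2 id="c2_1"/></node>
     <node id="2"><child_1 id="c1_2"/><child_2 id="c2_2"/></node>
     <node id="3"><child_1 id="c1_3"/><child_2 id="c2_3"/><some_extra_node value="foo"/></node>
</parent>"""


def flatten(xml_text):
    """Return one dict per <node>, with None for children a node lacks."""
    nodes = ET.fromstring(xml_text).findall("node")

    # Union of child tags across all nodes, in first-seen order.
    columns = []
    for node in nodes:
        for child in node:
            if child.tag not in columns:
                columns.append(child.tag)

    rows = []
    for node in nodes:
        row = {"node_id": node.get("id")}
        for tag in columns:
            child = node.find(tag)
            # A missing child becomes None; the row itself is kept.
            row[tag] = None if child is None else dict(child.attrib)
        rows.append(row)
    return rows


rows = flatten(XML)
for row in rows:
    print(row)
```

All three nodes come back as rows; only the third has a non-empty `some_extra_node`.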

srowen commented 2 years ago

Sounds like the same issue. Can you confirm you are using version 0.15.0 for sure? How do you read the data?

E-HO commented 2 years ago

I confirm I tried the last 2 versions, including 0.15, on 2 different clusters, with everything stripped out except the strict minimum of code.

The files are quite "huge", not really in size (roughly 30-50 megabytes) but in the number of nodes I'm supposed to obtain (roughly 10 thousand). Maybe this is part of the issue, because if I read the full file, I can point to some missing data. If I write a new file containing one node that "works" and another that "doesn't work" (so 2 out of 10,000), it seems to work fine.
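One way to narrow down a file of this size before sharing it might be to group the nodes by the set of child tags they contain and pick one representative per group. The sketch below uses only the Python standard library; `child_signatures` and the `row_tag` parameter are hypothetical names, and the sample document is invented for the demonstration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def child_signatures(source, row_tag="node"):
    """Count row elements by the sorted set of their direct child tags.

    `source` is a file name or a binary file object. The file is streamed
    with iterparse, so it does not have to fit in memory as a whole tree.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            signature = tuple(sorted({child.tag for child in elem}))
            counts[signature] += 1
            # Release the children of this row once it has been counted.
            elem.clear()
    return counts


# Invented sample: two rows with children (a, b), one with (a, b, c).
sample = io.BytesIO(
    b"<parent>"
    b"<node id='1'><a/><b/></node>"
    b"<node id='2'><a/><b/><c/></node>"
    b"<node id='3'><a/><b/></node>"
    b"</parent>"
)
counts = child_signatures(sample)
for signature, n in counts.items():
    print(n, signature)
```

If the number of rows in the DataFrame matches the count of only one signature, that would point at the differing children rather than at the file size.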

The code to read is like :

srowen commented 2 years ago

Hm, this 'works' for me in the latest version, but the example you gave causes a different error. You can't have an attribute called "value", because it produces a "_value" col in the struct for this child, and there is already a default "_VALUE" col for the contents of the node (a name you can change from its default).

Once I change that to something besides "value" I get the expected result?

root
 |-- node: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: long (nullable = true)
 |    |    |-- child_1: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |-- child_2: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |-- some_extra_node: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _bar: string (nullable = true)

+----------------------------------------------------------------------------------------------------------------------------+
|node                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------+
|[{1, {null, c1_1}, {null, c2_1}, null}, {2, {null, c1_2}, {null, c2_2}, null}, {3, {null, c1_3}, {null, c2_3}, {null, foo}}]|
+----------------------------------------------------------------------------------------------------------------------------+
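The "_value" / "_VALUE" clash mentioned above can be illustrated with a small sketch. This is not spark-xml's code: `struct_columns` and `case_insensitive_clashes` are hypothetical helpers, and the model is a simplification based on the schema printed in this thread (each attribute becomes a prefixed column, plus one column for the node contents). It also assumes the two names collide because Spark SQL resolves column names case-insensitively by default. The `attribute_prefix` and `value_tag` parameters mirror the `attributePrefix` and `valueTag` read options.

```python
def struct_columns(attributes, attribute_prefix="_", value_tag="_VALUE"):
    """Simplified model of the columns generated for one child element."""
    return [attribute_prefix + name for name in attributes] + [value_tag]


def case_insensitive_clashes(columns):
    """Return pairs of distinct names that are equal when case is ignored."""
    seen = {}
    clashes = []
    for col in columns:
        key = col.lower()
        if key in seen and seen[key] != col:
            clashes.append((seen[key], col))
        else:
            seen.setdefault(key, col)
    return clashes


# Attribute named "value": the generated "_value" meets the default "_VALUE".
bad = case_insensitive_clashes(struct_columns(["value"]))

# Attribute renamed to "bar", as in the schema above: no clash.
renamed = case_insensitive_clashes(struct_columns(["bar"]))

# Keeping "value" but choosing a different value tag also avoids the clash.
retagged = case_insensitive_clashes(struct_columns(["value"], value_tag="_text"))

print(bad, renamed, retagged)
```

Under these assumptions, either renaming the attribute or changing the value tag removes the collision.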

E-HO commented 2 years ago

Can I privately send you one example file (not intended to be public) and a complete notebook with the scripts used for this?

srowen commented 2 years ago

OK, srowen@gmail.com . Keep it simple please if you can :)

databricks / spark-xml

Issue #589