databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

<xs:choice maxOccurs="unbounded"> does not produce array type #687

Open: marpetr opened this issue 2 months ago

marpetr commented 2 months ago

Example input:

    <xs:complexType name="Fruits">
        <xs:choice maxOccurs="unbounded">
            <xs:element name="apple" type="js:Apple"/>
            <xs:element name="orange" type="js:Orange"/>
        </xs:choice>
    </xs:complexType>

Expected output: fruits: struct<apple: array<struct<...>>, orange: array<struct<...>>>

Actual output: fruits: struct<apple: struct<...>, orange: struct<...>>
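
To make the difference concrete, the two schemas can be written out with Spark's type API. This is only a sketch: the issue elides the fields of `js:Apple` and `js:Orange`, so the single `name` field below is a hypothetical placeholder, and the object name `FruitsSchemas` is made up for illustration.

```scala
import org.apache.spark.sql.types._

object FruitsSchemas {
  // Hypothetical stand-ins for js:Apple and js:Orange; their real fields are not shown in the issue.
  val apple: StructType = StructType(Seq(StructField("name", StringType)))
  val orange: StructType = StructType(Seq(StructField("name", StringType)))

  // Actual output: each alternative of the choice becomes a single struct.
  val actualFruits: StructType = StructType(Seq(
    StructField("apple", apple),
    StructField("orange", orange)))

  // Expected output per the report: each alternative becomes an array of structs.
  val expectedFruits: StructType = StructType(Seq(
    StructField("apple", ArrayType(apple)),
    StructField("orange", ArrayType(orange))))

  def main(args: Array[String]): Unit = {
    // Print both schemas in Spark's compact notation for comparison.
    println(actualFruits.simpleString)
    println(expectedFruits.simpleString)
  }
}
```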

Proposed fix: https://github.com/databricks/spark-xml/blob/ddd1ef573a5318748763fafc974e4f7d8876fd6f/src/main/scala/com/databricks/spark/xml/util/XSDToSchema.scala#L227

    -    if (element.getMaxOccurs == 1) {
    +    if (element.getMaxOccurs == 1 && choice.getMaxOccurs == 1) {

srowen commented 2 months ago

I think it would be fine to support this. I think the change is somewhat different though.

If an xs:element within the xs:choice has maxOccurs > 1, then the field for that element is an array type. That much works now.

If the xs:choice itself has maxOccurs > 1, then the result isn't a struct but an array of structs. I think the resulting Seq of StructField would have to be wrapped in another ArrayType to express this.

If you can try that and it works, would you open a pull request to test it?
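
The wrapping described in this comment could look roughly like the sketch below. It is not the code in XSDToSchema.scala: `ChoiceWrapping`, `wrapChoice`, and the `choiceMaxOccurs` parameter are hypothetical names, and the sketch assumes that maxOccurs is available as a number, with "unbounded" represented by some value greater than 1.

```scala
import org.apache.spark.sql.types._

object ChoiceWrapping {
  // Hypothetical helper: turn the fields collected from an xs:choice into a Spark type.
  // A choice that may occur once stays a plain struct; a repeatable choice
  // becomes an array whose elements are that struct.
  def wrapChoice(fields: Seq[StructField], choiceMaxOccurs: Long): DataType = {
    val struct = StructType(fields)
    if (choiceMaxOccurs == 1) struct else ArrayType(struct)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder fields; the real element types are elided in the issue.
    val fields = Seq(
      StructField("apple", StringType),
      StructField("orange", StringType))

    println(wrapChoice(fields, 1L).simpleString)
    println(wrapChoice(fields, Long.MaxValue).simpleString)
  }
}
```

Note that this produces an array of structs, which differs from the struct of arrays requested in the original report, so whichever shape is chosen would need to be confirmed in the pull request.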

ranadheerg commented 1 month ago

@marpetr @srowen can I take over this issue? I would like to work on it if possible.

srowen commented 1 month ago

Sure, you can open a pull request if you like.