databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

rowValidationXSDPath schema does not enforce data types from Int to String #633

Closed Yanis77240 closed 1 year ago

Yanis77240 commented 1 year ago

I'm using spark-xml in my Databricks project to read data that is stored in a nested XML structure. I also have an XSD schema to enforce which elements and attributes have to be in the XML file, as well as the data types for each element and attribute. One of the attributes should be of type string: <xs:attribute name="AttributeName" type="xs:string" use="required" />

One of those XML files has a value in the form of a long for this attribute: "0***3". But when reading this XML file, the row does not get reported as _corrupt_record, so I assume the value was interpreted as a string and therefore passed the schema validation. Yet after reading the file into a DataFrame, the attribute has a data type of long, and I can't really explain this behavior. The most likely reason is that the value of the attribute consists only of digits [0-9], so Spark interprets it as a long; but in that case, how did the value pass the schema validation?

Note: If needed, I can provide small pieces of my XSD and XML files.

srowen commented 1 year ago

I think that makes sense. In XML, everything is a string at heart. If your XSD says some element is a string, then any value will pass that validation.

How spark-xml tries to interpret those values is a separate matter, unrelated to the validation. It tries to parse values as something more specific than a string, and if your values parse fine as integers, it assumes that's what they are (otherwise all it could do is give you everything as strings).

What do you want, a string? You can cast, or supply your own schema that says the value is a string and shouldn't be further interpreted. That can be handy when you're dealing with ID-like values like "0001355235", which can be parsed as numbers but probably shouldn't be, because they're not "really" numbers.
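For example, a minimal PySpark sketch of both approaches, assuming the spark session already exists (as on Databricks); the row tag "Record", the attribute name, and the paths are placeholders, not taken from the actual files:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, StructField, StructType

# Option 1: let spark-xml infer types, then cast the ID-like column back to string.
# Attributes show up with the default attribute prefix "_".
df = (spark.read.format("xml")
      .option("rowTag", "Record")                              # assumed row tag
      .option("rowValidationXSDPath", "/path/to/schema.xsd")   # assumed path
      .load("/path/to/files/*.xml"))
df = df.withColumn("_AttributeName", col("_AttributeName").cast("string"))

# Option 2: supply an explicit schema so the value is never inferred as a long.
schema = StructType([
    StructField("_AttributeName", StringType(), nullable=False),
    # ... the rest of the fields ...
])
df2 = (spark.read.format("xml")
       .option("rowTag", "Record")
       .option("rowValidationXSDPath", "/path/to/schema.xsd")
       .schema(schema)
       .load("/path/to/files/*.xml"))
```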

srowen commented 1 year ago

Oh, you can also try to use your XSD as the schema it will use in parsing - that's different from validating against the XSD. XSDToSchema does this, but not all XSDs make sense as tabular schemas, and the support is somewhat limited. That might be a better option: derive the schema you want from the XSD rather than let it be inferred.

Yanis77240 commented 1 year ago

Hello @srowen !

Yeah exactly, I want ID-like values such as "0001355235" to be treated as strings and not numbers. My first thought was that providing an XSD file via rowValidationXSDPath stating that this attribute should be of type string (meaning everything will pass, since as you said everything is intrinsically a string in XML) would also drive the parsing, so that when read into a DataFrame the value would be of type string by default (which is not the case, as it ends up as long in the DataFrame).

So it means that Spark automatically parses ID-like values as long, which I find disturbing.

The other thing is that, depending on the XML file, there will be different attributes and elements in it. So creating my own schema can be quite complex (there are more than 200 elements/attributes). As for XSDToSchema, I am using PySpark (and even with Scala, it seems my XSD can't be interpreted to generate the corresponding Spark DataFrame schema).

Unless there are other ways to do the parsing as well as the validation...

srowen commented 1 year ago

So, XSD validation is unrelated to what you want here. It only lets you assert that the XML matches an XSD; the data is then parsed by Spark separately. To be clear, XSD validation has nothing to do with how the XML is parsed here.

Spark parses the data according to the schema (not the XSD) you provide. If you don't provide one, it's inferred. I don't find the result 'wrong' here, but you are free to provide a different schema that says "this is a string".

If you don't want to write the schema by hand, you can let it infer the schema (without otherwise processing the data), modify the resulting schema object, and then pass that schema back to Spark for the actual read.
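A rough PySpark sketch of that infer-then-override idea; the row tag, field names, and paths are assumptions, and only a top-level field is patched here (nested structs would need a recursive walk):

```python
from pyspark.sql.types import StringType, StructField, StructType

# 1. Read once (or from a small sample) and let spark-xml infer the schema.
inferred = (spark.read.format("xml")
            .option("rowTag", "Record")              # assumed row tag
            .load("/path/to/sample.xml")
            .schema)

# 2. Rebuild the schema, forcing the ID-like attribute to stay a string.
fields = [StructField(f.name, StringType(), f.nullable)
          if f.name == "_AttributeName" else f       # assumed attribute name
          for f in inferred.fields]
schema = StructType(fields)

# 3. Re-read the full data with the corrected schema; XSD validation still applies.
df = (spark.read.format("xml")
      .option("rowTag", "Record")
      .option("rowValidationXSDPath", "/path/to/schema.xsd")  # assumed path
      .schema(schema)
      .load("/path/to/files/*.xml"))
```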