databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Timestamps not matching format are replaced with nulls #662

Closed dolfinus closed 10 months ago

dolfinus commented 10 months ago

Hi.

I'm trying to parse a simple XML file:

<item>
  <created-at>2021-01-01T01:01:01+00:00</created-at>
</item>

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])
spark.read.format("xml").options(rowTag='item').schema(schema).load("1.xml").show()

Result:

|created-at         |
|-------------------|
|2021-01-01 01:01:01|

But if the timestamp does not match the expected format, e.g. the T is replaced with a space:

<item>
  <created-at>2021-01-01 01:01:01+00:00</created-at>
</item>

It is read as null:

|created-at|
|----------|
|null      |

I see that there is an option mode, with PERMISSIVE as the default: when it encounters a field of the wrong datatype, it sets the offending field to null. But the malformed value is not added to the _corrupt_record column, because there is nothing wrong with the XML structure. So there is no way to tell whether the input file contains a tag with a malformed value or a nullValue, unless the user sets a different mode. Is that the desired behavior?

srowen commented 10 months ago

You did not include the column _corrupt_record in your schema. It's added automatically if you infer the schema; otherwise you need to add it yourself. If the column is not present, the malformed record can't be stored in it.

dolfinus commented 10 months ago

Tried:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType()), StructField("_corrupt_record", StringType())])
spark.read.format("xml").options(rowTag='item').schema(schema).load("1.xml").show(10, False)

|created-at|_corrupt_record                                                      |
|----------|---------------------------------------------------------------------|
|null      |<item>\n  <created-at>2021-01-01 01:01:01+00:00</created-at>\n</item>|

It would be worth mentioning in the README that _corrupt_record must be added explicitly to the DataFrame schema.