databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

spark.sql.session.timeZone not taken into account while reading XML #650

Closed BaptistePiron closed 1 year ago

BaptistePiron commented 1 year ago

Hello,

I'm trying to read an XML file with PySpark that contains a timestamp attribute: publicationDate="2023-05-24T08:36:01". This timestamp has no timezone set, and the correct interpretation is to read it as a CET timestamp, not a UTC timestamp.

Unfortunately, Spark reads the date as a UTC timestamp. When displaying it, I get 2023-05-24 10:36:01 (with spark.sql.session.timeZone = 'CET') while I should get 2023-05-24 08:36:01.

I'm reading the timestamp with: spark_session.read.format("com.databricks.spark.xml").options(timestampFormat="yyyy-MM-dd'T'HH:mm:ss", timezone="Europe/Brussels")

I'm using "spark.jars.packages":"com.databricks:spark-xml_2.12:0.16.0".

And the Spark session is launched with spark.sql.session.timeZone = "CET". I double-checked that it was set correctly with spark_session.conf.get("spark.sql.session.timeZone"), which returns 'CET'.
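The two-hour shift I'm seeing can be reproduced outside Spark. A minimal plain-Python sketch of what appears to happen (the naive string is anchored to UTC first, then rendered in the session time zone):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

raw = "2023-05-24T08:36:01"  # naive timestamp from the XML attribute

# The built-in parse treats the naive string as a UTC instant...
as_utc = datetime.fromisoformat(raw).replace(tzinfo=timezone.utc)

# ...so rendering it in the session zone (Europe/Brussels, UTC+2 in May
# because of DST) shifts the wall-clock time by two hours.
shown = as_utc.astimezone(ZoneInfo("Europe/Brussels"))
print(shown.strftime("%Y-%m-%d %H:%M:%S"))  # 2023-05-24 10:36:01
```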

Any idea why spark.sql.session.timeZone isn't taken into account?

Thank you for your help. Regards, Baptiste Piron

srowen commented 1 year ago

Yeah, tough one. You're treating that input as a point in time, but it is ambiguous.

I spotted one issue, but I think it's not related. spark.sql.session.timeZone always overrides the "timezone" option, but you set both.

Dates are always parsed against standard formats before custom formats, and one of those standard formats matches your input: it will try the "2023-05-24T08:36:01" format and assume UTC. I think this is for compatibility with how XSD / XML date parsing is defined?

Making this assume a different timezone could break that, though I see the argument for it. Hm.

Can you just adjust the resulting time? Either shift it by the right amount, or read the field as a string and parse it as a date however you like? Any chance you can fix the upstream data source?
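For the "shift" option, the idea is to keep the wall-clock fields Spark parsed and re-anchor them in the intended zone. A plain-Python sketch with a hypothetical reinterpret_as_brussels helper (not part of spark-xml; in Spark the same idea could be applied with pyspark.sql.functions.to_utc_timestamp or a UDF):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def reinterpret_as_brussels(parsed_utc: datetime) -> datetime:
    """Take an instant that was parsed as UTC, keep its wall-clock fields,
    and re-anchor them in Europe/Brussels (hypothetical helper)."""
    naive = parsed_utc.replace(tzinfo=None)             # recover 08:36:01
    return naive.replace(tzinfo=ZoneInfo("Europe/Brussels"))

wrong = datetime(2023, 5, 24, 8, 36, 1, tzinfo=timezone.utc)
fixed = reinterpret_as_brussels(wrong)
print(fixed.isoformat())                            # 2023-05-24T08:36:01+02:00
print(fixed.astimezone(timezone.utc).isoformat())   # 2023-05-24T06:36:01+00:00
```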

BaptistePiron commented 1 year ago

Hello srowen,

Thank you so much for taking the time to look at the issue. Unfortunately, I cannot edit the XML itself at the source. Overwriting the field afterwards via Spark is not really an option so far because the XML has a lot of nested fields (array of struct of array...), which makes it difficult to use .withColumn.

Any other idea? Basically I need it to assume that the date is CET and not UTC.

Thanks!

srowen commented 1 year ago

It's not crazy to infer the schema to find all the relevant timestamp fields then re-parse them in a loop or something.
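The "find all the timestamp fields" part can be a simple recursive walk over the inferred schema. A sketch in plain Python, with the schema modelled as nested dicts and arrays as one-element lists for illustration only (Spark's StructType/ArrayType could be walked the same way):

```python
def timestamp_paths(schema: dict, prefix: str = "") -> list:
    """Collect dotted paths of every 'timestamp' leaf in a nested schema
    (dict stands in for StructType, one-element list for ArrayType)."""
    paths = []
    for name, dtype in schema.items():
        path = prefix + name
        if isinstance(dtype, list):      # array: descend into element type
            dtype = dtype[0]
        if isinstance(dtype, dict):      # struct: recurse
            paths.extend(timestamp_paths(dtype, path + "."))
        elif dtype == "timestamp":
            paths.append(path)
    return paths

schema = {
    "publicationDate": "timestamp",
    "items": [{"created": "timestamp", "title": "string"}],
}
print(timestamp_paths(schema))  # ['publicationDate', 'items.created']
```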

The only thing I can think of doing here is to try the user-specified custom timestamp format (and timezone) first. If it succeeds, take that result. That is, you'd have to specify both a custom format and timezone, but they would take precedence over the built-in XSD formats.

I'm always worried this could break existing code, but would only affect use cases with a custom format, and where the custom format matched but wasn't being used because a built-in format succeeded. Maybe OK.
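The proposed precedence can be sketched in a few lines of plain Python (not spark-xml's actual code; the strptime pattern is the Python translation of the Java pattern yyyy-MM-dd'T'HH:mm:ss): try the custom format anchored in the user's timezone first, and only fall back to the built-in ISO/XSD parse that assumes UTC.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def parse_ts(raw, custom_fmt=None, tz=None):
    """Sketch of the proposed precedence: custom format + timezone first,
    then the built-in ISO/XSD parse that assumes UTC."""
    if custom_fmt and tz:
        try:
            naive = datetime.strptime(raw, custom_fmt)
            return naive.replace(tzinfo=ZoneInfo(tz))
        except ValueError:
            pass  # custom format didn't match; fall through to built-in
    return datetime.fromisoformat(raw).replace(tzinfo=timezone.utc)

t = parse_ts("2023-05-24T08:36:01", "%Y-%m-%dT%H:%M:%S", "Europe/Brussels")
print(t.isoformat())  # 2023-05-24T08:36:01+02:00
```

With no custom format supplied, behaviour is unchanged, which is what limits the blast radius to the "custom format matched but was being shadowed" case.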