databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Allow custom timestamp with Spark timezone property #621

Closed: JorisTruong closed this 1 year ago

JorisTruong commented 1 year ago

Related to issue #612 and to previous pull request #616.

There are still some issues: according to the Spark source code, the default value of spark.sql.session.timeZone comes from Java's TimeZone.getDefault.getID, and it can result in a null value.

As a result, setting spark.sql.session.timeZone becomes mandatory; otherwise spark-xml throws a NoSuchElementException when it tries to retrieve the Spark property with the spark.conf.get() method. This can be reproduced by running the XmlPartitioningSuite.

We may still need a default value for the timezone.
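A minimal sketch of the kind of fallback meant here, not the code from the pull request. The object name TimezoneFallback, the helper resolveZone and the "timezone" option key are hypothetical, and plain Maps stand in for the reader options and the Spark session configuration so that the example runs without Spark:

```scala
import java.time.ZoneId

object TimezoneFallback {
  // Hypothetical helper, not spark-xml's API.
  // `options` stands in for the data source options,
  // `sessionConf` for the Spark session configuration.
  def resolveZone(
      options: Map[String, String],
      sessionConf: Map[String, String]): ZoneId = {
    options.get("timezone")                                    // 1. explicit option (assumed key)
      .orElse(sessionConf.get("spark.sql.session.timeZone"))   // 2. Spark session property
      .map(id => ZoneId.of(id))
      .getOrElse(ZoneId.systemDefault())                       // 3. JVM default as last resort
  }
}
```

Looking the key up as an Option avoids the NoSuchElementException when the property is unset. Against a real session, spark.conf.getOption(key) or spark.conf.get(key, default) should play the same role as the Map lookup above.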

srowen commented 1 year ago

Take a look at this change -- I think the core of this works? Maybe adapt this approach: https://github.com/databricks/spark-xml/pull/624

JorisTruong commented 1 year ago

I think you have the best answer; I added some more tests in the pull request. I'll try to look into why the tests are failing, though.

srowen commented 1 year ago

I think I figured out the test failure: a tiny but subtle issue in handling the param map. See my latest push.

JorisTruong commented 1 year ago

@srowen thank you so much for your help!

closes #612