databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

XML Timestamp parsing without timezone #612

Closed JorisTruong closed 1 year ago

JorisTruong commented 1 year ago

Description

I am currently encountering an issue when parsing timestamps in a specific format. My XML data contains timestamps in the format yyyy/MM/dd HH:mm:ss, and I am trying to read them using the timestampFormat option.

Looking at the parseXmlTimestamp() function in TypeCast.scala, spark-xml uses Timestamp.from(ZonedDateTime.parse(value, format).toInstant) to parse the timestamp. However, a format like yyyy/MM/dd HH:mm:ss carries no timezone information, so ZonedDateTime.parse throws a DateTimeParseException (screenshot omitted). The whole timestamp column then comes back as null.
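The failure comes from java.time itself rather than spark-xml specifically: ZonedDateTime.parse needs zone information that the pattern does not supply. A minimal plain-Java sketch (the same java.time calls the Scala code makes) of the failure and of the withZone workaround proposed below:

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TimestampZoneDemo {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");

        // The pattern has no zone or offset field, so ZonedDateTime.parse fails:
        try {
            ZonedDateTime.parse("2011/12/03 10:15:30", fmt);
        } catch (DateTimeParseException e) {
            System.out.println("parse failed: " + e.getMessage());
        }

        // Supplying an override zone on the formatter makes the parse succeed:
        ZonedDateTime zdt =
            ZonedDateTime.parse("2011/12/03 10:15:30", fmt.withZone(ZoneId.of("UTC")));
        System.out.println(zdt.toInstant()); // 2011-12-03T10:15:30Z
    }
}
```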

To reproduce

time.xml file:

<book>
    <author>John Smith</author>
    <time>2011-12-03T10:15:30Z</time>
    <time2>2011/12/03 10:15:30</time2>
</book>

PySpark code to read:

df = spark.read \
        .format("com.databricks.spark.xml") \
        .options(rowTag="book", timestampFormat="yyyy/MM/dd HH:mm:ss") \
        .schema("author STRING, time TIMESTAMP, time2 TIMESTAMP") \
        .load("<path_to_file>")

display(df)
df.printSchema()
Result:

| author | time | time2 |
| --- | --- | --- |
| John Smith | 2011-12-03T10:15:30.000+0000 | null |

Suggestion

We may want to add a default timezone in the parseXmlTimestamp() function, as in the following sketch (originally attached as a screenshot):

Any thoughts about adding a timeZone option?

private[xml] class XmlOptions(
    @transient private val parameters: Map[String, String])
  extends Serializable {

  def this() = this(Map.empty)

  val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)
  ...
  val timestampFormat = parameters.get("timestampFormat")
  val timeZone = parameters.getOrElse("timeZone", XmlOptions.DEFAULT_TIME_ZONE) // <----- HERE
  val dateFormat = parameters.get("dateFormat")
}

private[xml] object XmlOptions {
  val DEFAULT_ATTRIBUTE_PREFIX = "_"
  ...
  val DEFAULT_TIME_ZONE = "UTC" // <----- HERE

  def apply(parameters: Map[String, String]): XmlOptions = new XmlOptions(parameters)
}
And in TypeCast.scala:

  private def parseXmlTimestamp(value: String, options: XmlOptions): Timestamp = {
    val timeZone = options.timeZone // <----- HERE
    val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>
      try {
        return Timestamp.from(ZonedDateTime.parse(value, format.withZone(ZoneId.of(timeZone))).toInstant) // <----- HERE
      } catch {
        case _: Exception => // continue
      }
    }
    throw new IllegalArgumentException(s"cannot convert value $value to Timestamp")
  }
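The proposed loop can be sketched in plain Java as well. Note that the formatter list and class names here are illustrative stand-ins, not the actual supportedXmlTimestampFormatters from TypeCast.scala:

```java
import java.sql.Timestamp;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

public class ParseXmlTimestampSketch {
    // Illustrative formatter list; the real supportedXmlTimestampFormatters differs.
    static final List<DateTimeFormatter> FORMATTERS = List.of(
        DateTimeFormatter.ISO_OFFSET_DATE_TIME,            // e.g. 2011-12-03T10:15:30Z
        DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss") // zone-less custom pattern
    );

    static Timestamp parseXmlTimestamp(String value, String timeZone) {
        for (DateTimeFormatter fmt : FORMATTERS) {
            try {
                // withZone supplies a default zone when the text carries none
                return Timestamp.from(
                    ZonedDateTime.parse(value, fmt.withZone(ZoneId.of(timeZone))).toInstant());
            } catch (Exception e) {
                // try the next formatter, mirroring the Scala loop
            }
        }
        throw new IllegalArgumentException("cannot convert value " + value + " to Timestamp");
    }

    public static void main(String[] args) {
        // Both a zoned and a zone-less value now resolve to the same instant:
        System.out.println(parseXmlTimestamp("2011-12-03T10:15:30Z", "UTC").toInstant());
        System.out.println(parseXmlTimestamp("2011/12/03 10:15:30", "UTC").toInstant());
    }
}
```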

With this change, parseXmlTimestamp() parses both values successfully (screenshot omitted), and reading the file then returns this DataFrame:

| author | time | time2 |
| --- | --- | --- |
| John Smith | 2011-12-03T10:15:30.000+0000 | 2011-12-03T10:15:30.000+0000 |
srowen commented 1 year ago

I see, you want custom patterns to also specify a custom timezone. Really, if anything, it should use the default Spark timezone, for consistency with the rest of Spark. Can you not control the data source to specify the TZ? That would be much better. I'll comment on your PR.