NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support parsing dates and timestamps with timezone IDs in the string #6846

Open revans2 opened 2 years ago

revans2 commented 2 years ago

Is your feature request related to a problem? Please describe.

When parsing CSV or JSON, or when casting a string to a timestamp or a date, the string itself is allowed to contain a timezone ID. Ideally we should support this, although it might take some custom kernel work. If we can cache the names of all of the time zones on the GPU along with the offset tables, similar to what is described in https://github.com/NVIDIA/spark-rapids/issues/6831 and https://github.com/NVIDIA/spark-rapids/issues/6840, then we could detect zones written as UTC+/- and GMT+/- as well as the actual named zones. (There are also aliases for UTC, etc., that we will need a way to express.) With that in place, the parsing code can look up the corresponding timezone and adjust the result appropriately. Making this fully work will likely need more custom kernels.
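As a CPU-side reference for the lookup-and-adjust step described above, here is a minimal Java sketch. It is not the plugin's implementation and it does not reproduce Spark's full string grammar: it assumes a simplified input of an ISO local date-time, optionally followed by a space and a zone ID. The class `ZoneIdSuffixSketch` and the helper `parseWithZone` are hypothetical names. Alias handling is approximated with the JDK's `ZoneId.SHORT_IDS` map; a GPU version would need its own cached name and offset tables.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

class ZoneIdSuffixSketch {
    // Hypothetical helper: parses "<ISO local date-time>[ <zone id>]" and
    // returns the corresponding UTC instant. The input format is an
    // assumption made for this sketch, not Spark's actual grammar.
    static Instant parseWithZone(String input, ZoneId defaultZone) {
        String s = input.trim();
        int sep = s.indexOf(' ');
        String localPart = sep < 0 ? s : s.substring(0, sep);
        // ZoneId.of resolves named zones, UTC+/- and GMT+/- prefixed offsets,
        // and (through SHORT_IDS) a few legacy aliases such as "PST".
        ZoneId zone = sep < 0
            ? defaultZone
            : ZoneId.of(s.substring(sep + 1).trim(), ZoneId.SHORT_IDS);
        // Interpret the local date-time in that zone, then normalize to UTC.
        return LocalDateTime.parse(localPart).atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        ZoneId utc = ZoneOffset.UTC;
        // Named zone: Los Angeles is UTC-8 in January.
        System.out.println(parseWithZone("2023-01-15T12:00:00 America/Los_Angeles", utc)); // 2023-01-15T20:00:00Z
        // Prefixed offset.
        System.out.println(parseWithZone("2023-01-15T12:00:00 UTC+08:00", utc)); // 2023-01-15T04:00:00Z
        // No zone in the string: fall back to the default zone.
        System.out.println(parseWithZone("2023-01-15T12:00:00", utc)); // 2023-01-15T12:00:00Z
    }
}
```

The point of the sketch is the order of operations: split off the zone text, resolve it to a set of offset rules, then shift the parsed local time. Any GPU kernel would have to reproduce the same steps against its cached tables.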

res-life commented 10 months ago

Alfred is also working on this.

NVnavkumar commented 9 months ago

Note: it is probably more important to support offsets in the string than full timezone IDs. Offsets are part of the toString output of OffsetDateTime in Java and of the ISO 8601 format (which uses 'Z' for UTC). The ISO 8601 format is common enough that supporting it should be the priority for this feature.
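To make the offset case concrete, the following Java sketch shows the string shape that `OffsetDateTime.toString` produces and how such strings, including the 'Z' form, map to UTC instants. It uses only standard `java.time` calls; the class `OffsetSuffixSketch` and the helper `parseIsoOffset` are hypothetical names for illustration, not plugin code.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

class OffsetSuffixSketch {
    // Hypothetical helper: parses an ISO 8601 date-time that carries an
    // explicit offset ("Z" or "+hh:mm" / "-hh:mm") into a UTC instant.
    static Instant parseIsoOffset(String input) {
        return OffsetDateTime.parse(input.trim()).toInstant();
    }

    public static void main(String[] args) {
        // toString output includes the offset, so round-tripped data
        // commonly arrives in this shape.
        OffsetDateTime odt =
            OffsetDateTime.of(2023, 1, 15, 12, 30, 45, 0, ZoneOffset.ofHours(-8));
        System.out.println(odt); // 2023-01-15T12:30:45-08:00

        // 'Z' denotes a zero offset (UTC).
        System.out.println(parseIsoOffset("2023-01-15T12:30:45Z")); // 2023-01-15T12:30:45Z
        // A positive offset is subtracted to reach UTC.
        System.out.println(parseIsoOffset("2023-01-15T12:30:45+05:30")); // 2023-01-15T07:00:45Z
    }
}
```

Unlike named zone IDs, a fixed offset needs no transition table: the adjustment is a single subtraction, which is one reason it may be the simpler case to support first.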

res-life commented 7 months ago

I'm not planning to work on this for release 24.04. @sameerz, can we move it to the next release?