NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support testing beyond the range of Python datetime range #10040

Open NVnavkumar opened 10 months ago

NVnavkumar commented 10 months ago

Is your feature request related to a problem? Please describe.

#9996 allows us to test the full "valid" range of timestamps (0001-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999) in Spark. However, Spark can also handle several values outside that range, such as negative years and 6-digit years. We should allow this full range of inputs to Spark with both CPU and GPU support.
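A quick sketch of why Python alone cannot generate those out-of-range inputs: Python's `datetime` is hard-limited to years 1 through 9999, which coincides exactly with Spark's "valid" range, so negative years and 6-digit years cannot even be constructed:

```python
from datetime import datetime

# Python's datetime is limited to MINYEAR..MAXYEAR (1..9999),
# the same bounds as Spark's "valid" timestamp range.
print(datetime.min)  # 0001-01-01 00:00:00
print(datetime.max)  # 9999-12-31 23:59:59.999999

# Values Spark can still tolerate (negative years, 6-digit years)
# cannot be represented as Python datetime objects at all:
try:
    datetime(0, 1, 1)
except ValueError as e:
    print(e)  # e.g. "year 0 is out of range"
```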

NVnavkumar commented 10 months ago

Another aspect to consider: when you pass a Python datetime object with timezone information, it is converted to UTC before being sent to Spark. This can produce a `date value out of range` Python exception.

This also means the effective testing range for timestamps with a positive offset from UTC is restricted. Python datetime values can only start at 0001-01-01 00:00:00.000000, so a value of 0001-01-01 00:00:00.000000 in a local timezone ahead of UTC cannot actually be sent to Spark from Python: converting it to UTC would produce a year-0 timestamp, which is out of Python's range. That value is nonetheless still valid for ANSI purposes (the valid range is 0001-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999), and Spark accepts it.
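The restriction above can be demonstrated directly (a minimal sketch; UTC+2 is an arbitrary example offset, not one from the issue). Converting year-1 midnight in a zone ahead of UTC underflows Python's representable range, and the earliest local value that survives the conversion is `datetime.min` shifted forward by the UTC offset:

```python
from datetime import datetime, timedelta, timezone

utc_plus_2 = timezone(timedelta(hours=2))  # example zone ahead of UTC

# Year-1 midnight local time falls before 0001-01-01 in UTC, so
# normalizing it to UTC overflows Python's datetime range:
local_min = datetime(1, 1, 1, tzinfo=utc_plus_2)
try:
    local_min.astimezone(timezone.utc)
except OverflowError as e:
    print(e)  # "date value out of range"

# The earliest local value in that zone that converts cleanly is
# datetime.min shifted forward by the zone's UTC offset:
earliest = (datetime.min + timedelta(hours=2)).replace(tzinfo=utc_plus_2)
print(earliest.astimezone(timezone.utc))  # 0001-01-01 00:00:00+00:00
```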

NVnavkumar commented 5 months ago

I think the best course of action here is to write tests in Scala that cover multiple non-UTC timezones and dates in the invalid range (non-positive years and years > 9999), instead of trying to do this in Python, because it is difficult to change how Python handles these values in the PythonRunner on the executor.