NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
822 stars · 235 forks

[FEA] Add support for `MonthsBetween` #11709

Closed · kuhushukla closed this issue 1 day ago

kuhushukla commented 2 weeks ago

Is your feature request related to a problem? Please describe.

cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.MonthsBetween

Describe the solution you'd like

Add a `GpuMonthsBetween` implementation for this expression.

Describe alternatives you've considered

None

Additional context

Seen in customer queries.

revans2 commented 2 weeks ago

Wow this is not simple.

> `months_between(timestamp1, timestamp2[, roundOff])` - If `timestamp1` is later than `timestamp2`, then the result is positive. If `timestamp1` and `timestamp2` are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless `roundOff=false`.

@kuhushukla do you know what time zones we need to support (at least beyond UTC)?

The Spark code is equally confusing. They convert the timestamps to a `LocalDateTime` in the desired time zone and pull out the date from it. Then they convert that to the number of days since the epoch (i.e., convert it to a date in the local time zone). After that they convert it back to a `LocalDate` and calculate the year, month, day of month, and number of days to the end of the month.
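For reference, the date-extraction step described above can be sketched in Python with the standard library. This is only an illustration of the idea, not the Spark implementation; the helper name `local_date_parts` is made up here.

```python
import calendar
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_date_parts(ts_utc: datetime, tz: str):
    """Shift a UTC timestamp into the target zone and pull out the
    local (year, month, day, days-to-end-of-month) tuple, mirroring
    the LocalDateTime -> LocalDate steps described above."""
    local = ts_utc.astimezone(ZoneInfo(tz))
    # monthrange returns (first weekday, number of days in the month)
    last_day = calendar.monthrange(local.year, local.month)[1]
    return local.year, local.month, local.day, last_day - local.day


# Example: 1997-02-28 10:30 UTC viewed in UTC is the last day of February
print(local_date_parts(
    datetime(1997, 2, 28, 10, 30, tzinfo=timezone.utc), "UTC"))
# → (1997, 2, 28, 0)
```

The "days to the end of the month" value is what lets the algorithm detect the "both are the last day of their month" case.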

If the days of the month are the same (e.g., both are the 15th), or both are the last day of their month, then the number of months between them is calculated exactly as a double value: `(year1 * 12 + month1) - (year2 * 12 + month2)`.

Otherwise they calculate the difference in seconds between the two timestamps (in a fairly complicated way where each day contributes 24 hours' worth of seconds and every month is treated as having 31 days), and the result is rounded to 8 digits.
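Putting the two branches together, here is a rough Python sketch of the semantics described above, assuming UTC-normalized inputs and ignoring sub-second parts (Spark truncates microseconds to whole seconds; this sketch drops them entirely). It is a reading of the documented behavior, not a port of the Spark code.

```python
import calendar
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

SECONDS_PER_DAY = 86400
DAYS_PER_MONTH = 31  # Spark assumes 31-day months for the fractional part


def months_between(ts1: datetime, ts2: datetime,
                   tz: str = "UTC", round_off: bool = True) -> float:
    d1 = ts1.astimezone(ZoneInfo(tz))
    d2 = ts2.astimezone(ZoneInfo(tz))
    month_diff = (d1.year - d2.year) * 12 + (d1.month - d2.month)

    last1 = calendar.monthrange(d1.year, d1.month)[1]
    last2 = calendar.monthrange(d2.year, d2.month)[1]
    # Exact branch: same day of month, or both are month-end days.
    if d1.day == d2.day or (d1.day == last1 and d2.day == last2):
        return float(month_diff)

    # Fractional branch: seconds within the local day (sub-seconds dropped).
    secs1 = d1.hour * 3600 + d1.minute * 60 + d1.second
    secs2 = d2.hour * 3600 + d2.minute * 60 + d2.second
    seconds_diff = (d1.day - d2.day) * SECONDS_PER_DAY + secs1 - secs2
    diff = month_diff + seconds_diff / (DAYS_PER_MONTH * SECONDS_PER_DAY)
    return round(diff, 8) if round_off else diff


# The example from the Spark SQL documentation:
print(months_between(
    datetime(1997, 2, 28, 10, 30, tzinfo=timezone.utc),
    datetime(1996, 10, 30, tzinfo=timezone.utc)))
# → 3.94959677
```

Note the subtlety: in the example, `1997-02-28` *is* the last day of February, but `1996-10-30` is not the last day of October, so the fractional branch is taken. A GPU implementation would need to reproduce both branches, including the time-zone-sensitive date extraction, which is presumably why this is far from simple.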

This is far from simple to do.