Open ayushdg opened 3 years ago
Hi @ayushdg,
Thanks a bunch for trying this out and an easy example to reproduce the bug.
A quick explanation of the Query gives:
c.explain(query)
LogicalProject(EXPR$0=[CAST(/INT(Reinterpret(-($1, $0)), 86400000)):INTEGER])
LogicalJoin(condition=[true], joinType=[inner])
LogicalProject(CAST=[CAST($0):TIMESTAMP(0)])
LogicalTableScan(table=[[root, df]])
LogicalProject(CAST=[CAST($0):TIMESTAMP(0)])
LogicalTableScan(table=[[root, df2]])
Currently, As you mentioned, dask-sql missing Reinterpret
operator (which is somewhat similar to the cast Operation mentioned here)
For the testing purpose, I just tried using CastOperation
as a replacement for Reinterpret
. And /INT
(IntDivisionOperator
) operator was also missing, which was implemented in this PR by @nils-braun, that was yet to be merged. So Using these two Operator implementations, the timestampdiff
Operation was working fine for me.
I will try to understand more about the Reinterpret Operator and will try to raise a PR for this!
Cheers
I can take a look at this one
Hi @jdye64
Thanks a lot for looking into this issue,
I am working on this timestampdiff implementation
for the past few days, but have not been able to achieve the results completely. So looking for another pair of hands to help me sort this issue 😁
Please have a look, at some of the work done so far: https://github.com/dask-contrib/dask-sql/compare/main...rajagurunath:feature/timestampdiff
The problem:
The current implementation is able to achieve the timestampdiff for units like seconds, minute, hour, day but fails for the month, Quarter and year.
Query plan for the SQL (micro/seconds,min,hour,day ):- ✅
LogicalProject(
ms=[*(CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 1000)):INTEGER, 1000000)],
sec=[CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 1000)):INTEGER],
minn=[CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 60000)):INTEGER],
hr=[CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 3600000)):INTEGER],
dayy=[CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 86400000)):INTEGER]
)
LogicalTableScan(table=[[root, test]])
Query Plan for the SQL (month, year):❌
LogicalProject(
monthh=[CAST(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0)))):INTEGER],
yearr=[CAST(/INT(Reinterpret(-(CAST($0):TIMESTAMP(0), CAST($1):TIMESTAMP(0))), 12)):INTEGER]
)
LogicalTableScan(table=[[root, test]])
As per the plan, we are casting the two expression, ---> subtract -- > Reinterpret -- > divby the number given by calcite(60000,1000 etc)-->cast into int32
Since for the year and month, the dividend number was very low, Dask fails by not being able to cast the big floating-point number into int32.
I think I am missing something obvious while implementing /int and reinterpret, I think we need to make use of some methods from the java side as well to implement this functionality smoothly. (To get the UNITS - DAY, MINUTE, HOUR, from java side to python side? 🤔 ). are there some other methods available from the python Dask side to implement this feature?
Please let me know if you want me to raise a WIP PR for this feature to discuss further.
Thanks in Advance
cheers!
@rajagurunath I think it would be good if you could open a WIP PR. That would allow us to discuss a little more and also take advantage of CI for testing theories and changes.
Thanks a lot @jdye64, have created a PR, Please review and let me know your Suggestions/Feedback
I have been looking into this a bit more and updating #293 locally to work with the latest branch. On further inspection it looks like the main issue with month
and year
comes from calcite assuming a different time_interval
for the date subtraction vs the other time units.
Currently we handle the SqlDatetimeSubtractionOperator
the same as any regular subtraction operation but I think we might need a way to extract the time units for this operation as well from calcite based on the docs here.
I'll look into this further to see how to get that from calcite.
the
timestampdiff
operation currently fails to do missingreinterpret
implementation error.Here's a reproducer:
Error with: