databrickslabs / remorph

Cross-compiler and Data Reconciler into Databricks Lakehouse
Other
28 stars 16 forks source link

[BUG]: Snowflake SQL translations provide incorrect output for the ‘first_value’ analytic window function #782

Closed ajalisatgi closed 2 weeks ago

ajalisatgi commented 1 month ago

Is there an existing issue for this?

Category of Bug / Issue

TranspileParserError

Current Behavior

On Snowflake, the default window frame specification for rank-related functions such as first_value, last_value, nth_value is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (Snowflake Docs Reference)

However, on Databricks the first_value function is classified as an Analytic Window function and has a default window frame specification of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when an order by clause is present. (Stackoverflow reference)

Expected Behavior

The correct translation should include explicitly adding the window frame specification ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the PARTITION BY clause to match the default Snowflake behavior if a Window frame specification is missing in the original query.

Steps To Reproduce

Original Snowflake Query: SELECT tabb.col_a, tabb.col_b, First_value( CASE WHEN tabb.col_c IN (‘xyz’, ‘abc’) THEN tabb.col_d END) ignore nulls OVER (partition BY tabb.col_e ORDER BY tabb.col_f) AS derived_col_a FROM schema_a.table_a taba LEFT JOIN schema_b.table_b AS tabb ON taba.col_e = tabb.col_e

Translated Query: SELECT tabb.col_a, tabb.col_b, FIRST_VALUE(CASE WHEN tabb.col_c IN (‘xyz’, ‘abc’) THEN tabb.col_d END) IGNORE NULLS OVER (PARTITION BY tabb.col_e ORDER BY tabb.col_f NULLS LAST) AS derived_col_a FROM schema_a.table_a AS taba LEFT JOIN schema_b.table_b AS tabb ON taba.col_e = tabb.col_e

Expected Query to match Snowflake SQL output: SELECT tabb.col_a, tabb.col_b, FIRST_VALUE(CASE WHEN tabb.col_c IN (‘xyz’, ‘abc’) THEN tabb.col_d END) IGNORE NULLS OVER (PARTITION BY tabb.col_e ORDER BY tabb.col_f NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS derived_col_a FROM schema_a.table_a AS taba LEFT JOIN schema_b.table_b AS tabb ON taba.col_e = tabb.col_e

Relevant log output or Exception details

No response

Sample Query

SELECT    tabb.col_a,
          tabb.col_b,
          First_value(
          CASE
                    WHEN tabb.col_c IN (‘xyz’,
                                        ‘abc’) THEN tabb.col_d
          END) ignore nulls OVER (partition BY tabb.col_e ORDER BY tabb.col_f) AS derived_col_a
FROM      schema_a.table_a taba
LEFT JOIN schema_b.table_b AS tabb
ON        taba.col_e = tabb.col_e

Operating System

macOS

Version

v0.1.5