databrickslabs / remorph

Cross-compiler and Data Reconciler into Databricks Lakehouse
Other
47 stars 30 forks source link

[FEATURE]: Cast to original data type when using derived tables sampling queries #1195

Open wajdi-db opened 1 week ago

wajdi-db commented 1 week ago

Is there an existing issue for this?

Category of feature request

Reconcile

Problem statement

Some databases such as Teradata use infer their data type minimally based on the literal value provided. This is problematic for the WITH recon as () query generated by the framework to sample data that is mismatched or missing.

Consider this statement that would be generated by the framework:

WITH recon AS (
    SELECT
        4.00 AS some_value -- Inferred as decimal(1,2) by TD
    FROM dual
    UNION
    SELECT
        1222.34 AS some_value --Throws numeric overflow exception as it doesn't fit decimal(1,2)
    FROM dual
)
select *
from recon;

Because the first value of the WITH recon statement is lower than the next ones in the UNION statement, resulting data type will be inferred as decimal(1,2) by Teradata engine, and the query will fail after the union since the next value doesn't fit in this data type. Teradata error thrown: [Teradata Database] [Error 2616] Numeric overflow occurred during computation. Oracle won't throw this exception because it infers data types in derived tables with unions differently.

Proposed Solution

Cast every column generated (only primary keys are generated this way) to their original data type, unless there is a user_transformation existing on the column, as this could possibly change the desired data type depending on the user transformation. Note that user transformations are pre-applied to the literal value generated. So we can't apply them here again.

Proposed fix: https://github.com/databrickslabs/remorph/commit/b5786ff665a6d5bbc2ece1d6d01ffb3764609e17

Additional Context

No response