dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-2857] Model contracts: raise warning for numeric types without specified scale #8183

Closed: jtcohen6 closed this issue 11 months ago

jtcohen6 commented 1 year ago

slack thread & https://github.com/dbt-labs/dbt-core/issues/7824#issuecomment-1586217515

A few data platforms (Snowflake, Redshift) default to scale = 0 for their numeric/decimal data types when no scale is specified. (On BigQuery, NUMERIC carries a scale of 9 by default.)
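
A quick, hedged illustration of what those defaults mean in practice (results are what I'd expect; exact display varies by platform):

-- Snowflake: an unqualified numeric is number(38, 0), so the fraction is rounded away
select cast(2/3 as numeric) as two_thirds;   -- 1
-- BigQuery: NUMERIC carries scale 9, so the fractional digits survive
select cast(2/3 as numeric) as two_thirds;   -- 0.666666667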

Postgres' docs say:

The SQL standard requires a default scale of 0, i.e., coercion to integer precision. We find this a bit useless. If you're concerned about portability, always specify the precision and scale explicitly.

I agree!

Reproduction case

Thanks @gnilrets for the clear repro case! If we run this on Snowflake:

-- models/fct_test_contract.sql
select
   2/3 as two_thirds

Without contract enforcement, two_thirds is a number(7,6). However, if I enforce the contract and fail to specify a precision and scale:

models:
  - name: fct_test_contract
    config:
      contract:
        enforced: true

    columns:
      - name: two_thirds
        data_type: number

Then the build passes, but two_thirds is now a number(38,0), so I’ve lost all precision and two_thirds = 1!
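
For contrast, keeping the same config but giving the column an explicit precision and scale avoids the coercion (the exact numbers here are illustrative):

    columns:
      - name: two_thirds
        data_type: number(38,6)

With that in place, the contracted column keeps its fractional digits instead of being coerced to number(38,0).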

Acceptance Criteria

--- updated proposal ---

If the user has provided a numeric data type that should carry a precision/scale, but has specified only the bare type name, raise a warning.

graciegoheen commented 1 year ago

We should:

https://community.snowflake.com/s/article/To-Float-or-Not-to-Float-Choosing-Correct-Numeric-Data-Type-in-Snowflake

jtcohen6 commented 1 year ago

After poking around this some more yesterday:

create or replace table dbt_jcohen.my_tbl (id numeric not null) as (
    select 2/3 as id
);

This is equivalent to cast(2/3 as numeric), rather than 2/3::numeric.

I'm inclined to say that float is the way to go when you don't want to specify a precision and scale, understanding that there might be some performance benefits to fixed-point over floating-point numerics.

If we want to raise a warning for numeric types without precision/scale specified, we don't have a perfect mechanism for doing that right now (short of regex), but the is_numeric adapter method could be a starting point.
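
For illustration, that starting point might look something like this (the column name and type literal are placeholders; as noted further down the thread, integer aliases can also pass this check on some warehouses):

{%- set col_name = 'two_thirds' -%}
{%- set col = api.Column.create(col_name, 'numeric') -%}
{%- if col.is_numeric() -%}
    {{ log("numeric type without explicit precision/scale: " ~ col_name, info=true) }}
{%- endif -%}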

graciegoheen commented 1 year ago

We discussed this further in a community feedback session; one concern that was brought up is that floats aren't always reliable from a rounding perspective (especially for accounting data).

codigo-ergo-sum commented 1 year ago

Thanks @graciegoheen. Here are details, for example, from the BigQuery documentation about the dangers of using FLOAT (bolding mine):

Floating point values are approximations.

- The binary format used to represent floating point values can only represent a subset of the numbers between the most positive number and most negative number in the value range. This enables efficient handling of a much larger range than would be possible otherwise. **Numbers that are not exactly representable are approximated by utilizing a close value instead. For example, 0.1 cannot be represented as an integer scaled by a power of 2. When this value is displayed as a string, it is rounded to a limited number of digits, and the value approximating 0.1 might appear as "0.1", hiding the fact that the value is not precise. In other situations, the approximation can be visible**.
- Summation of floating point values might produce surprising results because of [limited precision](https://en.wikipedia.org/wiki/Floating-point_arithmetic#Accuracy_problems). For example, (1e30 + 1) - 1e30 = 0, while (1e30 - 1e30) + 1 = 1.0. This is because the floating point value does not have enough precision to represent (1e30 + 1), and the result is rounded to 1e30. This example also shows that the result of the SUM aggregate function of floating points values depends on the order in which the values are accumulated. In general, this order is not deterministic and therefore the result is not deterministic. Thus, **the resulting SUM of floating point values might not be deterministic and two executions of the same query on the same tables might produce different results**.
- If the above points are concerning, use a [decimal type](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#decimal_types) instead.

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#floating_point_types

And from Snowflake:

Floating point operations can have small rounding errors in the least significant digit(s). Rounding errors can occur in any type of floating-point processing, including trigonometric functions, statistical, and geospatial functions.

Errors can vary each time the query is executed.

Errors can be larger when operands have different precision or scale.

**Errors can accumulate, especially when aggregate functions (e.g. SUM() or AVG()) process large numbers of rows.** Casting to a fixed-point data type before aggregating can reduce or eliminate these errors.

Rounding errors can occur not only when working with SQL, but also when working with other code (e.g. Java, JavaScript, or Python) that runs inside Snowflake (e.g. in [UDFs](https://docs.snowflake.com/en/sql-reference/udf-overview.html) and [stored procedures](https://docs.snowflake.com/en/sql-reference/stored-procedures-overview.html)).

**When comparing two floating-point numbers, Snowflake recommends comparing for approximate equality rather than exact equality.**

https://docs.snowflake.com/en/sql-reference/data-types-numeric#rounding-errors
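
The ordering/accumulation issue both docs describe is easy to reproduce (hedged example; exact output depends on the platform and evaluation order):

select
    (1e30 + 1) - 1e30 as grouped_left,   -- 0: adding 1 to 1e30 rounds back to 1e30
    (1e30 - 1e30) + 1 as grouped_right   -- 1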

In my humble opinion, it's best to avoid floating point data types entirely unless the use case specifically calls for them (e.g. scientific measurements, engineering machinery, perhaps geographic lat/long kinds of representation).

gnilrets commented 1 year ago

Being involved with scientific and engineering measurements, and concerned about issues with precision truncation and overflow, I find it best to avoid decimal data types entirely unless the use case specifically calls for them (e.g., currency, counters).

I think the best dbt can do for us here is to enforce that any decimal types specify the precision/scale.
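
A quick way to see both failure modes I'm worried about (illustrative; exact rounding behavior and error text vary by platform):

select cast(1.23456789 as numeric(10,2)) as silently_rounded;   -- 1.23
select cast(123456789.0 as numeric(5,2)) as overflows;          -- error: value out of range for numeric(5,2)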

codigo-ergo-sum commented 1 year ago

@gnilrets - I get your point that there are cases where fixed-point datatypes would also be problematic. My experience, though, is that most folks who are doing scientific computing tend to understand the "gotchas" of datatypes and also the underlying issues with rounding that will occur in any computational representation of numbers versus their pure Platonic forms :).

However, I think the reverse is not true - most folks who are doing non-scientific analytics (and I think 90%+, maybe even 95%+ of analytics are non-scientific) don't know much of anything about numeric datatypes, rounding, etc. E.g. folks doing FP&A modeling or ad spend analysis.

In short, if you don't already know what a float is, and when you should and should not use it, you probably shouldn't be using it :). But this happens all the time - I see all kinds of data pipelines using floats all over the place when they're doing financial analysis. I've seen major data replication and ingest vendors, on multiple occasions, cast financial data incoming from source systems where the data had fixed-point representations into FLOAT in Snowflake and BigQuery. So that seems like the much more common error to avoid. And if those are all reasonable assumptions then we should not ever automatically exacerbate these issues by casting fixed point to floating point datatypes silently.

codigo-ergo-sum commented 1 year ago

@graciegoheen - I had another idea on this. What if, when someone enters a datatype in a data contract, you check INFORMATION_SCHEMA after the table has been created, and if the text of the returned datatype doesn't match what was specified in the contract, it throws a warning?

So, for example, if the data contract specifies NUMERIC but INFORMATION_SCHEMA in the database in question comes back with NUMERIC(38,0) you could throw a warning saying "Datatype specified in contract for column X does not match that created in database and datatype coercion/conversion may have taken place during model contract execution. It is advisable to check the actual created datatype in the database for this column". Of course you'd want the comparison to be case-insensitive.

That would keep you out of the business of trying to specifically parse out the details of particular datatypes on particular database platforms and instead you're just doing a direct lower-cased text string comparison with no specific logic in it which is a lot simpler.
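
A hedged sketch of that post-hoc check, using the repro case above and the standard INFORMATION_SCHEMA layout:

select
    column_name,
    data_type,
    numeric_precision,
    numeric_scale
from information_schema.columns
where lower(table_name) = 'fct_test_contract'
  and lower(column_name) = 'two_thirds'

If the lower-cased type text (plus the returned precision/scale, where present) differs from the lower-cased contract spec, raise the warning described above.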

jtcohen6 commented 12 months ago

This has been a great (and humbling) conversation :)

I think our goals here should be to:

To that end, I think we could accomplish this with a few tweaks to one macro: if the user has provided a numeric data type that should carry a precision/scale, but has specified only the bare type name, raise a warning.

I wish we could use the adapter API method Column.is_numeric for this, but it leads to too many false positives on Snowflake, because integer types are actually just aliases for numeric(38, 0) (docs)! I don't think we should raise a warning when users specify a "naked" integer, when the whole point is that the precision/scale need to be specified.

So, while this approach is pretty naive, it's not doing anything functional (just raising a warning), and should do what we want (help users) in the vast majority of cases.

{% macro default__get_empty_schema_sql(columns) %}
    {%- set col_err = [] -%}
    {%- set col_naked_numeric = [] -%}
    select
    {% for i in columns %}
      {%- set col = columns[i] -%}
      {%- if col['data_type'] is not defined -%}
        {%- do col_err.append(col['name']) -%}
      {#-- If this column's type is just 'numeric' missing precision/scale, raise a warning --#}
      {#- elif api.Column.create(col['name'], col['data_type'].strip()).is_numeric() -#}
      {%- elif col['data_type'].strip().lower() in ('numeric', 'decimal', 'number') -%}
        {%- do col_naked_numeric.append(col['name']) -%}
      {%- endif -%}
      {% set col_name = adapter.quote(col['name']) if col.get('quote') else col['name'] %}
      cast(null as {{ col['data_type'] }}) as {{ col_name }}{{ ", " if not loop.last }}
    {%- endfor -%}
    {%- if (col_err | length) > 0 -%}
      {{ exceptions.column_type_missing(column_names=col_err) }}
    {%- elif (col_naked_numeric | length) > 0 -%}
      {#-- TODO: This should be an actual identifiable warning / log event --#}
      {{ log("Detected columns with numeric type and unspecified precision/scale, this can lead to unintended rounding: " ~ col_naked_numeric, info = true) }}
    {%- endif -%}
{% endmacro %}

With the example above, if I express the type as numeric in the model's contract:

$ dbt run
...
08:05:20 1 of 1 START sql view model dbt_jcohen.fct_test_contract ....................... [RUN]
08:05:21 Detected columns with numeric type and unspecified precision/scale: ['two_thirds']
08:05:21 1 of 1 OK created sql view model dbt_jcohen.fct_test_contract .................. [SUCCESS 1 in 1.08s]
...

If I switch it to numeric(38,10):

$ dbt run
...
08:06:22 1 of 1 START sql view model dbt_jcohen.fct_test_contract ....................... [RUN]
08:06:23 1 of 1 OK created sql view model dbt_jcohen.fct_test_contract .................. [SUCCESS 1 in 1.20s]
...

MichelleArk commented 11 months ago

I wish we could use the adapter API method Column.is_numeric for this, but it leads to too many false positives on Snowflake, because integer types are actually just aliases for numeric(38, 0) (docs) !! I don't think we should raise a warning when users specify a "naked" integer, when the whole point is that the precision/scale need to be specified.

Welp, this part is probably the yuckiest bit to swallow. But I do appreciate the naive solution here.

{%- elif col['data_type'].strip().lower() in ('numeric', 'decimal', 'number') -%}

Maybe at the very least we could encapsulate this check in a global-level (in dbt-core) macro, so we have something consistent to use if we're going to lean on this 'best effort numeric check' at the global level.
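
Something like this, maybe (the macro name is hypothetical):

{% macro is_bare_numeric_type(data_type) %}
    {#-- 'best effort': flag numeric types given without an explicit precision/scale --#}
    {{ return(data_type is not none and data_type.strip().lower() in ('numeric', 'decimal', 'number')) }}
{% endmacro %}

default__get_empty_schema_sql (and any adapter overrides of it) could then call that macro instead of repeating the tuple inline.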

MichelleArk commented 11 months ago

dbt-bigquery overrides default__get_empty_schema_sql: https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L7