databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Snapshotting silently fails on values greater than int max value when source dtype changes to bigint #759

Open · jelstongreen opened this issue 3 months ago

jelstongreen commented 3 months ago

Describe the bug

We noticed that for one table we snapshot, when the source data changes a column from int to bigint but the downstream snapshot table is not amended to match, the snapshot continuously re-inserts the old record on every run instead of erroring because the new value cannot be inserted.

Steps To Reproduce

1. Create a snapshot of a source model with a regular int column. The snapshot should use the check strategy and include that column in its check_cols list.
2. Change the source model column to bigint and set the row's value to something larger than the int max value.
3. Run `dbt snapshot` against the source table.
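For reference, a minimal check-strategy snapshot for step 1 might look like the sketch below; the schema, source, and column names are all hypothetical:

```sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols=['amount']
    )
}}

-- 'amount' starts out as INT in the source; changing it to BIGINT and
-- setting a value above 2147483647 triggers the silent failure
select id, amount from {{ source('my_source', 'orders') }}

{% endsnapshot %}
```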

Expected behavior

If the target dtype is int and the incoming value is a bigint, there should be an error rather than a silent failure.


henlue commented 3 weeks ago

I would have time to work on this.

It looks like the merge statement that is currently executed will always try to cast the data types. Even when the cast fails (for example, string to int), it inserts NULL, and there is no error from the Databricks side.
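For illustration, the underlying cast behavior can be seen directly in Spark SQL; this is a sketch assuming ANSI mode is disabled (`spark.sql.ansi.enabled = false`):

```sql
-- With ANSI mode disabled, an invalid cast silently yields NULL:
SELECT CAST('not a number' AS INT);  -- returns NULL, no error
-- With ANSI mode enabled, the same statement raises a runtime error instead.
```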

I would propose checking for schema changes in dbt itself, using the check_for_schema_changes macro from dbt-core. It is already used to detect schema changes in incremental models.
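A rough sketch of what that check could look like inside the snapshot materialization, assuming the macro is called with the staging and target relations the same way the incremental materialization does (the relation variable names here are assumptions):

```sql
{# sketch only: compare the staging relation against the existing snapshot
   table before building the merge statement #}
{% set schema_changes_dict = check_for_schema_changes(staging_table, target_relation) %}
{% if schema_changes_dict['schema_changed'] %}
    {% do exceptions.raise_compiler_error(
        'Schema change detected on snapshot ' ~ target_relation
    ) %}
{% endif %}
```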

@benc-db should this check just happen, or would it make sense to make it configurable? Similar to incremental models, there could be an on_schema_change config option. It could default to "ignore" to keep the current behavior, and check_for_schema_changes would only be executed when it is set to "fail".
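As a usage sketch, the proposed option (which does not exist today) could then be set per snapshot, mirroring the incremental config:

```sql
{{
    config(
        strategy='check',
        check_cols=['amount'],
        on_schema_change='fail'
    )
}}
```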

benc-db commented 3 weeks ago

@mikealfare Would love to hear your thoughts on @henlue's comment.

mikealfare commented 3 weeks ago

Without thinking about it too much, adding on_schema_change to snapshots sounds like a good idea. We're in the process of updating snapshots to add more features in general. I would post some comments on that issue; it collects all of the snapshot work being done. I don't know whether this particular feature is included, but there are certainly schema-related items on there.