dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0
854 stars 166 forks source link

Bug: `dependOnDependencyAssertions` adding the wrong dependency #1787

Closed federicojasson closed 4 months ago

federicojasson commented 4 months ago

This is bug report when using dependOnDependencyAssertions.

The wrong dependency is being set on certain scenarios, which causes a circular dependency chain:

Compilation errors:
definitions/table_b.sqlx: Error: Circular dependency detected in chain: [{"database":"project","name":"table_b","schema":"my_dataset"} > {"database":"project","name":"custom_assertions","schema":"my_dataset"} > project.my_dataset.table_b]

Reproduction steps

Create 3 definition files:

Table A: definitions/table_a.sqlx

config {
  type: "table",
  assertions: {
    nonNull: ["id"]
  }
}

SELECT 1 AS id

Table B: definitions/table_b.sqlx

config {
  type: "table",
  dependOnDependencyAssertions: true,
}

SELECT *
FROM ${ref("table_a")}

Custom assertions: definitions/custom_assertions.sqlx

config {
  type: "assertion",
}

SELECT 1 AS id
FROM ${ref("table_a")}
UNION ALL
SELECT 1 AS id
FROM ${ref("table_b")}

Summary:

Expected behaviour

Actual behaviour

Compilation result

{
    "tables": [
        {
            "type": "table",
            "target": {
                "schema": "my_dataset",
                "name": "table_a",
                "database": "project"
            },
            "query": "\n\nSELECT 1 AS id\n",
            "disabled": false,
            "fileName": "definitions/table_a.sqlx",
            "canonicalTarget": {
                "schema": "my_dataset",
                "name": "table_a",
                "database": "project"
            },
            "enumType": "TABLE"
        },
        {
            "type": "table",
            "target": {
                "schema": "my_dataset",
                "name": "table_b",
                "database": "project"
            },
            "query": "\n\nSELECT *\nFROM `project.my_dataset.table_a`\n",
            "disabled": false,
            "fileName": "definitions/table_b.sqlx",
            "dependencyTargets": [
                {
                    "schema": "my_dataset",
                    "name": "table_a",
                    "database": "project"
                },
                {
                    "schema": "my_dataset",
                    "name": "custom_assertions",
                    "database": "project"
                },
                {
                    "schema": "my_dataset",
                    "name": "my_dataset_table_a_assertions_rowConditions",
                    "database": "project"
                }
            ],
            "canonicalTarget": {
                "schema": "my_dataset",
                "name": "table_b",
                "database": "project"
            },
            "enumType": "TABLE"
        }
    ],
    "assertions": [
        {
            "query": "\n\nSELECT 1 AS id\nFROM `project.my_dataset.table_a`\nUNION ALL\nSELECT 1 AS id\nFROM `project.my_dataset.table_b`\n",
            "fileName": "definitions/custom_assertions.sqlx",
            "target": {
                "schema": "my_dataset",
                "name": "custom_assertions",
                "database": "project"
            },
            "dependencyTargets": [
                {
                    "schema": "my_dataset",
                    "name": "table_a",
                    "database": "project"
                },
                {
                    "schema": "my_dataset",
                    "name": "table_b",
                    "database": "project"
                }
            ],
            "canonicalTarget": {
                "schema": "my_dataset",
                "name": "custom_assertions",
                "database": "project"
            }
        },
        {
            "query": "\nSELECT\n  'id IS NOT NULL' AS failing_row_condition,\n  *\nFROM `project.my_dataset.table_a`\nWHERE NOT (id IS NOT NULL)\n",
            "fileName": "definitions/table_a.sqlx",
            "target": {
                "schema": "my_dataset",
                "name": "my_dataset_table_a_assertions_rowConditions",
                "database": "project"
            },
            "dependencyTargets": [
                {
                    "schema": "my_dataset",
                    "name": "table_a",
                    "database": "project"
                }
            ],
            "canonicalTarget": {
                "schema": "my_dataset",
                "name": "my_dataset_table_a_assertions_rowConditions",
                "database": "project"
            },
            "parentAction": {
                "schema": "my_dataset",
                "name": "table_a",
                "database": "project"
            }
        }
    ],
    "projectConfig": {
        "warehouse": "bigquery",
        "defaultSchema": "my_dataset",
        "defaultDatabase": "project",
        "defaultLocation": "US"
    },
    "graphErrors": {
        "compilationErrors": [
            {
                "fileName": "definitions/table_b.sqlx",
                "message": "Circular dependency detected in chain: [{\"database\":\"project\",\"name\":\"table_b\",\"schema\":\"my_dataset\"} > {\"database\":\"project\",\"name\":\"custom_assertions\",\"schema\":\"my_dataset\"} > project.my_dataset.table_b]",
                "stack": "Error: Circular dependency detected in chain: [{\"database\":\"project\",\"name\":\"table_b\",\"schema\":\"my_dataset\"} > {\"database\":\"project\",\"name\":\"custom_assertions\",\"schema\":\"my_dataset\"} > project.my_dataset.table_b]\n    at /var/folders/pk/qptvps491_dglp97f6_gly4r0000gn/T/tmp-24853-HY40IL04CG9e/node_modules/@dataform/core/bundle.js:17:95855\n    at Array.forEach (<anonymous>)\n    at t.Session.checkCircularity (/var/folders/pk/qptvps491_dglp97f6_gly4r0000gn/T/tmp-24853-HY40IL04CG9e/node_modules/@dataform/core/bundle.js:17:95683)\n    at t.Session.compile (/var/folders/pk/qptvps491_dglp97f6_gly4r0000gn/T/tmp-24853-HY40IL04CG9e/node_modules/@dataform/core/bundle.js:17:91103)\n    at t.main (/var/folders/pk/qptvps491_dglp97f6_gly4r0000gn/T/tmp-24853-HY40IL04CG9e/node_modules/@dataform/core/bundle.js:17:60044)\n    at /var/folders/pk/qptvps491_dglp97f6_gly4r0000gn/T/tmp-24853-HY40IL04CG9e/index.js:1:96\n    at VM2 Wrapper.apply (/Users/federico/.npm/_npx/df91715d0a4b9eb0/node_modules/vm2/lib/bridge.js:485:11)\n    at NodeVM.run (/Users/federico/.npm/_npx/df91715d0a4b9eb0/node_modules/vm2/lib/nodevm.js:497:23)\n    at compile (/Users/federico/.npm/_npx/df91715d0a4b9eb0/node_modules/@dataform/cli/worker_bundle.js:19181:23)\n    at process.<anonymous> (/Users/federico/.npm/_npx/df91715d0a4b9eb0/node_modules/@dataform/cli/worker_bundle.js:19186:36)",
                "actionName": "project.my_dataset.table_b",
                "actionTarget": {
                    "schema": "my_dataset",
                    "name": "table_b",
                    "database": "project"
                }
            }
        ]
    },
    "dataformCoreVersion": "3.0.0",
    "targets": [
        {
            "schema": "my_dataset",
            "name": "custom_assertions",
            "database": "project"
        },
        {
            "schema": "my_dataset",
            "name": "table_a",
            "database": "project"
        },
        {
            "schema": "my_dataset",
            "name": "my_dataset_table_a_assertions_rowConditions",
            "database": "project"
        },
        {
            "schema": "my_dataset",
            "name": "table_b",
            "database": "project"
        }
    ]
}

Other attempts

I tried using ${ref({name: "table_a", includeDependentAssertions: true})} in definitions/table_b.sql, instead of dependOnDependencyAssertions: true. The same issue occurs.

Tuseeq1 commented 4 months ago

The flag dependOnDependencyAssertions is used to add all assertions dependent on a dependency. This includes custom assertions as well. Circular assertion dependencies just like normal circular dependencies should be handled by users. To avoid this issue, you can explicitly list the specific assertions you need in the dependency field. This gives you more control and helps prevent circular references."

ref: https://cloud.google.com/dataform/docs/assertions#all-dependency-assertions

federicojasson commented 4 months ago

The flag dependOnDependencyAssertions is used to add all assertions dependent on a dependency. This includes custom assertions as well. Circular assertion dependencies just like normal circular dependencies should be handled by users. To avoid this issue, you can explicitly list the specific assertions you need in the dependency field. This gives you more control and helps prevent circular references."

ref: https://cloud.google.com/dataform/docs/assertions#all-dependency-assertions

D'oh 🤦‍♂️ You're right, that custom assertion is also an assertion of table_b, so it would be added as dependency for table_a.

I wonder if an assertion should be added as dependency of a table when that table is being validated (i.e. is a dependency for the assertion). Intuitively speaking, custom_assertions is checking both table_a and table_b, so when dependOnDependencyAssertions is used in table_b, it could skip assertions that depend on table_b.

Anyway, I may be ignoring side effects of doing such a thing, so I'm closing this issue. Thanks