dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0
849 stars 160 forks source link

Support bigquery primary keys and foreign keys #1513

Open G2H opened 1 year ago

G2H commented 1 year ago

See https://cloud.google.com/blog/products/data-analytics/join-optimizations-with-bigquery-primary-and-foreign-keys

Ekrekr commented 2 months ago

Needs a one pager design doc

dallerup commented 2 months ago

Here is an alternative solution until dataform has added support for primary and foreign keys.

post_operations { ALTER TABLE ${self()} ADD PRIMARY KEY (field) NOT ENFORCED ; }

stevenleggdfe commented 2 months ago

I attempt to handle this in a much more complex (but in some respects more reliable) way using these JS functions. This is not a great approach in many respects however so would definitely benefit from support in Dataform itself:

/* Wait for a specified interval in the context of GoogleSQL procedural language (i.e. a SQL script, not an individual query) */
function wait(intervalString) {
    return `
    BEGIN
    DECLARE WAIT STRING;
    DECLARE DELAY_TIME DATETIME;
    SET WAIT = 'TRUE';
    SET DELAY_TIME = DATETIME_ADD(CURRENT_DATETIME, INTERVAL ${intervalString});
    WHILE WAIT = 'TRUE' DO
      IF (DELAY_TIME < CURRENT_DATETIME) THEN
         SET WAIT = 'FALSE';
      END IF;
    END WHILE;
    END;
    `;
}

/* Drops all current constraints on a table and sets the constraints specified in constraints. For use in a Dataform pre operation.
keyInForeignTable defaults to "id" if not specified. Composite keys may be specified for keyInThisTable and keyInForeignTable as a comma separated list.
e.g. 
post_operations {
    ${data_functions.setKeyConstraints(ctx, dataform, {
    primaryKey: "myidname",
    foreignKeys: [
        {keyInThisTable: "myotherkeyname", foreignTable: "myothertable", keyInForeignTable: "myotheridname"},
        {keyInThisTable: "anotherkeyname", foreignTable: "myotherothertable"},
        {keyInThisTable: "firstpartofcompositekey, secondpartofcompositekey", foreignTable: "myothertable", keyInForeignTable: "firstpartofcompositekeyinothertable, secondpartofcompositekeyinothertable"}
        ]
    })}
}
Using this function enables project level table definitions to avoid having to handle their own workarounds for the following issues:
1. Without putting the ALTER TABLE statements in a conditional IF TRUE THEN... END IF; block, although the script would execute without error, BigQuery query compilation unhelpfully returns an error on the line that attempts to add a primary key if a primary key already exists, ignoring the fact that a previous step in the script removes the primary key.
2. BigQuery has a 5 table updates per 10s limit which may be breached if too many constraint removal operations happen too quickly. The function builds in waits to avoid this.
*/
function setKeyConstraints(ctx, dataform, constraints) {
    /* Detect incorrectly formatted constraints objects */
    Object.keys(constraints).forEach(key => {
        if (!['primaryKey', 'foreignKeys'].includes(key)) {
            throw new Error(`Invalid parameter passed to setKeyConstraints() inside constraints object: ${key}. Valid top level parameters are primaryKey and foreignKeys.`);
        }
    });
    if (constraints.foreignKeys && constraints.foreignKeys.length > 0) {
        constraints.foreignKeys.forEach(foreignKey => Object.keys(foreignKey).forEach(foreignKeyParameter => {
            if (!['keyInThisTable', 'foreignTable', 'keyInForeignTable'].includes(foreignKeyParameter)) {
                throw new Error(`Invalid foreign key parameter passed to setKeyConstraints() inside constraints object: ${foreignKeyParameter}. Valid key level parameters are keyInThisTable, foreignTable and (optionally) keyInForeignTable.`);
            }
        }));
    }
    return `
    IF TRUE THEN
    /* Delete all key constraints on the table - even ones that are no longer included in dfeAnalyticsDataform() configuration */
      ALTER TABLE ${ctx.self()} DROP PRIMARY KEY IF EXISTS;
      FOR constraint_to_drop IN (
        SELECT
          SPLIT(constraint_name, ".")[1]
        FROM
          ${"`" + dataform.projectConfig.defaultDatabase + "." + dataform.projectConfig.defaultSchema + (dataform.projectConfig.schemaSuffix ? "_" + dataform.projectConfig.schemaSuffix : "" ) + ".INFORMATION_SCHEMA.TABLE_CONSTRAINTS`"}
        WHERE
          constraint_type = "FOREIGN KEY"
          AND table_name = "${ctx.name()}"
        )
      DO
        ALTER TABLE ${ctx.self()} DROP CONSTRAINT IF EXISTS constraint_to_drop;
        ${wait("2 SECOND")}
      END FOR;
    /* Set primary key */
      ${constraints.primaryKey ?
          `ALTER TABLE ${ctx.self()} ADD PRIMARY KEY(${constraints.primaryKey}) NOT ENFORCED;
          ${wait("2 SECOND")}`
          : ``}
    /* Set foreign key constraints */
      ${constraints.foreignKeys && constraints.foreignKeys.length > 0 ?
      `ALTER TABLE ${ctx.self()}
          ${constraints.foreignKeys.map(foreignKey =>
              `ADD CONSTRAINT ${foreignKey.keyInThisTable.replaceAll(",", "_").replaceAll(" ", "")}_relationship FOREIGN KEY(${foreignKey.keyInThisTable}) REFERENCES ${ctx.ref(foreignKey.foreignTable)}(${foreignKey.keyInForeignTable || "id"}) NOT ENFORCED`).join(',\n')
              };`
      : ``}
      END IF;
  `;
}

module.exports = {
    wait,
    setKeyConstraints
};