TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.82k stars 160 forks source link

Capitalization issues with table and column names in BigQuery #2598

Closed mazum21 closed 6 months ago

mazum21 commented 6 months ago

column names

column names which are defined with capital letters in the model sql are lower cased in the BigQuery table (e.g. OrderId -> orderid) -> expectation: preserve capitalization of column names

table names

table names with capital letters are not detected by the model selector

  1. example
    > venv ❯ sqlmesh render fct_SomeFactData
    Error: Cannot find model for 'fct_SomeFactData'
  2. example
    > sqlmesh plan --restate-model "some_schema.fct_SomeFactData" --start "2024-05-08"
    2024-05-10 12:27:41,261 - MainThread - sqlmesh.core.selector - WARNING - Expression 'some_schema.fct_somefactdata' doesn't match any models. (selector.py:209)
    2024-05-10 12:27:41,261 - MainThread - sqlmesh.core.selector - WARNING - Expression 'some_schema.fct_SomeFactData' doesn't match any models. (selector.py:171)
    Provided restated models do not match any models. No models will be included in plan. Provided: some_schema.Incremental_Model
    Error: The start and end dates can't be set for a production plan without restatements.
tobymao commented 6 months ago

sqlmesh follows the same rules as the engine. in bigquery's case, table names are only upper cased if they are quoted, so you'd need to define your model as

 model ( name `fct_SomeFactData`) 

otherwise it will be lower cased. additionally, bigquery column names are always case insensitive, so they should always be lower cased.

if you model name is indeed capitalized, then in the selector, you'll need to also quote it... sqlmesh render

"`fct_SomeFactData`"

can you show me your model definition?

mazum21 commented 6 months ago

I don't find any BigQuery documentation which tells that table names are only upper cased if they are quoted.

This is an interesting page: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical But there seems to be no rule that doesn't allow capital letters in unquoted identifiers. Maybe I'm missing something.

In any case I find it hard to navigate through the quotation jungle. I get mixed results and some commands cannot be executed, others can.

One simple example which should illustrates this and also the issue with the lower casing of column names:

Model:

MODEL (
  name `sqlmesh_test.stg_Table`,
  kind VIEW,
  start '2024-05-01',
  cron '@daily'
);

SELECT *
FROM `sample-project`.`sample_dataset`.`externalTable`

Create external models

-> works and table description in the schema.yml looks like this

- name: '`sample-project`.`sample_dataset`.`externalTable`'
  columns:
    TableId: INT64
    TableName: STRING

Render the model

-> works, but lower cases the columns, although they have capital letters in the name

> sqlmesh render "sqlmesh_test.stg_Table"
SELECT                                                                                                                                                                                                
  `externaltable`.`tableid` AS `tableid`,
  `externaltable`.`createdat` AS `createdat`                                                                                                                
FROM `dg-ub-snowplow-prod.sample_dataset.externalTable` AS `externaltable`     

Creating the model with sqlmesh plan

-> creates a table which preserves upper case characters in table name, but lower cases the column names.

Restate the model

-> results in a warning and nothing happens

> sqlmesh plan --restate-model "sqlmesh_test.stg_Table" 
2024-05-16 21:40:40,775 - MainThread - sqlmesh.core.selector - WARNING - Expression 'sqlmesh_test.stg_table' doesn't match any models. (selector.py:209)
2024-05-16 21:40:40,775 - MainThread - sqlmesh.core.selector - WARNING - Expression 'sqlmesh_test.stg_Table' doesn't match any models. (selector.py:171)
Provided restated models do not match any models. No models will be included in plan. Provided: sqlmesh_test.stg_Table
No differences when compared to `prod`
eakmanrq commented 6 months ago

Thanks @mazum21 The detailed feedback was really helpful.

So one thing that Toby mispoke about was the logic applied to BigQuery tables. They are case-sensitive by default which I believe is what you were saying. Docs: https://cloud.google.com/bigquery/docs/tables#table_naming

Table names are case-sensitive by default. mytable and MyTable can coexist in the same dataset, unless they are part of a dataset with case-sensitivity turned off.

The good news is that this is already understood by SQLGlot and we aren't seeing any bugs related to this.

Column though are case-insensitive regardless of quoting. Docs: https://cloud.google.com/bigquery/docs/schemas#column_names

For example, a column named Column1 is considered identical to a column named column1. To learn more about column naming rules, see [Column names](https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#column_names) in the GoogleSQL reference.

Reviewing your examples everything looks correct except the final restate-model . Fix for that is in this PR: https://github.com/TobikoData/sqlmesh/pull/2623

Let me know if that clears things up.