datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

Introduce DATETIME and JSON in BigQuery converter #421

Closed ffernandez92 closed 2 months ago

ffernandez92 commented 2 months ago

I've noticed some inconsistencies in the data types used by the BigQuery converter. BigQuery has additional data types, such as DATETIME and JSON, that are not accounted for in the converter.

Initially, I considered adding extra if statements to the map_type_to_bigquery function in datacontract/export/bigquery_converter.py, but that would introduce inconsistencies for other types of contracts.

Example:

import logging

def map_type_to_bigquery(type_str: str, field_name: str) -> str:
    logger = logging.getLogger(__name__)
    if type_str.lower() in ["string", "varchar", "text"]:
        return "STRING"
    elif type_str == "bytes":
        return "BYTES"
    elif type_str.lower() in ["int", "integer"]:
        return "INTEGER"
    elif type_str.lower() in ["long", "bigint"]:
        return "INT64"
    elif type_str == "float":
        return "FLOAT64"
    elif type_str == "boolean":
        return "BOOL"
    elif type_str.lower() in ["timestamp", "timestamp_tz"]:
        return "TIMESTAMP"
    elif type_str == "date":
        return "DATE"
    elif type_str == "timestamp_ntz":
        return "TIME"
    elif type_str == "datetime":  # proposed addition, not in the current converter
        return "DATETIME"

(The DATETIME branch is not in the current code; it is only included here to illustrate the idea.)

In the importer, both JSON and DATETIME are handled. However, as you can see, both TIMESTAMP and DATETIME are mapped to timestamp internally in the data contract data types.

Importer method:

from datacontract.model.exceptions import DataContractException

def map_type_from_bigquery(bigquery_type_str: str):
    if bigquery_type_str == "STRING":
        return "string"
    elif bigquery_type_str == "BYTES":
        return "bytes"
    elif bigquery_type_str == "INTEGER":
        return "int"
    elif bigquery_type_str == "INT64":
        return "bigint"
    elif bigquery_type_str == "FLOAT":
        return "float"
    elif bigquery_type_str == "FLOAT64":
        return "double"
    elif bigquery_type_str == "BOOLEAN" or bigquery_type_str == "BOOL":
        return "boolean"
    elif bigquery_type_str == "TIMESTAMP":
        return "timestamp"
    elif bigquery_type_str == "DATE":
        return "date"
    elif bigquery_type_str == "TIME":
        return "timestamp_ntz"
    elif bigquery_type_str == "DATETIME":
        return "timestamp"
    elif bigquery_type_str == "NUMERIC":
        return "numeric"
    elif bigquery_type_str == "BIGNUMERIC":
        return "double"
    elif bigquery_type_str == "GEOGRAPHY":
        return "object"
    elif bigquery_type_str == "JSON":
        return "object"
    else:
        raise DataContractException(
            type="schema",
            result="failed",
            name="Map bigquery type to data contract type",
            reason=f"Unsupported type {bigquery_type_str} in bigquery json definition.",
            engine="datacontract",
        )

Do you have any suggestions on how to address this? I’m happy to collaborate and create a PR for this, but I’d like to avoid introducing inconsistencies with other contract types.

simonharrer commented 2 months ago

The Data Contract Specification has its own (logical) type system: https://datacontract.com/#data-types

If a type cannot be mapped, you can use the config object to specify a physical type in addition to the logical type. Have a look at https://datacontract.com/#config-object, which also specifies a bigqueryType field. The importer/exporter should use it. It could be that this is not yet implemented in the CLI.

ffernandez92 commented 2 months ago

Oh, thanks! It seems to be partly implemented here:

def convert_type_to_bigquery(field: Field) -> None | str:
    """Convert from supported datacontract types to equivalent bigquery types"""
    field_type = field.type
    if not field_type:
        return None

    # If a bigqueryType is provided via config, prefer it over the default mapping
    if bigquery_type := get_type_config(field, "bigqueryType"):
        return bigquery_type

    field_type = field_type.lower()
    return map_type_to_bigquery(field_type, field.title)

But it doesn't seem to work correctly. I tried testing the config option:

models:
  app: 
    description: App dimensions
    type: table
    fields:
      dt:
        name: test
        type: string
        description: some test.
        config:
          bigqueryType: datetime
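
(For reference, the export was presumably produced with the CLI's export command, e.g. datacontract export --format bigquery datacontract.yaml.)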

But the result from the export is:

{
        "name": "test",
        "type": "STRING",
        "mode": "NULLABLE",
        "description": "some test.",
        "maxLength": null
      }

simonharrer commented 2 months ago

The culprit is in the datacontract/export/bigquery_converter.py file: the mapping there does not take the config into account. Perhaps the type mapping could even be consolidated here. Do you have time to provide a fix yourself with a PR, @ffernandez92?
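
A rough sketch of that direction (the function name and the attributes used here are hypothetical; the actual structure of bigquery_converter.py may differ) would be to route each field through the config-aware conversion instead of mapping the logical type directly:

def map_field_to_bigquery(field: Field, field_name: str) -> dict:
    # convert_type_to_bigquery already prefers config["bigqueryType"] and only
    # falls back to map_type_to_bigquery for the default logical-to-physical mapping.
    bigquery_type = convert_type_to_bigquery(field)
    return {
        "name": field_name,
        "type": bigquery_type,
        "mode": "REQUIRED" if field.required else "NULLABLE",
        "description": field.description or "",
    }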

ffernandez92 commented 2 months ago

Sure, no problem! I'm not very familiar with the codebase, so it might take me a bit longer than someone else, but I'm happy to help and collaborate further! :)

ffernandez92 commented 2 months ago

I've created this: https://github.com/datacontract/datacontract-cli/pull/422. I'd like to test it locally first, though. I haven't seen a testing guide in the repo. Any tips?

simonharrer commented 2 months ago

Have you already run the tests locally to prevent regressions? This is described in the README. Regarding new tests, have a look at the existing ones. In your case, you could adapt an existing exporter test (see tests/test_export_bigquery.py). We normally test by taking a datacontract.yaml as input together with the expected output (e.g., BigQuery JSON), performing the export, and checking whether the generated BigQuery JSON matches the expected one. Does this help?
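
For illustration, a minimal test along those lines might look roughly like this (the fixture paths are hypothetical, and the exact Python API call is an assumption; the existing tests in tests/test_export_bigquery.py show the actual pattern):

import json
from datacontract.data_contract import DataContract

def test_export_bigquery_respects_bigquery_type_config():
    # Hypothetical fixture: a datacontract.yaml with config.bigqueryType set on a field
    data_contract = DataContract(data_contract_file="fixtures/bigquery/export/datacontract.yaml")
    result = data_contract.export(export_format="bigquery")
    # Hypothetical expected BigQuery JSON schema for that contract
    with open("fixtures/bigquery/export/expected_schema.json") as f:
        expected = json.load(f)
    assert json.loads(result) == expected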

ffernandez92 commented 2 months ago

Thanks, that helped! While testing different data types I noticed another issue in the BigQuery exporter: whenever there is an array of a primitive type (e.g., string), it generates an incorrect schema. This is a sample of the contract:

models:
  app: # matches with bigquery table name
    description: App dimensions
    type: table
    fields:
      (..others..)
      test_array:
        type: array
        items:
          type: string
        description: none.

The resulting schema is:

{
        "name": "test_array",
        "type": "RECORD",
        "mode": "REPEATED",
        "description": "none.",
        "fields": [
          {
            "name": "test_array_1",
            "type": "STRING",
            "mode": "NULLABLE",
            "description": "",
            "maxLength": null
          }
        ]
      }

This is incorrect because BigQuery interprets it as a repeated array of records with a single field. The correct schema should look like this:

{
        "mode": "REPEATED",
        "name": "test_array",
        "type": "STRING",
        "description": "none."
      },

Could you confirm that this is actually an issue and not me doing something wrong? If it is, I'll take a look and investigate whether I can fix that too.
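
One possible direction for a fix, sketched very roughly (the exporter's structure, the items attribute on Field, and the map_fields_to_bigquery helper are assumptions here), would be to emit a REPEATED column of the item type for primitive items and only use RECORD for object items:

def map_array_field_to_bigquery(field_name: str, field: Field) -> dict:
    # Arrays of primitives become a REPEATED column of the item type;
    # only arrays of objects become a REPEATED RECORD with sub-fields.
    item = field.items
    item_type = map_type_to_bigquery(item.type, field_name) if item and item.type else "STRING"
    if item_type == "RECORD":
        return {
            "name": field_name,
            "type": "RECORD",
            "mode": "REPEATED",
            "description": field.description or "",
            "fields": map_fields_to_bigquery(item.fields),  # hypothetical helper
        }
    return {
        "name": field_name,
        "type": item_type,
        "mode": "REPEATED",
        "description": field.description or "",
    }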

ffernandez92 commented 2 months ago

@simonharrer I think I'll create a separate issue to address the arrays. For now, #422 works and passes the tests. Users can now pick a custom BigQuery data type via the config.