datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

Checking for Databricks ARRAY<STRING> #318

Open Alp-Edeka opened 1 month ago

Alp-Edeka commented 1 month ago

I am using Databricks and trying to test fields in my table that include arrays. My contract is as follows:

servers:
  test:
    type: dataframe
models:
  test_table:
    description: Test description.
    type: table
    fields:
      test_field:
        required: true
        description: Another description.
        type: array
        title: Test
        required: false
        example: '[''02'',''03'']'
        items:
          type: string
          description: Last description.

My table is defined as:

create or replace temporary view test_table as
select
  from_json(test_field,'ARRAY<STRING>') test_field
from another_test_table

Running datacontract-cli inside Databricks now produces the following output, and my contract test fails:

Type Mismatch, Expected Type: array; Actual Type: array<string>
Column,Event,Details
test_field,:icon-fail: Type Mismatch, Expected Type: array; Actual Type: array<string>

How can I check fields of this type inside Databricks?
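
The mismatch above reads as if the contract's bare `array` is compared against Spark's full type string `array<string>`. A minimal sketch of what building the expected type from `items` could look like; `expected_spark_type` and the `SIMPLE_TYPES` mapping are hypothetical names for illustration, not part of datacontract-cli:

```python
# Hypothetical helper, not part of datacontract-cli: render a contract field
# as a Spark-style type string so that nested item types are included.
SIMPLE_TYPES = {  # assumed mapping, for illustration only
    "string": "string",
    "integer": "int",
    "long": "bigint",
    "boolean": "boolean",
    "timestamp_tz": "timestamp",
}


def expected_spark_type(field: dict) -> str:
    """Return a lowercase Spark-style type string for a contract field dict."""
    field_type = field["type"]
    if field_type == "array":
        # Recurse into `items` so that array<...> carries its element type.
        return f"array<{expected_spark_type(field['items'])}>"
    # Fall back to the contract type name when no mapping is known.
    return SIMPLE_TYPES.get(field_type, field_type)


test_field = {"type": "array", "items": {"type": "string"}}
print(expected_spark_type(test_field))  # array<string>
```

With a rendering like this, the expected string for `test_field` would match the `array<string>` reported for the view.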

jochenchrist commented 1 month ago

Can you try this as a workaround:

servers:
  test:
    type: dataframe
models:
  test_table:
    description: Test description.
    type: table
    fields:
      test_field:
        required: true
        description: Another description.
        type: array
        title: Test
        required: false
        example: '[''02'',''03'']'
        items:
          type: string
          description: Last description.
        config:
          databricksType: array<string>
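
The intent of the workaround can be sketched as follows: an explicit `config.databricksType` takes precedence over whatever would be derived from `type`. This is only an illustration under that assumption; `resolve_databricks_type` and `types_match` are hypothetical names, not functions from datacontract-cli:

```python
# Hypothetical sketch, not the datacontract-cli implementation: prefer an
# explicit `config.databricksType` over any type derived from `type`.
def resolve_databricks_type(field: dict) -> str:
    config = field.get("config") or {}
    override = config.get("databricksType")
    if override:
        return override
    # No override given: fall back to the plain contract type.
    return field["type"]


def types_match(field: dict, actual_type: str) -> bool:
    # Compare case-insensitively, since Spark reports lowercase type strings.
    return resolve_databricks_type(field).lower() == actual_type.lower()


field = {
    "type": "array",
    "items": {"type": "string"},
    "config": {"databricksType": "array<string>"},
}
print(types_match(field, "array<string>"))  # True
```

Without the `config` entry, the same field would resolve to `array` and the comparison would fail, which is the behaviour reported in this issue.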

Alp-Edeka commented 1 month ago

Thank you for your response @jochenchrist. Unfortunately, I am still getting the same type mismatch. I am assuming that adding something here could fix my issue:

https://github.com/datacontract/datacontract-cli/blob/7def89252057a6055a6b154fdfd3f14419767a85/datacontract/export/sql_type_converter.py#L113

jochenchrist commented 1 month ago

OK, I need to dig deeper here. (https://github.com/datacontract/datacontract-cli/blob/7def89252057a6055a6b154fdfd3f14419767a85/datacontract/export/sql_type_converter.py#L114 should respect the config option.)

Just to make sure: Are you using the latest version of the CLI tool?

Alp-Edeka commented 1 month ago

@jochenchrist I am using version 0.10.7, not the latest one.

Alp-Edeka commented 1 month ago

Simpler data types also seem to have the same issue. For example, the data contract

servers:
  test:
    type: dataframe
models:
  test_model:
    description: Test description 1.
    type: table 
    fields:
      test_field:
        required: true
        description: Test description 2.
        type: timestamp_tz
        example: "2024-06-01T12:00:00.000Z"
        config:
          databricksType: timestamp

produces the output:

Column,Event,Details
test_field,:icon-fail: Type Mismatch, Expected Type: timestamp_tz; Actual Type: timestamp

jochenchrist commented 1 month ago

Just to be sure, could you try testing with the latest version, v0.10.9? (You might need to install with extras: pip install datacontract-cli[all] --upgrade)

Alp-Edeka commented 1 month ago

I now tried

%pip install datacontract-cli[all] --upgrade
dbutils.library.restartPython()

inside a notebook and ran the test again with the same outcome.
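
To rule out a stale install after the restart, the version that Python actually sees can be checked from the same notebook. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "datacontract-cli") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string found in the current environment, or None.
print(installed_version())
```

If this still reports 0.10.7 after the upgrade, the notebook is probably picking up a different environment than the one `%pip` installed into.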