ffernandez92 closed this issue 2 months ago.
The Data Contract Specification has its own (logical) type system: https://datacontract.com/#data-types
If a type cannot be mapped, you can use the config object to specify a physical type in addition to the logical type. Have a look at https://datacontract.com/#config-object, which also specifies a bigqueryType field. The importer and exporter should use it. It could be that this is not yet implemented in the CLI.
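For example, a field could carry the physical type alongside the logical one, roughly like this (a minimal sketch; the model and field names are made up):

```yaml
models:
  my_table:
    fields:
      event_time:
        type: timestamp            # logical type from the spec
        config:
          bigqueryType: DATETIME   # physical type for the BigQuery importer/exporter
```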
Oh, thanks! It seems to be sort of implemented here:
```python
def convert_type_to_bigquery(field: Field) -> None | str:
    """Convert from supported datacontract types to equivalent bigquery types"""
    field_type = field.type
    if not field_type:
        return None

    # If a BigQuery type is provided via the field's config, prefer it over the default mapping
    if bigquery_type := get_type_config(field, "bigqueryType"):
        return bigquery_type

    field_type = field_type.lower()
    return map_type_to_bigquery(field_type, field.title)
```
But it doesn't seem to work correctly. I tried testing the config bit:

```yaml
models:
  app:
    description: App dimensions
    type: table
    fields:
      dt:
        name: test
        type: string
        description: some test.
        config:
          bigqueryType: datetime
```
But the result from the export is:

```json
{
  "name": "test",
  "type": "STRING",
  "mode": "NULLABLE",
  "description": "some test.",
  "maxLength": null
}
```
The culprit is the datacontract/export/bigquery_converter.py file: its mapping does not take the config into account. Perhaps the logic could even be consolidated there. Do you have time to provide a fix yourself with a PR, @ffernandez92?
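Roughly, the consolidation I have in mind: wherever bigquery_converter.py builds the per-field schema dict, it should route the type through a config-aware helper such as the convert_type_to_bigquery function quoted above instead of calling map_type_to_bigquery directly. This is only a sketch; the function name and the Field attributes used here are assumptions, not the current exporter code.

```python
def field_to_bigquery_schema(name: str, field: Field) -> dict:
    """Sketch only: build one BigQuery schema field, honouring config.bigqueryType."""
    return {
        "name": name,
        "type": convert_type_to_bigquery(field),  # config.bigqueryType wins over the default mapping
        "mode": "REQUIRED" if field.required else "NULLABLE",  # field.required is assumed to exist
        "description": field.description or "",
    }
```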
Sure, no problem! I'm not very familiar with the codebase, so it might take me a bit longer than someone else, but I'm happy to help and collaborate further! :)
I've created this: https://github.com/datacontract/datacontract-cli/pull/422. I'd like to test locally first, though. I haven't seen a testing guide in the repo. Any tips here?
Did you run the tests locally already, to prevent regressions? This is described in the README. Regarding tests, have a look at the existing ones. In your case, you could adapt an existing exporter test (see tests/test_export_bigquery.py). We normally test by taking a datacontract.yaml as input together with the expected output (e.g., BigQuery JSON), performing the export, and checking whether the generated BigQuery JSON matches the expectation. Does this already help?
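A minimal sketch of such a test, assuming the Typer app is exposed as datacontract.cli.app and using hypothetical fixture paths (adjust to the repo layout):

```python
import json

from typer.testing import CliRunner

from datacontract.cli import app  # assumed location of the Typer app

runner = CliRunner()


def test_export_bigquery_respects_bigquery_type_config():
    # Export a fixture contract to BigQuery JSON via the CLI ...
    result = runner.invoke(
        app,
        ["export", "./fixtures/bigquery/export/datacontract.yaml", "--format", "bigquery"],
    )
    assert result.exit_code == 0

    # ... and compare it against the expected schema stored next to the fixture.
    with open("./fixtures/bigquery/export/expected_bigquery_schema.json") as f:
        expected = json.load(f)
    assert json.loads(result.stdout) == expected
```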
Thanks, that helped! While I was testing different data types, I noticed another issue in the BigQuery exporter: whenever there is an array of a primitive type (e.g., string), it generates an incorrect schema. This is a sample of the contract:

```yaml
models:
  app: # matches with bigquery table name
    description: App dimensions
    type: table
    fields:
      # (..others..)
      test_array:
        type: array
        items:
          type: string
        description: none.
```
The resulting schema is:

```json
{
  "name": "test_array",
  "type": "RECORD",
  "mode": "REPEATED",
  "description": "none.",
  "fields": [
    {
      "name": "test_array_1",
      "type": "STRING",
      "mode": "NULLABLE",
      "description": "",
      "maxLength": null
    }
  ]
}
```
This is incorrect: with this schema, BigQuery treats the field as a repeated (array of) RECORD with a single nested element instead of a repeated string. The correct schema should look like this:

```json
{
  "name": "test_array",
  "type": "STRING",
  "mode": "REPEATED",
  "description": "none."
}
```

Could you confirm that this is actually an issue and not me doing something wrong? If that's the case, I'll try to take a look and investigate whether I can fix that too.
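Roughly, the distinction I'd expect the exporter to make (helper and attribute names in this sketch are assumptions, not the current code):

```python
def convert_array_field(name: str, field: Field) -> dict:
    """Sketch only: arrays of primitives repeat the primitive, arrays of objects become a RECORD."""
    item = field.items  # assumed: the Field model exposes the array item spec here
    schema = {
        "name": name,
        "mode": "REPEATED",
        "description": field.description or "",
    }
    if item is not None and item.type in ("object", "record", "struct"):
        schema["type"] = "RECORD"
        schema["fields"] = [
            convert_field(n, f) for n, f in item.fields.items()  # hypothetical nested conversion
        ]
    else:
        # Primitive item type, e.g. string -> STRING, repeated directly.
        schema["type"] = map_type_to_bigquery(item.type.lower(), name) if item else "STRING"
    return schema
```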
@simonharrer I think I'll create a separate issue to address the arrays. For now, #422 works and is passing the tests: the user now has the option to pick a custom BigQuery data type.
I've noticed some inconsistencies in the data types used by the BigQuery converter: BigQuery has additional data types that the converter does not account for.
Initially, I considered adding extra if statements to the map_type_to_bigquery function in datacontract/export/bigquery_converter.py, but that would introduce inconsistencies for other contract types. As an example, DATETIME is missing from that mapping (it is just meant to illustrate the point): in the importer's mapping method, both json and datetime exist, yet both TIMESTAMP and DATETIME are mapped to the internal timestamp type of the datacontract data types.
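To illustrate the round trip (simplified, not the actual repo code):

```python
# Both physical types collapse into the logical "timestamp" on import ...
def import_type(bq_type: str) -> str:
    return {"TIMESTAMP": "timestamp", "DATETIME": "timestamp"}.get(bq_type, "string")


# ... so the exporter can only ever emit TIMESTAMP again.
def export_type(logical_type: str) -> str:
    return {"timestamp": "TIMESTAMP", "string": "STRING"}.get(logical_type, "STRING")


# DATETIME -> timestamp -> TIMESTAMP: the original physical type is lost.
assert export_type(import_type("DATETIME")) == "TIMESTAMP"
```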
Do you have any suggestions on how to address this? I’m happy to collaborate and create a PR for this, but I’d like to avoid introducing inconsistencies with other contract types.
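One direction I could imagine, following the config-object pointer earlier in the thread: the importer could keep the original physical type in the field's config, so the exporter can restore it losslessly. The module path and the way Field is constructed below are assumptions, not the current importer code.

```python
from datacontract.model.data_contract_specification import Field  # assumed module path


def import_datetime_field(description: str | None = None) -> Field:
    # Map to the closest logical type, but remember the physical one.
    return Field(
        type="timestamp",                     # logical datacontract type
        description=description,
        config={"bigqueryType": "DATETIME"},  # preserved for a faithful export
    )
```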