Closed jochenchrist closed 1 month ago
Open pull-request: https://github.com/datacontract/datacontract-specification/pull/91
@ArtyomyuS I'm wondering if this is not just a quality check. A quality check that ensures that country_code = 'DE'
in the model of a data contract. One could mark this as a filter so it can be used for code generation in your case. This would make the addition of a filter
element superfluous. What do you think?
Hey @simonharrer , by considering quality check for this aspect I guess yes filter it is not required anymore. However in this particular case. If the checks apply to the model by creating the data product, then all checks should apply. Not sure all of them should be part of the code generation. Or should they all be? :)
Lets see by example:
models:
orders:
description: Orders made by a customer
type: table
fields:
Timestamp:
type: timestamp
required: true
description: Time event occurred.
example: "1725264051666"
Transaction_ID:
type: string
required: true
description: Unique transaction identifier of the order
Customer_ID:
type: string
required: true
description: The customer who made the purchase
example: "23332"
quality:
type: SodaCL
specification:
checks for orders:
- row_count >= 1:
name: Ensure the data product is not empty.
- freshness(Timestamp) <= 1h:
name: Data in this dataset is within 1 hour old.
- duplicate_count(Transaction_ID) = 0
- failed(Customer_ID = '999') = 0:
name: Ensure Customer_ID is equal 999.
So I guess this should generate a SQL as:
SELECT Timestamp, Transaction_ID, Customer_ID
FROM orders
WHERE
Timestamp >= NOW() - INTERVAL '1 hour'
AND Customer_ID = '999';
The question is how to distinguish between quality checks that could be used to generate SQL query and which one should not be used. Or am I asking something that is not relevant for the data contracts?
Hey @jochenchrist and @simonharrer any updates on the above? Thanks 🙏
In the upcoming release v1.1.0, we are also planning to support lineage: https://github.com/datacontract/datacontract-specification/tree/develop/v1?tab=readme-ov-file#lineage-object
A filter condition is also a lineage FILTER transformation:
lineage:
inputFields:
- namespace: com.example.service.checkout
name: checkout_db.orders
field: order_id
transformations:
- type: DIRECT
subtype: IDENTITY
description: The order ID from the checkout order
- namespace: com.example.service.checkout
name: checkout_db.orders
field: region
- type: INDIRECT
subtype: FILTER
description: region = 'US'
I think quality and lineage are sufficient for the use case.
If you need an SQL condition for your use case or tooling, you can consider specifying a config option:
model:
orders_emea:
fields: ...
config:
sqlWhereCondition: country_code in ['DE', 'AT', ...]
Nice one, thank you for the clarifications. The config with sql where condition should solve the problem. 👍
Mark as wontfix, as we have identified better options.
Describe, which data in included in a model.