datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License
276 stars 41 forks source link

Scope / Filter / Selection of a data model #67

Closed jochenchrist closed 1 month ago

jochenchrist commented 6 months ago

Describe, which data in included in a model.

model:
  orders_emea:
    fields: ...
    filter: 
      expression: country_code in ['DE', 'AT', ...]
ArtyomyuS commented 2 months ago

Open pull-request: https://github.com/datacontract/datacontract-specification/pull/91

simonharrer commented 2 months ago

@ArtyomyuS I'm wondering if this is not just a quality check. A quality check that ensures that country_code = 'DE' in the model of a data contract. One could mark this as a filter so it can be used for code generation in your case. This would make the addition of a filter element superfluous. What do you think?

ArtyomyuS commented 2 months ago

Hey @simonharrer , by considering quality check for this aspect I guess yes filter it is not required anymore. However in this particular case. If the checks apply to the model by creating the data product, then all checks should apply. Not sure all of them should be part of the code generation. Or should they all be? :)

Lets see by example:

models:
  orders:
    description: Orders made by a customer
    type: table
    fields:
      Timestamp:
        type: timestamp
        required: true
        description: Time event occurred.
        example: "1725264051666"
      Transaction_ID:
        type: string
        required: true
        description: Unique transaction identifier of the order
      Customer_ID:
        type: string
        required: true
        description: The customer who made the purchase 
        example: "23332"
quality:
  type: SodaCL 
  specification:
    checks for orders:
      - row_count >= 1:
          name: Ensure the data product is not empty.
      - freshness(Timestamp) <= 1h:
          name: Data in this dataset is within 1 hour old.
      - duplicate_count(Transaction_ID) = 0
      - failed(Customer_ID = '999') = 0:
          name: Ensure Customer_ID is  equal 999.

So I guess this should generate a SQL as:

SELECT Timestamp, Transaction_ID, Customer_ID
FROM orders
WHERE 
  Timestamp >= NOW() - INTERVAL '1 hour'
  AND Customer_ID = '999';

The question is how to distinguish between quality checks that could be used to generate SQL query and which one should not be used. Or am I asking something that is not relevant for the data contracts?

ArtyomyuS commented 2 months ago

Hey @jochenchrist and @simonharrer any updates on the above? Thanks 🙏

jochenchrist commented 2 months ago

In the upcoming release v1.1.0, we are also planning to support lineage: https://github.com/datacontract/datacontract-specification/tree/develop/v1?tab=readme-ov-file#lineage-object

A filter condition is also a lineage FILTER transformation:

        lineage:
          inputFields:
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: order_id
              transformations:
                - type: DIRECT
                  subtype: IDENTITY
                  description: The order ID from the checkout order
            - namespace: com.example.service.checkout
              name: checkout_db.orders
              field: region
                - type: INDIRECT
                  subtype: FILTER
                  description: region = 'US'
jochenchrist commented 2 months ago

I think quality and lineage are sufficient for the use case.

If you need an SQL condition for your use case or tooling, you can consider specifying a config option:

model:
  orders_emea:
    fields: ...
    config:
      sqlWhereCondition: country_code in ['DE', 'AT', ...]
ArtyomyuS commented 1 month ago

Nice one, thank you for the clarifications. The config with sql where condition should solve the problem. 👍

jochenchrist commented 1 month ago

Mark as wontfix, as we have identified better options.