datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License

Lineage #90

Open jochenchrist opened 2 months ago

jochenchrist commented 2 months ago

Support for field-level upstream lineage could be aligned with the OpenLineage column lineage facet format for "inputFields" (https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/).

First draft, open for discussion:

fields:
  order_id:
    type: string
    lineage: # new property
      inputFields:
        - namespace: food_delivery
          name: public.delivery_7_days
          field: order_id
          transformations:
            - type: DIRECT
              subtype: IDENTITY
              description: ""
              masking: false
        - namespace: food_delivery
          name: public.delivery_7_days
          field: order_placed_on
          transformations:
            - type: INDIRECT
              subtype: SORT
              description: ""
              masking: false
        - namespace: food_delivery
          name: public.delivery_7_days
          field: order_delivered_on
          transformations:
            - type: INDIRECT
              subtype: SORT
              description: ""
              masking: false
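
To make the draft easier to discuss, here is a minimal sketch of how a tool might check one `lineage` block once the YAML has been parsed into plain dictionaries. The function name `validate_lineage` and the `ALLOWED_SUBTYPES` table are hypothetical and not part of the specification. The subtype lists reflect my reading of the OpenLineage column lineage facet and should be verified against it.

```python
# Allowed subtypes per transformation type. This table is an assumption based on
# the OpenLineage column lineage facet; verify it against the spec before use.
ALLOWED_SUBTYPES = {
    "DIRECT": {"IDENTITY", "TRANSFORMATION", "AGGREGATION"},
    "INDIRECT": {"JOIN", "GROUP_BY", "FILTER", "SORT", "WINDOW", "CONDITIONAL"},
}


def validate_lineage(field_name, lineage):
    """Return a list of human-readable problems found in one `lineage` block."""
    problems = []
    for i, input_field in enumerate(lineage.get("inputFields", [])):
        where = f"{field_name}.lineage.inputFields[{i}]"
        # Every input field needs to identify its upstream dataset and column.
        for key in ("namespace", "name", "field"):
            if not input_field.get(key):
                problems.append(f"{where}: missing '{key}'")
        for transformation in input_field.get("transformations", []):
            t_type = transformation.get("type")
            subtype = transformation.get("subtype")
            if t_type not in ALLOWED_SUBTYPES:
                problems.append(f"{where}: unknown type {t_type!r}")
            elif subtype is not None and subtype not in ALLOWED_SUBTYPES[t_type]:
                problems.append(f"{where}: subtype {subtype!r} not valid for {t_type}")
    return problems


# The order_id lineage from the draft above, as it would look after parsing.
order_id_lineage = {
    "inputFields": [
        {
            "namespace": "food_delivery",
            "name": "public.delivery_7_days",
            "field": "order_id",
            "transformations": [
                {"type": "DIRECT", "subtype": "IDENTITY", "description": "", "masking": False}
            ],
        },
        {
            "namespace": "food_delivery",
            "name": "public.delivery_7_days",
            "field": "order_placed_on",
            "transformations": [
                {"type": "INDIRECT", "subtype": "SORT", "description": "", "masking": False}
            ],
        },
    ]
}

print(validate_lineage("order_id", order_id_lineage))
```

A block that pairs `DIRECT` with `SORT`, for example, would be reported as a problem, since `SORT` is listed as an indirect subtype in the table above.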

jochenchrist commented 1 month ago

Another example:

models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table
    fields:
      order_id:
        $ref: '#/definitions/order_id'
        required: true
        unique: true
        primary: true
        lineage:
          inputFields:
          - namespace: kafka
            name: com.example.checkout.order.v1 # kafka topic name
            field: orderId
            transformations:
              - type: DIRECT
                subtype: IDENTITY
          - namespace: kafka
            name: com.example.checkout.order.v1 # kafka topic name
            field: status
            transformations:
              - type: INDIRECT
                subtype: FILTER
                description: filtered on status = 'successful'
      order_timestamp:
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
        type: timestamp
        required: true
        example: "2024-09-09T08:30:00Z"
        lineage:
          inputFields:
          - namespace: kafka
            name: com.example.checkout.order.v1 # kafka topic name
            field: orderPaymentSuccessful
            transformations:
              - type: DIRECT
                subtype: IDENTITY
      order_total:
        description: Total amount in the smallest monetary unit (e.g., cents).
        type: long
        required: true
        example: "9999"
        lineage:
          inputFields:
          - namespace: kafka
            name: com.example.checkout.order.v1 # kafka topic name
            field: lineitems[].price
            transformations:
              - type: DIRECT
                subtype: AGGREGATION
                description: aggregated all line item prices
      customer_id:
        description: Unique identifier for the customer.
        type: text
        minLength: 10
        maxLength: 20
      customer_email_address:
        description: The email address, as entered by the customer.
        type: text
        format: email
        required: true
        pii: true
        classification: sensitive
        lineage:
          inputFields:
          - namespace: kafka
            name: com.example.checkout.order.v1 # kafka topic name
            field: emailAddress
            transformations:
              - type: DIRECT
                subtype: TRANSFORMATION
                masking: true
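
One thing field-level lineage would enable is impact analysis: given an upstream field, which contract fields depend on it? The following is a rough sketch under the assumption that the model has already been parsed into dictionaries with the structure shown above. The helper name `downstream_impact` is hypothetical, and the model is trimmed to three fields for brevity.

```python
from collections import defaultdict


def downstream_impact(model):
    """Map each upstream (namespace, name, field) to the contract fields derived from it."""
    impact = defaultdict(list)
    for field_name, field in model.get("fields", {}).items():
        # Fields without a lineage block simply contribute nothing.
        for source in field.get("lineage", {}).get("inputFields", []):
            key = (source["namespace"], source["name"], source["field"])
            impact[key].append(field_name)
    return dict(impact)


TOPIC = "com.example.checkout.order.v1"  # kafka topic name from the example

# Trimmed version of the `orders` model from the example above.
orders = {
    "fields": {
        "order_id": {
            "lineage": {
                "inputFields": [
                    {"namespace": "kafka", "name": TOPIC, "field": "orderId",
                     "transformations": [{"type": "DIRECT", "subtype": "IDENTITY"}]},
                    {"namespace": "kafka", "name": TOPIC, "field": "status",
                     "transformations": [{"type": "INDIRECT", "subtype": "FILTER"}]},
                ]
            }
        },
        "customer_id": {"type": "text"},  # no lineage declared
        "customer_email_address": {
            "lineage": {
                "inputFields": [
                    {"namespace": "kafka", "name": TOPIC, "field": "emailAddress",
                     "transformations": [{"type": "DIRECT", "subtype": "TRANSFORMATION",
                                          "masking": True}]},
                ]
            }
        },
    }
}

impact = downstream_impact(orders)
for upstream, fields in impact.items():
    print(upstream, "->", fields)
```

Keying on the full (namespace, name, field) triple keeps fields with the same name in different datasets apart, which seems to match how the draft identifies upstream inputs.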