datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License
278 stars 41 forks source link

Generic data model #13

Closed jochenchrist closed 10 months ago

jochenchrist commented 1 year ago

The bring-your-own-schema has some disadvantages:

The idea is to use a generic data model.

This could look like:

Option 1 ("dbt-flavored")

models:
  - name: orders
    type: table
    description: >
      One record per order. Includes cancelled and deleted orders.
    columns:
      - name: order_id
        $ref: https://demo.datamesh-manager.com/acme/definitions/checkout/orderId
      - name: order_timestamp
        type: timestamptz
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      - name: order_total
        type: integer
        description: "Total amount of the order in the smallest monetary unit (e.g., cents)."
        constraints:
          - type: not_null
  - name: line_items
    type: table
    description: >
      The items that are part of an order
    columns:
      - name: lines_item_id
        type: string
        description: Primary key of the lines_item_id table
        constraints:
          - type: primary_key
          - type: unique
      - name: order_id
        type: string
        description: Foreign key to the orders table
      - name: sku
        type: string
        description: The purchased article number

Option 2 (OpenAPI / JSON Schema flavored)

models:
  orders:
    description: One record per order. Includes cancelled and deleted orders.
    type: table   # alternatives: view?, object, graph ...
    fields:
      order_id:
        $ref: https://datamesh-manager.com/acme/definitions/checkout/orderId
      order_timestamp:
        type: string
        format: date-time
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      order_total:
        type: integer
        description: Total amount of the order in the smallest monetary unit (e.g., cents).
  line_items:
    type: table
    fields:
      lines_item_id:
        type: string
        description: Primary key of the lines_item_id table
        constraints:
          - type: primary_key
          - type: unique
      order_id:
        $ref: https://datamesh-manager.com/acme/definitions/checkout/orderId
      sku:
        type: string
        description: The purchased article number

Option 3 (ODCS-flavoured)

dataset:
  - table: orders
    description: One record per order. Includes cancelled and deleted orders.
    columns:
      - column: order_id
      - column: order_timestamp
        logicalType: timestamptz
        description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
      - column: order_total
        logicalType: integer
jochenchrist commented 1 year ago

The idea is that the technology-specific schema (dbt, bigquery, dataframe, json, avro...) can be imported and exported with the Data Contract CLI based on the generic model.

jochenchrist commented 1 year ago

Option 2 (OpenAPI / JSON Schema flavored) might have issues in defining a JSON-Schema to support model creation.

jochenchrist commented 1 year ago

The data model should also embrace streaming structures (messages over Kafka). So the name columns might be difficult. Alternatives: fields, properties, attributes.

jochenchrist commented 1 year ago

Option 2 it is to be in line with OpenAPI conventions.

Data models are different from JSON-Schemas, so we decided to adopt properties, where appropriate.

We opt for the fields name instead of properties or columns, as we think this is a good choice for tables, messages, objects, and documents.

simonharrer commented 10 months ago

Choice has been made. Closing this.