Open-Data-Product-Initiative / dev

Open Data Product Specification Development version
https://opendataproducts.org/dev/
Apache License 2.0

Describe Primary data origin #1

Open kyyberi opened 3 months ago

kyyberi commented 3 months ago

It is data genesis!

Which problem is this feature request solving?

Describe the solution you'd like

  1. A way to reliably identify and define the source, plus a mechanism to describe it
  2. The steps in between: where the data started and what path it took. Could the stakeholders be verified?
  3. What kinds of entities are involved: machine, human, or something else
  4. Where and by what exactly the data was created, with proof (authenticity)
  5. This is similar to observability, but it is about describing, among other things, the context of the actual sources behind the value chain (looking backward)

Any known practical use cases to apply?

Yes. It will provide business value. This request comes from practitioners. Details are not disclosed yet, to protect the frontrunner's business.

Can you submit a pull request?

No.

---- Leave intact! Approval of Contributor Agreement -----

By submitting issue you approve the Contributor Agreement, https://governance.opendataproducts.org/v1/contributions/contributor-agreement

kyyberi commented 3 months ago

Implemented the first version of this in the DataOps component. The origin element attempts to describe the sources of the data:


dataOps:
  data:
    schemaLocationURL: http://192.168.10.1/schemas/2016/petshopML-2.3/schema/petstore.xsd
    origin:
      - source: human # sensor, human, analytics
        sourceId: 
        type: raw # raw, cleansed
        description: 
        checksum: # ?
      - source: sensor # sensor, human, analytics
        sourceId: 
        type: cleansed # raw, cleansed
        description: 
        checksum: # ?
    lineage:
      dataLineageTool: Collibra
      dataLineageOutput: http://192.168.10.1/lineage.json

  infrastructure:
    platform: Azure
    region: West US 2 (Washington)
    storageTechnology: Azure SQL
    storageType: sql
    containerTool: helm

  build:
    format: yaml
    hashType: SHA-2
    checksum: 7b7444ab8f5832e9ae8f54834782af995d0a83b4a1d77a75833eda7e19b4c921
    signatureType: JWK
    scriptURL: http://192.168.10.1/rundatapipeline.yml
    deploymentDocumentationURL: http://192.168.10.1/datapipeline
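
The `checksum` field of an origin entry is still marked as open (`# ?`) above. As a rough sketch of how a consumer might use it, the following assumes the checksum is a hex-encoded SHA-256 digest of the raw payload delivered by the source. That assumption is not stated in this issue; the helper names and the `sensor-001` identifier are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(payload: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def origin_checksum_matches(origin: dict, payload: bytes) -> bool:
    """Check a payload against the checksum recorded in an origin entry.

    Assumption: `origin["checksum"]` holds a hex SHA-256 digest.
    An empty or missing checksum is treated as "cannot verify".
    """
    expected = origin.get("checksum")
    if not expected:
        return False
    # Constant-time comparison; both sides are ASCII hex strings.
    return hmac.compare_digest(sha256_hex(payload), expected.lower())


# Hypothetical origin entry mirroring the YAML structure above.
payload = b"temperature=21.5;unit=C"
origin = {
    "source": "sensor",
    "sourceId": "sensor-001",  # hypothetical identifier
    "type": "raw",
    "description": "",
    "checksum": sha256_hex(payload),
}

print(origin_checksum_matches(origin, payload))
print(origin_checksum_matches(origin, payload + b"tampered"))
```

A checksum like this only shows that the payload is unchanged since the checksum was recorded; it does not by itself prove which system produced the data.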

kyyberi commented 3 months ago

Can we somehow add an "as code" part here as well? A method to verify the source system and the authenticity of the data directly?

Something similar to what is in data quality:

dataQuality:
  - dimension: accuracy
    objective: 98
    unit: percentage
    monitoring:
      type: SodaCL 
      spec:
        - require_unique(member_id) 
        - require_range(age_band, 18, 100)
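
One possible shape for such an "as code" check, purely as a sketch and not part of the specification: each source signs its payload with a key tied to its `sourceId`, and the consumer verifies the signature. The example below assumes a shared-secret HMAC-SHA256 scheme; the key registry, the secret, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical registry mapping sourceId to a shared secret.
# In practice the keys would come from a secrets manager, not source code.
SOURCE_KEYS = {
    "sensor-001": b"demo-secret-not-for-production",
}


def sign_payload(source_id: str, payload: bytes) -> str:
    """Produce a hex HMAC-SHA256 signature, as the source system would."""
    key = SOURCE_KEYS[source_id]
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_origin(source_id: str, payload: bytes, signature: str) -> bool:
    """Return True only if the signature matches the key of the claimed source."""
    key = SOURCE_KEYS.get(source_id)
    if key is None:
        return False  # unknown source: cannot be verified
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


payload = b"member_id=42;age_band=35"
signature = sign_payload("sensor-001", payload)

print(verify_origin("sensor-001", payload, signature))
print(verify_origin("sensor-001", payload + b"x", signature))
print(verify_origin("unknown-source", payload, signature))
```

A shared secret keeps the sketch short; asymmetric signatures would avoid giving consumers the signing key, which may fit the authenticity goal better.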

kyyberi commented 2 months ago

To be moved to the next version.