MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Proposal: Spec `Lineage.NodeID` #2949

Open wslulciuc opened 4 weeks ago

wslulciuc commented 4 weeks ago

Recently, we've seen various bugs reported for NodeID parsing issues:

A NodeID consists of multiple parts (i.e. metadata) delimited by a colon (:). A NodeID has a type (dataset, job, etc.) and the following parts:

<type>:<namespace>:<name>
or, <type>:<namespace>:<name>#<version>
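To make the format concrete, here is a minimal Python sketch of composing and parsing such an ID. It is only an illustration, not Marquez's actual NodeId implementation: the names ParsedNodeId, format_node_id, and naive_parse are hypothetical, and the parser assumes that no part contains a colon or a '#'.

```python
from typing import NamedTuple, Optional


class ParsedNodeId(NamedTuple):
    """Hypothetical container for the parts of a NodeID."""
    node_type: str
    namespace: str
    name: str
    version: Optional[str] = None


def format_node_id(node: ParsedNodeId) -> str:
    """Join the parts with ':' and append '#<version>' when a version is set."""
    value = f"{node.node_type}:{node.namespace}:{node.name}"
    return f"{value}#{node.version}" if node.version else value


def naive_parse(value: str) -> ParsedNodeId:
    """Split on ':' and require exactly three parts.

    Assumption: neither the namespace nor the name contains ':' or '#'.
    """
    body, _, version = value.partition("#")
    parts = body.split(":")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 colon-delimited parts, got {len(parts)}: {value!r}"
        )
    node_type, namespace, name = parts
    return ParsedNodeId(node_type, namespace, name, version or None)
```

With well-behaved inputs, naive_parse(format_node_id(node)) returns the original parts, which is the property the colon-delimited format relies on.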

We defined a NodeID in this way to encode metadata about the node type and to ensure globally unique IDs; the LineageAPI returns graph nodes with all metadata associated with the given node type. For example, below is the NodeID for dataset food_delivery:public.delivery_7_days:

dataset:food_delivery:public.delivery_7_days

where food_delivery is the namespace and public.delivery_7_days is the name of the dataset. A call to the LineageAPI will return the graph node:

{
  "id": "dataset:food_delivery:public.delivery_7_days",
  "type": "DATASET",
  "data": {
    "id": { "namespace": "food_delivery", "name": "public.delivery_7_days" },
    "type": "DB_TABLE",
    "name": "public.delivery_7_days",
    "physicalName": "public.delivery_7_days",
    "createdAt": "2024-10-24T19:27:05Z",
    "updatedAt": "2024-10-24T22:36:06Z",
    "namespace": "food_delivery",
    "sourceName": "food_delivery_db",
    "fields": [
      { "name": "order_id", "type": "INTEGER", "description": "The ID of the order." },
      { "name": "order_placed_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for when the order was placed." },
      { "name": "order_dispatched_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for dispatch." },
      { "name": "order_delivered_on", "type": "TIMESTAMP", "description": "ISO-8601 timestamp for delivery." },
      { "name": "customer_email", "type": "VARCHAR", "description": "Customer's email address." },
      { "name": "customer_address", "type": "VARCHAR", "description": "Customer's physical address." },
      { "name": "menu_id", "type": "INTEGER", "description": "ID of the related menu." },
      { "name": "restaurant_id", "type": "INTEGER", "description": "ID of the restaurant." },
      { "name": "restaurant_address", "type": "VARCHAR", "description": "Restaurant's address." },
      { "name": "menu_item_id", "type": "INTEGER", "description": "ID of the menu item." },
      { "name": "category_id", "type": "INTEGER", "description": "ID of the category." },
      { "name": "discount_id", "type": "INTEGER", "description": "ID of the discount." },
      { "name": "city_id", "type": "INTEGER", "description": "ID of the city." },
      { "name": "driver_id", "type": "INTEGER", "description": "ID of the driver." }
    ],
    "tags": [],
    "lastModifiedAt": null,
    "description": null,
    "lastLifecycleState": ""
  },
  "inEdges": [
    { "origin": "job:food_delivery:etl_delivery_7_days", "destination": "dataset:food_delivery:public.delivery_7_days" }
  ],
  "outEdges": [
    { "origin": "dataset:food_delivery:public.delivery_7_days", "destination": "job:food_delivery:delivery_times_7_days" }
  ]
}
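As a rough sketch of how a client might consume this node, the Python below reads the neighbors from inEdges and outEdges. The payload is a trimmed copy of the response above (the data block is omitted), and the variable names are illustrative only.

```python
import json

# Trimmed copy of the graph node above; the "data" block is omitted for brevity.
node_json = """
{
  "id": "dataset:food_delivery:public.delivery_7_days",
  "type": "DATASET",
  "inEdges": [
    {"origin": "job:food_delivery:etl_delivery_7_days",
     "destination": "dataset:food_delivery:public.delivery_7_days"}
  ],
  "outEdges": [
    {"origin": "dataset:food_delivery:public.delivery_7_days",
     "destination": "job:food_delivery:delivery_times_7_days"}
  ]
}
"""

node = json.loads(node_json)

# Upstream nodes are the origins of incoming edges;
# downstream nodes are the destinations of outgoing edges.
upstream = [edge["origin"] for edge in node["inEdges"]]
downstream = [edge["destination"] for edge in node["outEdges"]]

print("upstream:", upstream)
print("downstream:", downstream)
```

Note that every edge endpoint is itself a NodeID string, so any consumer that wants the namespace or name of a neighbor has to parse the ID.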

Error on NodeId.parse()

But what if the namespace contains a colon (:)? Our NodeId.parse() method errors (not fun!). For example, node parsing will error for the namespace:

trino://trino-integration-test:1337
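The ambiguity is easy to reproduce outside of Marquez. The Python sketch below does not mirror how NodeId.parse() is actually written; it only shows why any colon-based split breaks for this namespace. The dataset name public.delivery_7_days is borrowed from the earlier example and is an assumption here.

```python
namespace = "trino://trino-integration-test:1337"
name = "public.delivery_7_days"  # hypothetical dataset name for illustration

node_id = f"dataset:{namespace}:{name}"

# Splitting on every colon yields five parts instead of the expected three.
parts = node_id.split(":")
print(len(parts), parts)

# Limiting the split to three parts does not help either: the namespace is
# cut at its first colon and the remainder is mistaken for the name.
node_type, wrong_namespace, wrong_name = node_id.split(":", 2)
print(wrong_namespace)
print(wrong_name)
```

Without escaping or a different delimiter, a parser has no way to tell which colons belong to the namespace and which separate the parts.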

We need to move away from NodeIds with encoded metadata, which are no longer needed as we move towards a lightweight lineage graph response (just nodes and edges).

Use UUIDs as NodeIDs

Let's move to using UUIDs for NodeIDs when the lineage graph returns just nodes and edges and supports the following lineage graphs: