MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0
1.78k stars 320 forks source link

[UI] The graph depth features doesn't seem consistent #2893

Closed dkt-sophie-ly closed 1 month ago

dkt-sophie-ly commented 2 months ago

I'm not sure if it's a real issue or a misunderstandings about this feature but I feel like the change of depth on the lineage is quite inconsistant.

On this first example:

We focus on the dataset surrounded in yellow, now depth=2 so we can see 2 datasets above the one focused and 3 behind:

image

So if I click on dataset located in depth= -2 and i put the parameter depth=4 or more (6 here) I expect to get the same elements in the lineage as the fist image (dataset at depth=1 or 2) but this is not the case.

image

If the first lineage we have 3 datasets at depth=2 but on the second image at depth=6 I have only 2 dataset and we expect at least 3 dataset as in image 1).

So I don't know if it's the expected behaviour or not

Here on the second example:

We focus on the dataset surround in yellow (dataset2) and the depth = 2, we can see this very simple lineage with dataset 1 and 2.

image

Si if I click on dataset1 with depth=2 I expect to this the very same lineage, but it appears alone like if it's linked to nothing:

image

This seems more like a bug than the first case

Let me know what you think :)

phixMe commented 2 months ago

Hey @dkt-sophie-ly

Yes, this is indeed difficult to reason about. @julienledem and I have had a few in depth (no pun intended 😄) discussions about this and what we would like this depth parameters to work like. I see that you've noticed were not node centric and perhaps we should be to make sense of this. The depth parameter is the count of recursive calls that we do on the query posted below, so the UI is really proxying the query response and pretty dumb in that sense.

At this time it heavily anchored to the lineage query itself and also it's important to understand that lineage is job centric (not dataset centric) as we find the attached job to a dataset before executing.

I think the biggest source of the confusion which is a little opaque to me from your screenshots is that you are in the "Critical Path" display mode which removes nodes that are not directly connected to your selected node, but I think you can still be influenced in terms of depth nodes not on the critical path. That toggle switch is UI only and simply filters nodes so we still have the other nodes, just are not rendering them.

Lineage query

WITH RECURSIVE
  column_lineage_latest AS (
      SELECT DISTINCT ON (output_dataset_field_uuid, input_dataset_field_uuid) *
      FROM column_lineage
      WHERE created_at <= :createdAtUntil
      ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
  ),
  dataset_fields_view AS (
    SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid, d.namespace_uuid
    FROM dataset_fields df
    INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
  ),
  column_lineage_recursive AS (
    (
      SELECT
        *,
        0 as depth,
        false as is_cycle,
        ARRAY[ROW(output_dataset_field_uuid, input_dataset_field_uuid)] as path -- path and is_cycle mechanism as describe here https://www.postgresql.org/docs/current/queries-with.html (CYCLE clause not available in postgresql 12)
      FROM column_lineage_latest
      WHERE output_dataset_field_uuid IN (<datasetFieldUuids>)
    )
    UNION ALL
    SELECT
      adjacent_node.output_dataset_version_uuid,
      adjacent_node.output_dataset_field_uuid,
      adjacent_node.input_dataset_version_uuid,
      adjacent_node.input_dataset_field_uuid,
      adjacent_node.transformation_description,
      adjacent_node.transformation_type,
      adjacent_node.created_at,
      adjacent_node.updated_at,
      node.depth + 1 as depth,
      ROW(adjacent_node.input_dataset_field_uuid, adjacent_node.output_dataset_field_uuid) = ANY(path) as is_cycle,
      path || ROW(adjacent_node.input_dataset_field_uuid, adjacent_node.output_dataset_field_uuid) as path
    FROM column_lineage_latest adjacent_node, column_lineage_recursive node
    WHERE (
      (node.input_dataset_field_uuid = adjacent_node.output_dataset_field_uuid) --upstream lineage
      OR (:withDownstream AND adjacent_node.input_dataset_field_uuid = node.output_dataset_field_uuid) --optional downstream lineage
    )
    AND node.depth < :depth - 1 -- fetching single row means fetching single edge which is size 1
    AND NOT is_cycle
  )
  SELECT
      output_fields.namespace_name,
      output_fields.dataset_name,
      output_fields.field_name,
      output_fields.type,
      ARRAY_AGG(DISTINCT ARRAY[
        input_fields.namespace_name,
        input_fields.dataset_name,
        CAST(clr.input_dataset_version_uuid AS VARCHAR),
        input_fields.field_name,
        clr.transformation_description,
        clr.transformation_type
      ]) AS inputFields,
      clr.output_dataset_version_uuid as dataset_version_uuid
  FROM column_lineage_recursive clr
  INNER JOIN dataset_fields_view output_fields ON clr.output_dataset_field_uuid = output_fields.uuid -- hidden datasets will be filtered
  INNER JOIN dataset_symlinks ds_output ON ds_output.namespace_uuid = output_fields.namespace_uuid AND ds_output.name = output_fields.dataset_name
  LEFT JOIN dataset_fields_view input_fields ON clr.input_dataset_field_uuid = input_fields.uuid
  INNER JOIN dataset_symlinks ds_input ON ds_input.namespace_uuid = input_fields.namespace_uuid AND ds_input.name = input_fields.dataset_name
  WHERE NOT clr.is_cycle AND ds_output.is_primary is true AND ds_input.is_primary
  GROUP BY
      output_fields.namespace_name,
      output_fields.dataset_name,
      output_fields.field_name,
      output_fields.type,
      clr.output_dataset_version_uuid
dkt-sophie-ly commented 1 month ago

Thanks for the explanation, i'll look into that ! :)