Closed dkt-sophie-ly closed 1 month ago
Hey @dkt-sophie-ly
Yes, this is indeed difficult to reason about. @julienledem and I have had a few in depth (no pun intended 😄) discussions about this and what we would like this depth
parameters to work like. I see that you've noticed were not node centric and perhaps we should be to make sense of this. The depth
parameter is the count of recursive calls that we do on the query posted below, so the UI is really proxying the query response and pretty dumb in that sense.
At this time it heavily anchored to the lineage query itself and also it's important to understand that lineage is job centric (not dataset centric) as we find the attached job to a dataset before executing.
I think the biggest source of the confusion which is a little opaque to me from your screenshots is that you are in the "Critical Path" display mode which removes nodes that are not directly connected to your selected node, but I think you can still be influenced in terms of depth nodes not on the critical path. That toggle switch is UI only and simply filters nodes so we still have the other nodes, just are not rendering them.
WITH RECURSIVE
column_lineage_latest AS (
SELECT DISTINCT ON (output_dataset_field_uuid, input_dataset_field_uuid) *
FROM column_lineage
WHERE created_at <= :createdAtUntil
ORDER BY output_dataset_field_uuid, input_dataset_field_uuid, updated_at DESC, updated_at
),
dataset_fields_view AS (
SELECT d.namespace_name as namespace_name, d.name as dataset_name, df.name as field_name, df.type, df.uuid, d.namespace_uuid
FROM dataset_fields df
INNER JOIN datasets_view d ON d.uuid = df.dataset_uuid
),
column_lineage_recursive AS (
(
SELECT
*,
0 as depth,
false as is_cycle,
ARRAY[ROW(output_dataset_field_uuid, input_dataset_field_uuid)] as path -- path and is_cycle mechanism as describe here https://www.postgresql.org/docs/current/queries-with.html (CYCLE clause not available in postgresql 12)
FROM column_lineage_latest
WHERE output_dataset_field_uuid IN (<datasetFieldUuids>)
)
UNION ALL
SELECT
adjacent_node.output_dataset_version_uuid,
adjacent_node.output_dataset_field_uuid,
adjacent_node.input_dataset_version_uuid,
adjacent_node.input_dataset_field_uuid,
adjacent_node.transformation_description,
adjacent_node.transformation_type,
adjacent_node.created_at,
adjacent_node.updated_at,
node.depth + 1 as depth,
ROW(adjacent_node.input_dataset_field_uuid, adjacent_node.output_dataset_field_uuid) = ANY(path) as is_cycle,
path || ROW(adjacent_node.input_dataset_field_uuid, adjacent_node.output_dataset_field_uuid) as path
FROM column_lineage_latest adjacent_node, column_lineage_recursive node
WHERE (
(node.input_dataset_field_uuid = adjacent_node.output_dataset_field_uuid) --upstream lineage
OR (:withDownstream AND adjacent_node.input_dataset_field_uuid = node.output_dataset_field_uuid) --optional downstream lineage
)
AND node.depth < :depth - 1 -- fetching single row means fetching single edge which is size 1
AND NOT is_cycle
)
SELECT
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type,
ARRAY_AGG(DISTINCT ARRAY[
input_fields.namespace_name,
input_fields.dataset_name,
CAST(clr.input_dataset_version_uuid AS VARCHAR),
input_fields.field_name,
clr.transformation_description,
clr.transformation_type
]) AS inputFields,
clr.output_dataset_version_uuid as dataset_version_uuid
FROM column_lineage_recursive clr
INNER JOIN dataset_fields_view output_fields ON clr.output_dataset_field_uuid = output_fields.uuid -- hidden datasets will be filtered
INNER JOIN dataset_symlinks ds_output ON ds_output.namespace_uuid = output_fields.namespace_uuid AND ds_output.name = output_fields.dataset_name
LEFT JOIN dataset_fields_view input_fields ON clr.input_dataset_field_uuid = input_fields.uuid
INNER JOIN dataset_symlinks ds_input ON ds_input.namespace_uuid = input_fields.namespace_uuid AND ds_input.name = input_fields.dataset_name
WHERE NOT clr.is_cycle AND ds_output.is_primary is true AND ds_input.is_primary
GROUP BY
output_fields.namespace_name,
output_fields.dataset_name,
output_fields.field_name,
output_fields.type,
clr.output_dataset_version_uuid
Thanks for the explanation, i'll look into that ! :)
I'm not sure if it's a real issue or a misunderstandings about this feature but I feel like the change of depth on the lineage is quite inconsistant.
On this first example:
We focus on the dataset surrounded in yellow, now depth=2 so we can see 2 datasets above the one focused and 3 behind:
So if I click on dataset located in depth= -2 and i put the parameter depth=4 or more (6 here) I expect to get the same elements in the lineage as the fist image (dataset at depth=1 or 2) but this is not the case.
If the first lineage we have 3 datasets at depth=2 but on the second image at depth=6 I have only 2 dataset and we expect at least 3 dataset as in image 1).
So I don't know if it's the expected behaviour or not
Here on the second example:
We focus on the dataset surround in yellow (dataset2) and the depth = 2, we can see this very simple lineage with dataset 1 and 2.
Si if I click on dataset1 with depth=2 I expect to this the very same lineage, but it appears alone like if it's linked to nothing:
This seems more like a bug than the first case
Let me know what you think :)