Open collado-mike opened 2 years ago
Of note - it is expected that both the version
column and the uuid
column are unique in the dataset_versions
table (see https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L156-L158 and https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L198-L200 - both queries are expected to return a single result). What's the point in having both of these columns? I imagine the expectation is that the version
column could be used to track external versions (e.g., delta or iceberg dataset version ids), but in that case, we can't expect the version
to be unique. And in that case, shouldn't the DatsetVersionId
always be the fully qualified external dataset id (that is the external version)?
However, in the RunDao, when we construct the DatasetVersionId of the inputs and outputs of a Run, we use the version property
The DatasetVersionId
should always be using the uuid
column in dataset_versions
, the version
column stores the resulting version uuid
when the versioning function is applied to dataset metadata.
What's the point in having both of these columns?
The uuid
column in dataset_versions
is the user facing versioning, while the version
column is internal and allows use to iterate on the versioning function without having user facing implications (like invalidating all dataset versions because the versioning logic has changed).
The
uuid
column indataset_versions
is the user facing versioning, while theversion
column is internal
If that was the original intention, it is definitely no longer the case. See the dataset/versions/
API. We call DatasetVersionService.findByWithRun
and pass in the version
parameter. That code calls findBy
, which, as I noted above, queries the version
column. Also, as pointed out, the RunDao
, constructs the DatasetVersionId
from the version
column and that Run
object is the return value of several APIs. From what I can see, the uuid
is never returned to the caller.
The uuid
is returned to the caller, but under the version
(or currentVersion
) field name in the response payload when creating or retrieving a dataset. We didn't want to return any uuids
to the caller indicating a primary key (in the early stages of Marquez, we were also thinking that maybe a graph database would be used in the future). This is similar to the runID where the column is uuid
, but returned as id
in the response payload.
We do return the uuid
to the caller when creating a dataset by setting the currentVersion
of a dataset to the uuid
of newly inserted version in dataset_versions
, see https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L106. We also do the same for job versions https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/JobVersionDao.java#L358
So, I do agree the names are confusing as the users sees version
which maps to the uuid
column in the dataset_versions
, but the application code sees version
column in dataset_versions
as the internal uuid
that may trigger a new row to be inserted.
It really comes down to naming here. Maybe we should have gone with checksum
or something similar to avoid the confusing. But, we also assumed we'd be exposing the version
column in dataset_versions
to users, but that was before the introduction of OpenLineage. I'm open to renaming the column in both job_versions
and dataset_versions
to something that's less confusing.
Definitely not just a naming issue - look at the code I linked to. The RunDao
returns the version
column from the database. I also confirmed the dataset/versions
API looks up datasets based on the version
column, not the uuid
column. E.g., with a freshly seeded instance:
$ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories"
{"id":{"namespace":"food_delivery","name":"public.categories"},"type":"DB_TABLE","name":"public.categories","physicalName":"public.categories","createdAt":"2022-05-05T22:48:12.288321Z","updatedAt":"2022-05-05T22:55:46.524400Z","namespace":"food_delivery","sourceName":"analytics_db","fields":[{"name":"menu_id","type":"INTEGER","tags":[],"description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","tags":[],"description":"The description of the category."},{"name":"id","type":"INTEGER","tags":[],"description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","tags":[],"description":"The name of the category."}],"tags":[],"lastModifiedAt":"2022-05-05T22:52:01.524400Z","lastLifecycleState":null,"description":null,"currentVersion":"e247cf97-d567-4c51-81e4-7c3eded64dd6","facets":{"stats":{"size":4430400,"rowCount":1000},"schema":{"fields":[{"name":"id","type":"INTEGER","description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","description":"The name of the category."},{"name":"menu_id","type":"INTEGER","description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","description":"The description of the category."}],"_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet"},"dataSource":{"uri":"jdbc:postgresql://localhost:3306/deliveries","name":"analytics_db","_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet"},"description":"A table for categories.","dataQuality":{"bytes":4430400,"rowCount":1000,"columnMetrics":{"id":{"nullCount":0,"distinctCount":1000},"name":{"nullCount":0,"distinctCount":700}}},"greatExpectations_assertions":{"assertions":[{"success":true,"expectationType":"expect_table_row_count_to_be_between"},{"success":true,"expectationType":"expect_column_to_exist"},{"column":"id","success":true,"expectationType":"expect_column_values_to_be_unique"},{"column":"id","success":true,"expectationType":"expect_column_values_to_not_be_null"},{"column":"name","success":true,"expectationType":"expect_column_values_to_not_match_regex"},{"column":"menu_id","success":true,"expectationType":"expect_column_values_to_not_be_null"}]}},"deleted":false}
Here, the currentVersion
field returns e247cf97-d567-4c51-81e4-7c3eded64dd6
. But if I try to get that version:
$ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories/versions/e247cf97-d567-4c51-81e4-7c3eded64dd6"
{"code":404,"message":"Dataset version 'e247cf97-d567-4c51-81e4-7c3eded64dd6' not found."}
And if I query all versions for that dataset
curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories/versions" | jq '.versions | map(.version)' | grep "e247cf97-d567-4c51-81e4-7c3eded64dd6"
the response is empty.
So, I was wrong that the uuid
column is never returned. But it's only returned sometimes and at other times, the version
column is returned.
So, I was wrong that the uuid column is never returned. But it's only returned sometimes and at other times, the version column is returned.
Right, I agree with you that there's an issue, and certain queries are using the wrong column, what I meant was that it's a side effect of the confusion around the column naming. Here's a related issue https://github.com/MarquezProject/marquez/issues/1883
Ultimately I agree with @wslulciuc. The queries DatasetVersionDao.findBy
, DatasetVersionDao.findByUuid
, and DatasetVersionDao.findAll
use the column version
, which, as I understand, is intended to be the internal/private DatasetVersion identifier. These queries (and the supporting data models/mappers), can be updated to use the DatasetVersion UUID (afaiu the public facing DatasetVersion identifier).
I agree that this is pretty confusing, and version
, should probably be renamed, and the relationship between the UUID & the existing version
should be more clearly defined. I propose renaming version
to internalVersionUuid
, updating DatasetVersionDao.find*
to use the DatasetVersion UUID, and adding a javadoc document comment to DatasetVersionDao
to highlight the difference between UUID
and the to-be-named internalVersionUuid
.
This update would also pave the way to resolve #1883
Thoughts?
My concern with using the version
column as our internal id is that it prohibits us from supporting versions from dataset sources, like Iceberg or Delta. I think it would be a better experience for users to be able to accept the original version in our URLs and to return the original version (e.g., a run outputs Dataset@versionXYZ and a user can simply query Iceberg Dataset@versionXYZ). This means we can't use the version as the primary key, as there's no guarantee that two data systems won't use the same version id.
In general, I think it's better to hide the primary key from users when there's another user-facing unique key that can be used instead (e.g., we use job names and dataset names, not primary keys, which aren't even part of the user-facing service model).
I might be misunderstanding, but version
in this context is a UUID, so I don't think a collision with the version id of another system is a concern.
My concern with using the version column as our internal id is that it prohibits us from supporting versions from dataset sources, like Iceberg or Delta
@collado-mike: I view this concern very similar to runIDs. Given that a runID must be a valid UUID
(based on the OpenLineage spec), any external runID (ex: Airflow that uses a date) must be translated to a UUID
before stored. This why we added the external_id
column to the runs
table (see V19__alter_run_to_add_external_id
)
The Marquez API doesn't provide the option to query a run based on an external runID, but that can be something we can eventually support and even display in the UI (which is probably a good idea anyways). Similarly for versioning, we can support an external version that is provided by the user as an OpenLineage facet as part of the event (that can possibly be queried). As @RNHTTR, the version must be a valid UUID
either way, and, like the runs
table, we can add an external_version
column to the datasets_versions
and job_versions
tables.
Now, it really comes down to naming here. I think the version
column is poorly named, and I think internalVersionUuid
would still cause some confusion. The version
column is used only within the versions tables for datasets and jobs to predictably generate datasets and job versions form metadata (see alsonewDatasetVersionFor()
and newJobVersionFor
).
How about we just use checksum
instead? The run_args
table has a checksum
column as well for similar purposes.
TBH, I don't see a lot of value coming from an internal version
column for datasets that is distinct from its UUID. The job version generation makes sense, as a heuristic approach to detecting job changes - from one run to the next, we can't know if the job code or configuration has changed. We make our best guess based on the inputs. With datasets, we know that every run that writes to, or otherwise alters, a dataset will change its version. That being the case, we could simply use run_uuid
as the column value and it would provide exactly the same value. In fact, it would be simpler and more deterministic (it wouldn't change from one version of Marquez to the next). But if we're given a version? Why use heuristics at all?
I agree there are similarities to the job run_uuid
- in fact, I opened the original issue in OpenLineage proposing that we make run_id
a UUID - https://github.com/OpenLineage/OpenLineage/issues/29 . The reasoning there, however, was for the sake of tracing run hierarchies -
Given the implementation already uses UUIDs for job runs, I think the requirement to pass the additional job information is a bit burdensome and redundant. It also leaves room for confusion in implementation projects, which would be nice to eliminate.
Because run_id
s weren't unique, parents had to provide a lot more context to children about the identifier of the job run. A UUID meant that a run could be identified entirely from that one parameter without having to pass other context. Datasets don't have that problem. We don't need a globally (or universally) unique id for a given dataset version. A monotonically increasing integer would work just fine.
From that perspective, I don't agree that the version must be a valid UUID. Especially, if the original intention was to keep the version
column hidden, the fact that we do use a UUID is entirely incidental.
So I think that if we can eliminate artificial constraints on what the version column contains, we can start to consider alternatives that provide more value for the user. I think everyone agrees that random UUIDs are not human-friendly or intuitive URL values. For that reason, we use job names and namespaces in our URLs rather than their primary keys. That URL design means it is possible to one day support cross-linking between different tools - e.g., imagine a link in the Airflow UI that says "see lineage for this task in Marquez" and the Airflow code immediately knows how to construct the link without knowing any of the details about how Marquez constructs its primary keys.
We can offer that same experience for Dataset versions - given a dataset version in any tool, we can easily construct a URL to the lineage of that dataset without knowing anything about the internal details of how we Marquez constructs a checksum or primary key. A company that had multiple Marquez deployments could rely on deterministic Dataset URLs between deployments in cases where lineage crossed installations (Willy, I'm thinking of a specific case you'll be familiar with).
Primary keys are great for creating foreign keys and indexes (it's easier for another table to simply reference a record's UUID rather than using composite keys), but I don't think they're great user-facing values. I think we already have better alternatives.
In the
OpenLineageService
, we construct aDatasetVersionId
using theuuid
property of the record - that is, the primary key. However, in theRunDao
, when we construct theDatasetVersionId
of the inputs and outputs of aRun
, we use theversion
property. This leads to confusion, as code that depends on the return value of theOpenLineageService
will expect aDatasetVersionId
that can be matched to the values returned by theRunDao
and it turns out they never match.OpenLineageService code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/OpenLineageService.java#L208
RunDao code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/RunDao.java#L85-L87