MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0
1.76k stars 314 forks source link

Discrepancy in building DatasetVersionIds and which `version` is used #1977

Open collado-mike opened 2 years ago

collado-mike commented 2 years ago

In the OpenLineageService, we construct a DatasetVersionId using the uuid property of the record - that is, the primary key. However, in the RunDao, when we construct the DatasetVersionId of the inputs and outputs of a Run, we use the version property. This leads to confusion, as code that depends on the return value of the OpenLineageService will expect a DatasetVersionId that can be matched to the values returned by the RunDao and it turns out they never match.

OpenLineageService code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/service/OpenLineageService.java#L208

RunDao code: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/RunDao.java#L85-L87

collado-mike commented 2 years ago

Of note - it is expected that both the version column and the uuid column are unique in the dataset_versions table (see https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L156-L158 and https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L198-L200 - both queries are expected to return a single result). What's the point in having both of these columns? I imagine the expectation is that the version column could be used to track external versions (e.g., delta or iceberg dataset version ids), but in that case, we can't expect the version to be unique. And in that case, shouldn't the DatsetVersionId always be the fully qualified external dataset id (that is the external version)?

wslulciuc commented 2 years ago

However, in the RunDao, when we construct the DatasetVersionId of the inputs and outputs of a Run, we use the version property

The DatasetVersionId should always be using the uuid column in dataset_versions, the version column stores the resulting version uuid when the versioning function is applied to dataset metadata.

What's the point in having both of these columns?

The uuid column in dataset_versions is the user facing versioning, while the version column is internal and allows use to iterate on the versioning function without having user facing implications (like invalidating all dataset versions because the versioning logic has changed).

collado-mike commented 2 years ago

The uuid column in dataset_versions is the user facing versioning, while the version column is internal

If that was the original intention, it is definitely no longer the case. See the dataset/versions/ API. We call DatasetVersionService.findByWithRun and pass in the version parameter. That code calls findBy, which, as I noted above, queries the version column. Also, as pointed out, the RunDao, constructs the DatasetVersionId from the version column and that Run object is the return value of several APIs. From what I can see, the uuid is never returned to the caller.

wslulciuc commented 2 years ago

The uuid is returned to the caller, but under the version (or currentVersion) field name in the response payload when creating or retrieving a dataset. We didn't want to return any uuids to the caller indicating a primary key (in the early stages of Marquez, we were also thinking that maybe a graph database would be used in the future). This is similar to the runID where the column is uuid, but returned as id in the response payload.

We do return the uuid to the caller when creating a dataset by setting the currentVersion of a dataset to the uuid of newly inserted version in dataset_versions, see https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/DatasetVersionDao.java#L106. We also do the same for job versions https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/JobVersionDao.java#L358

So, I do agree the names are confusing as the users sees version which maps to the uuid column in the dataset_versions, but the application code sees version column in dataset_versions as the internal uuid that may trigger a new row to be inserted.

It really comes down to naming here. Maybe we should have gone with checksum or something similar to avoid the confusing. But, we also assumed we'd be exposing the version column in dataset_versions to users, but that was before the introduction of OpenLineage. I'm open to renaming the column in both job_versions and dataset_versions to something that's less confusing.

collado-mike commented 2 years ago

Definitely not just a naming issue - look at the code I linked to. The RunDao returns the version column from the database. I also confirmed the dataset/versions API looks up datasets based on the version column, not the uuid column. E.g., with a freshly seeded instance:

$ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories"
{"id":{"namespace":"food_delivery","name":"public.categories"},"type":"DB_TABLE","name":"public.categories","physicalName":"public.categories","createdAt":"2022-05-05T22:48:12.288321Z","updatedAt":"2022-05-05T22:55:46.524400Z","namespace":"food_delivery","sourceName":"analytics_db","fields":[{"name":"menu_id","type":"INTEGER","tags":[],"description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","tags":[],"description":"The description of the category."},{"name":"id","type":"INTEGER","tags":[],"description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","tags":[],"description":"The name of the category."}],"tags":[],"lastModifiedAt":"2022-05-05T22:52:01.524400Z","lastLifecycleState":null,"description":null,"currentVersion":"e247cf97-d567-4c51-81e4-7c3eded64dd6","facets":{"stats":{"size":4430400,"rowCount":1000},"schema":{"fields":[{"name":"id","type":"INTEGER","description":"The unique ID of the category."},{"name":"name","type":"VARCHAR","description":"The name of the category."},{"name":"menu_id","type":"INTEGER","description":"The ID of the menu related to the category."},{"name":"description","type":"TEXT","description":"The description of the category."}],"_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet"},"dataSource":{"uri":"jdbc:postgresql://localhost:3306/deliveries","name":"analytics_db","_producer":"com.datakin.cli.SeedCommand","_schemaURL":"https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet"},"description":"A table for categories.","dataQuality":{"bytes":4430400,"rowCount":1000,"columnMetrics":{"id":{"nullCount":0,"distinctCount":1000},"name":{"nullCount":0,"distinctCount":700}}},"greatExpectations_assertions":{"assertions":[{"success":true,"expectationType":"expect_table_row_count_to_be_between"},{"success":true,"expectationType":"expect_column_to_exist"},{"column":"id","success":true,"expectationType":"expect_column_values_to_be_unique"},{"column":"id","success":true,"expectationType":"expect_column_values_to_not_be_null"},{"column":"name","success":true,"expectationType":"expect_column_values_to_not_match_regex"},{"column":"menu_id","success":true,"expectationType":"expect_column_values_to_not_be_null"}]}},"deleted":false}

Here, the currentVersion field returns e247cf97-d567-4c51-81e4-7c3eded64dd6. But if I try to get that version:

$ curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories/versions/e247cf97-d567-4c51-81e4-7c3eded64dd6"
{"code":404,"message":"Dataset version 'e247cf97-d567-4c51-81e4-7c3eded64dd6' not found."}

And if I query all versions for that dataset

curl "http://localhost:5020/api/v1/namespaces/food_delivery/datasets/public.categories/versions" | jq '.versions | map(.version)' | grep "e247cf97-d567-4c51-81e4-7c3eded64dd6"

the response is empty.

So, I was wrong that the uuid column is never returned. But it's only returned sometimes and at other times, the version column is returned.

wslulciuc commented 2 years ago

So, I was wrong that the uuid column is never returned. But it's only returned sometimes and at other times, the version column is returned.

Right, I agree with you that there's an issue, and certain queries are using the wrong column, what I meant was that it's a side effect of the confusion around the column naming. Here's a related issue https://github.com/MarquezProject/marquez/issues/1883

RNHTTR commented 2 years ago

Ultimately I agree with @wslulciuc. The queries DatasetVersionDao.findBy, DatasetVersionDao.findByUuid, and DatasetVersionDao.findAll use the column version, which, as I understand, is intended to be the internal/private DatasetVersion identifier. These queries (and the supporting data models/mappers), can be updated to use the DatasetVersion UUID (afaiu the public facing DatasetVersion identifier).

I agree that this is pretty confusing, and version, should probably be renamed, and the relationship between the UUID & the existing version should be more clearly defined. I propose renaming version to internalVersionUuid, updating DatasetVersionDao.find* to use the DatasetVersion UUID, and adding a javadoc document comment to DatasetVersionDao to highlight the difference between UUID and the to-be-named internalVersionUuid.

This update would also pave the way to resolve #1883

Thoughts?

collado-mike commented 2 years ago

My concern with using the version column as our internal id is that it prohibits us from supporting versions from dataset sources, like Iceberg or Delta. I think it would be a better experience for users to be able to accept the original version in our URLs and to return the original version (e.g., a run outputs Dataset@versionXYZ and a user can simply query Iceberg Dataset@versionXYZ). This means we can't use the version as the primary key, as there's no guarantee that two data systems won't use the same version id.

In general, I think it's better to hide the primary key from users when there's another user-facing unique key that can be used instead (e.g., we use job names and dataset names, not primary keys, which aren't even part of the user-facing service model).

RNHTTR commented 2 years ago

I might be misunderstanding, but version in this context is a UUID, so I don't think a collision with the version id of another system is a concern.

wslulciuc commented 2 years ago

My concern with using the version column as our internal id is that it prohibits us from supporting versions from dataset sources, like Iceberg or Delta

@collado-mike: I view this concern very similar to runIDs. Given that a runID must be a valid UUID (based on the OpenLineage spec), any external runID (ex: Airflow that uses a date) must be translated to a UUID before stored. This why we added the external_id column to the runs table (see V19__alter_run_to_add_external_id)

The Marquez API doesn't provide the option to query a run based on an external runID, but that can be something we can eventually support and even display in the UI (which is probably a good idea anyways). Similarly for versioning, we can support an external version that is provided by the user as an OpenLineage facet as part of the event (that can possibly be queried). As @RNHTTR, the version must be a valid UUID either way, and, like the runs table, we can add an external_version column to the datasets_versions and job_versions tables.

Now, it really comes down to naming here. I think the version column is poorly named, and I think internalVersionUuid would still cause some confusion. The version column is used only within the versions tables for datasets and jobs to predictably generate datasets and job versions form metadata (see alsonewDatasetVersionFor() and newJobVersionFor).

How about we just use checksum instead? The run_args table has a checksum column as well for similar purposes.

collado-mike commented 2 years ago

TBH, I don't see a lot of value coming from an internal version column for datasets that is distinct from its UUID. The job version generation makes sense, as a heuristic approach to detecting job changes - from one run to the next, we can't know if the job code or configuration has changed. We make our best guess based on the inputs. With datasets, we know that every run that writes to, or otherwise alters, a dataset will change its version. That being the case, we could simply use run_uuid as the column value and it would provide exactly the same value. In fact, it would be simpler and more deterministic (it wouldn't change from one version of Marquez to the next). But if we're given a version? Why use heuristics at all?

I agree there are similarities to the job run_uuid - in fact, I opened the original issue in OpenLineage proposing that we make run_id a UUID - https://github.com/OpenLineage/OpenLineage/issues/29 . The reasoning there, however, was for the sake of tracing run hierarchies -

Given the implementation already uses UUIDs for job runs, I think the requirement to pass the additional job information is a bit burdensome and redundant. It also leaves room for confusion in implementation projects, which would be nice to eliminate.

Because run_ids weren't unique, parents had to provide a lot more context to children about the identifier of the job run. A UUID meant that a run could be identified entirely from that one parameter without having to pass other context. Datasets don't have that problem. We don't need a globally (or universally) unique id for a given dataset version. A monotonically increasing integer would work just fine.

From that perspective, I don't agree that the version must be a valid UUID. Especially, if the original intention was to keep the version column hidden, the fact that we do use a UUID is entirely incidental.

So I think that if we can eliminate artificial constraints on what the version column contains, we can start to consider alternatives that provide more value for the user. I think everyone agrees that random UUIDs are not human-friendly or intuitive URL values. For that reason, we use job names and namespaces in our URLs rather than their primary keys. That URL design means it is possible to one day support cross-linking between different tools - e.g., imagine a link in the Airflow UI that says "see lineage for this task in Marquez" and the Airflow code immediately knows how to construct the link without knowing any of the details about how Marquez constructs its primary keys.

We can offer that same experience for Dataset versions - given a dataset version in any tool, we can easily construct a URL to the lineage of that dataset without knowing anything about the internal details of how we Marquez constructs a checksum or primary key. A company that had multiple Marquez deployments could rely on deterministic Dataset URLs between deployments in cases where lineage crossed installations (Willy, I'm thinking of a specific case you'll be familiar with).

Primary keys are great for creating foreign keys and indexes (it's easier for another table to simply reference a record's UUID rather than using composite keys), but I don't think they're great user-facing values. I think we already have better alternatives.