LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0

UX Design: Artifact Versioning #273

Open dorx opened 3 years ago

dorx commented 3 years ago

Relevant reference on the topic: Airflow DAG versioning

Problem: As users develop data pipelines, they introduce new operators, delete operators, and tweak existing ones. These modifications are usually considered different versions of the same pipeline, because they feed into the same application or update the same data output location. Furthermore, even if the pipeline code doesn't change, the data could change between two executions of the pipeline, resulting in different outputs.

Current solution: Currently, code versioning is done either through GitHub for Python scripts and notebooks, or by the user manually documenting the version via tags/comments in workflow orchestration platforms. Data versioning is rarely done, despite being a top-of-mind concern for many ML engineers and data scientists: typically only the latest version of the data is kept, so there is no way to retrieve old data for reproducibility.

Proposed Linea Solution: We propose two types of versions: major versions for code changes and minor versions for data changes. Whenever the user publishes to the same artifact name (assuming artifact names are unique), it could potentially create a new version. TODO: API design decision: choose one of the following options for a major version bump (a rough sketch of all three checks follows the list):

  1. Every time the user calls .linea_publish, a new major version is created, regardless of whether the pipeline changed. Pros: it gives users the most agency. Cons: this could be very problematic if the user is constantly doing restart & run-all.
  2. Every time the user calls .linea_publish, only bump the version if the raw code string has changed. Pros: we would capture textual changes such as documentation updates. Cons: we could be bumping versions for cosmetic changes that don't affect the pipeline's behavior.
  3. Every time the user calls .linea_publish, only bump the version if the DAG has been modified (which may or may not count variable renames that don't alter dependencies). Pros: versions of the pipeline that generate identical results due to the same dependency structure would be considered equivalent. Cons: we wouldn't capture changes in comments, etc., that don't alter dependencies but are still meaningful to the user.
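For concreteness, a minimal sketch of how the three checks could be implemented (`PipelineSnapshot` and the node encoding are hypothetical stand-ins, not lineapy internals):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PipelineSnapshot:
    source: str  # raw code string as published
    nodes: list  # (operation, input ids) pairs in canonical topological order

def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_major_bump(prev: PipelineSnapshot, curr: PipelineSnapshot, policy: str) -> bool:
    if policy == "always":  # option 1: every publish bumps, no comparison needed
        return True
    if policy == "code":    # option 2: any textual change bumps (incl. comments/docs)
        return _sha(prev.source) != _sha(curr.source)
    if policy == "dag":     # option 3: bump only when the dependency structure changes
        def canon(nodes):
            return "|".join(f"{op}({','.join(inputs)})" for op, inputs in nodes)
        return _sha(canon(prev.nodes)) != _sha(canon(curr.nodes))
    raise ValueError(f"unknown policy: {policy}")
```

Note that option 3 is only as good as the canonicalization: any variable naming or incidental node ordering leaking into `canon()` would reintroduce option 2's cosmetic bumps.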

Minor version bump options (executions of the pipeline without a major version change); a sketch of the data-change check follows the list:

  1. Every time the pipeline is executed, a new minor version is created regardless of whether the source data has changed. Pros: no performance overhead of checking data; no false negatives on minor version equivalence. Cons: creates versions that don't correspond to material changes in the results, which could lead to redundant storage of output.
  2. Every time the pipeline is executed, only create a new minor version if the data source has changed (determined from a combination of metadata and data content). Pros: potentially very little performance overhead if we're just checking metadata; less chance of storing redundant results. Cons: false positives on minor version equivalence when the source data has changed but the metadata didn't capture the change; performance overhead whenever raw data values have to be compared.
  3. Every time the pipeline is executed, only create a new minor version if the output has changed, based on a hash of the output. Pros: efficient storage; versions correspond well with actual data changes. Cons: performance overhead of hashing/diffing data.
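A rough sketch of the data-change check behind options 2 and 3, assuming file-based sources (the function names here are illustrative, not lineapy APIs):

```python
import hashlib
import os

def source_fingerprint(path: str, check_content: bool = False) -> str:
    # Cheap metadata check (option 2's fast path): size + modification time.
    stat = os.stat(path)
    if not check_content:
        return f"{stat.st_size}:{stat.st_mtime_ns}"
    # Content hash (option 2's fallback; option 3 applies the same idea
    # to the pipeline *output* rather than the input).
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_minor_bump(prev_fp: "str | None", curr_fp: str) -> bool:
    # Option 1 would skip fingerprinting entirely and always return True.
    return prev_fp is None or prev_fp != curr_fp
```

Option 2 would fingerprint the inputs before execution; option 3 would apply the content hash to the output after execution, trading hashing cost for storage that tracks real changes.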

Further design considerations:

yifanwu commented 3 years ago

Some notes from a brief discussion with @saulshanabrook

It would be nice to have a couple of user stories as a forcing function for us to evaluate the impact of the feature and think through the design more concretely.

Something like: a data scientist loads some data from S3 (or Presto, Delta Lake, etc.), then cleans and models it. A few days later, a data engineer productionizing it is no longer getting the same performance, so the data engineer ... (something about making use of the versions to achieve some goal).

These concrete workflows should also help clarify the "further design considerations", which need to be settled before development.

dorx commented 3 years ago

@lionsardesai

saulshanabrook commented 3 years ago

After @dorx's in-person explanation yesterday, I see how this could work for us. Starting with option 1 would be easiest.

Currently, we have a major roadblock that multiple artifacts with the same name are not allowed. This is preventing us from using the same DB to re-run a notebook.

This proposal entails a number of UX changes: around having users understand versions, and also around re-execution, which we currently don't expose to users (although we test it internally).

What about adding a simple fix to support re-execution for now, like overwriting an artifact, or always choosing the latest when doing a get? That way, we can wait until we have further nailed down the user stories before adding these versioning semantics.
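Something like this toy sketch of the "latest wins on get" behavior (illustrative only, not our actual storage layer):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ArtifactRecord:
    name: str
    value: object
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ArtifactStore:
    """Toy store: duplicate names are allowed; get() returns the newest row."""

    def __init__(self):
        self._rows: list[ArtifactRecord] = []

    def save(self, name: str, value: object) -> ArtifactRecord:
        row = ArtifactRecord(name, value)
        self._rows.append(row)  # keep every row; no versioning semantics yet
        return row

    def get(self, name: str) -> ArtifactRecord:
        matches = [r for r in self._rows if r.name == name]
        if not matches:
            raise KeyError(name)
        return max(matches, key=lambda r: r.created_at)
```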

I worry that if we add it now, without exposing it to users, our UX might change later and we would have to take it out. It also adds some conceptual complexity: now users need to understand versions as well as re-execution, which could be worth it, but isn't free.

dorx commented 3 years ago

How about we start a conversation with our alpha users on the semantics of versions?

IMO, always overwriting doesn't seem like the most user-friendly option. It should be something that the user can explicitly set. So always create a new version by default, but have a bool arg in .save for manually setting overwrite.

As for .get, checking out the latest by default makes a ton of sense.

We should add an API somewhere, perhaps on the Artifact object, that allows users to explore an artifact's version history and switch to a specific version if needed. Or we could have a top-level API that allows queries about the version history of a specific artifact. Having both would probably be best.
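To make this concrete, here's a sketch of what the surface could look like (every name and argument below is a proposal, not an existing lineapy API):

```python
import lineapy  # everything below is a proposed surface, not the shipped API

model = {"weights": [0.1, 0.2]}  # stand-in for any Python value worth saving

# Save: new version by default; explicit opt-in to overwrite via a bool arg.
art = lineapy.save(model, "fraud_model")                  # bumps to a new version
art = lineapy.save(model, "fraud_model", overwrite=True)  # proposed: replace latest

# Get: latest by default, or pin an explicit version.
latest = lineapy.get("fraud_model")
pinned = lineapy.get("fraud_model", version=3)

# Explore history on the Artifact object (proposed)...
for v in art.versions():
    print(v.version, v.created_at)

# ...and/or via a top-level query (proposed).
history = lineapy.artifact_versions("fraud_model")
```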

Would love to hear from our @LineaLabs/alpha-users on how they'd like to interact with versions.

saulshanabrook commented 3 years ago

> IMO, always overwriting doesn't seem like the most user-friendly option. It should be something that the user can explicitly set. So always create a new version by default, but have a bool arg in .save for manually setting overwrite.

Ah yeah, I think I mean the same thing by "overwriting": we would leave both artifacts in the DB, but when the user did a get, they would retrieve the latest.