dorx opened 3 years ago
Some notes from a brief discussion with @saulshanabrook
It would be nice to have a couple of user stories as a forcing function for us to evaluate the impact of the feature and think through the design more concretely.
Something like, a data scientist loaded in some data from S3 (or presto, or deltalake etc.), then they cleaned and modeled. A few days later, a data engineer productionizing it is no longer getting the same performance, so the data engineer ... (something about making use of the versions to achieve some goals).
These concrete workflows should also inform the "further design considerations" that need to be clarified before development.
@lionsardesai
After @dorx's in-person explanation yesterday, I see how this could work for us. Starting with option 1 especially would be easiest.
Currently, we have a major roadblock: multiple artifacts with the same name are not allowed, which prevents us from re-running a notebook against the same DB.
This proposal entails a number of UX changes: users would need to understand versions, and also re-execution, which we currently don't expose to users (although we test it internally).
What about adding a simple fix to support re-execution for now, like overwriting an artifact, or always choosing the latest when doing a get (see the sketch at the end of this comment)?
I worry that if we add versioning now, without exposing it to users, our UX might change later and we would have to take it out. It also adds some conceptual complexity: now users need to understand versions as well as re-execution, which could be worth it, but isn't free by itself.
How about we start a conversation with our alpha users on the semantics of versions?
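As a concrete illustration of that stopgap, here is a minimal in-memory mock of the "no duplicate-name error, get returns the latest" semantics. The `save`/`get` names are stand-ins for whatever API we expose; this is not LineaPy code:

```python
# Minimal mock of the stopgap semantics (illustrative only, not LineaPy code).
from collections import defaultdict

_store = defaultdict(list)

def save(value, name):
    _store[name].append(value)  # re-execution appends; no duplicate-name error

def get(name):
    return _store[name][-1]     # a get always resolves to the most recent save

save({"rows": 100}, "cleaned_data")  # first notebook run
save({"rows": 101}, "cleaned_data")  # restart & run-all: no name conflict
assert get("cleaned_data") == {"rows": 101}
```

Note this keeps both rows in the DB rather than destroying history, so it stays compatible with the fuller versioning semantics discussed below.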
IMO, always overwriting doesn't seem like the most user-friendly option. It should be something that the user can explicitly set. So always create a new version by default, but have a bool arg in `.save` for manually setting overwrite.
As for `.get`, checking out the latest by default makes a ton of sense.
We should add an API somewhere, perhaps in the `Artifact` object, that allows users to explore the version history for an artifact and switch to a specific version if needed. Or we could have a top-level API that allows queries about the version history of a specific artifact. Having both would probably be best (a sketch of what this could look like follows).
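Here is a runnable sketch of that interface under the semantics proposed above. All names here (`save`, `get`, the `overwrite` flag, the `version` kwarg, `Artifact.versions`) are illustrative assumptions for discussion, not a settled API:

```python
# Mock of the proposed versioning interface (illustrative only, not LineaPy code).
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    _history: list = field(default_factory=list)

    def versions(self):
        """Version numbers for this artifact, oldest first."""
        return list(range(1, len(self._history) + 1))

    def get_version(self, version):
        return self._history[version - 1]

_artifacts = {}

def save(value, name, overwrite=False):
    art = _artifacts.setdefault(name, Artifact(name))
    if overwrite and art._history:
        art._history[-1] = value    # explicit opt-in: replace the latest version
    else:
        art._history.append(value)  # default: every save creates a new version
    return art

def get(name, version=None):
    art = _artifacts[name]          # latest by default, or pin a specific version
    return art._history[-1] if version is None else art.get_version(version)

art = save("model-v1", "churn_model")
save("model-v2", "churn_model")                    # new version by default
save("model-v2-fixed", "churn_model", overwrite=True)
assert get("churn_model") == "model-v2-fixed"      # get() resolves to the latest
assert get("churn_model", version=1) == "model-v1"
assert art.versions() == [1, 2]                    # version-history query
```

The `versions()` method stands in for the per-object history API; a hypothetical top-level helper like `artifact_versions(name)` could wrap the same lookup for the query-style API.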
Would love to hear from our @LineaLabs/alpha-users on how they'd like to interact with versions.
> IMO, always overwriting doesn't seem like the most user-friendly option. It should be something that the user can explicitly set. So always create a new version by default, but have a bool arg in `.save` for manually setting overwrite.
Ah yeah I think I mean the same thing by "overwriting," in the sense that we would leave both artifacts in the DB, but when the user did a get, they would retrieve the latest.
Relevant reference on the topic: Airflow DAG versioning
Problem: As users develop data pipelines, they can introduce new operators, delete operators, and tweak operators within the pipeline. Usually, these modifications are considered different versions of the same pipeline, because they feed into the same application or update the same data output location. Furthermore, even if the pipeline code doesn't change, the data could change between two executions of the pipeline, resulting in different outputs.
Current solution: Currently, code versioning is done either through GitHub for Python scripts or notebooks, or the user manually documents the version via tags/comments in workflow orchestration platforms. Data versioning is rarely done, despite being a top-of-mind concern for many ML engineers & data scientists. Only the latest version of the data is kept, so there is no way to retrieve old data for reproducibility.
Proposed Linea Solution: We propose two types of versions: major versions for code changes, and minor versions for data changes. Whenever the user publishes to the same artifact name (assuming artifact names are unique), a new version could be created. TODO: API design decision: choose one of the following options for a major version bump (a sketch of the three checks follows the list):
1. Every time the user calls `.linea_publish`, a new major version is created, regardless of whether the pipeline changed. Cons: this could be very problematic if the user is doing restart & run-all constantly. Pros: it gives users the most agency.
2. On `.linea_publish`, only bump the version if the raw code string has changed. Pros: we would be able to capture changes such as documentation changes. Cons: we could be bumping versions for cosmetic changes that do not change the pipeline.
3. On `.linea_publish`, only bump the version if the DAG has been modified, which may or may not count variable name changes that do not alter dependencies. Pros: versions of the pipeline that generate identical results due to the same dependency structure would be considered equivalent. Cons: we wouldn't be able to capture changes in comments, etc., that do not alter dependencies but are still meaningful to the user.

Minor version bump option (execution of pipelines without major version changes):
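To make the trade-offs between the three major-version options concrete, here is a sketch of how each check could be implemented, assuming we can obtain the pipeline's raw source string and some canonical serialization of its DAG (both assumptions; none of these names exist in the codebase):

```python
# Sketch: when does .linea_publish bump the major version under each option?
import hashlib

def _digest(s):
    return hashlib.sha256(s.encode()).hexdigest()

def needs_major_bump(option, prev, code, dag_repr):
    if option == 1:
        return True                                   # every publish bumps
    if option == 2:
        return _digest(code) != prev["code_hash"]     # raw code string changed
    if option == 3:
        return _digest(dag_repr) != prev["dag_hash"]  # dependency structure changed
    raise ValueError(option)

prev = {"code_hash": _digest("x = load()\ny = clean(x)"),
        "dag_hash": _digest("load->clean")}

# A comment-only edit changes the code string but not the DAG:
new_code = "x = load()  # raw data\ny = clean(x)"
print(needs_major_bump(1, prev, new_code, "load->clean"))  # True
print(needs_major_bump(2, prev, new_code, "load->clean"))  # True
print(needs_major_bump(3, prev, new_code, "load->clean"))  # False
```

The comment-only edit shows exactly where options 2 and 3 diverge: option 2 treats it as a new major version, option 3 does not.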
Further design considerations: