StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Investigate Data Lineage/Snapshotting Options #148

Closed ca-scribner closed 2 years ago

ca-scribner commented 4 years ago

Epic #57

Look into tools/workflows that allow for traceable data lineage. Some examples:

ca-scribner commented 4 years ago

After doing some research I feel a little uncertain about the user stories we're trying to address with these tools. @brendangadd @blairdrummond @chritter what are your thoughts?

Some possible examples:

What else?

ca-scribner commented 4 years ago

For version control of the code behind the artifacts, there are some good examples of CI-driven version control here. They don't feel lightweight (they take some setup, and users would need to learn how to configure them), but they do feel comprehensive (all containers/yamls/everything else are well controlled by CI). We could also adapt this to run from a script rather than on CI commit if that's preferable.

The downside is that these patterns rely on custom docker images and access to a cloud container builder, which could be a problem for us. We could probably use the same style of approach with lightweight containers, though. Maybe locally annotate the pipeline.py with the github SHA on launch, or even just accept the github SHA as a variable (and then put the SHA in the pipeline version info shown in the kfp UI).
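The "accept the github SHA as a variable" idea could be sketched roughly like this. The pipeline name and the commented-out upload step are illustrative only (uploading needs a live KFP cluster and the kfp 1.x SDK), not anything we have built:

```python
import re
import subprocess

def current_git_sha() -> str:
    """Short SHA of HEAD; assumes this script runs inside a git checkout."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def pipeline_version_name(pipeline: str, sha: str) -> str:
    """Build a version name like 'train-pipeline-a1b2c3d' so the SHA
    shows up in the KFP UI's pipeline version info."""
    if not re.fullmatch(r"[0-9a-f]{4,40}", sha):
        raise ValueError(f"not a git SHA: {sha!r}")
    return f"{pipeline}-{sha[:7]}"

# Hypothetical upload step (requires a KFP cluster; kfp 1.x SDK):
# import kfp
# client = kfp.Client()
# client.upload_pipeline_version(
#     "pipeline.yaml",
#     pipeline_version_name("train-pipeline", current_git_sha()),
#     pipeline_name="train-pipeline",
# )
```

With this, anyone looking at a run in the KFP UI can trace the pipeline version name straight back to a commit.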

sylus commented 4 years ago

I think we can make this simpler by using Kaniko and letting users develop their own images internally (I know we will need some guardrails).

@zachomedia and I are doing a proof-of-concept example covering data lineage and run comparison, and sending the model to MLflow.

We are basing our research on:

https://github.com/kaizentm/kubemlops
https://github.com/kaizentm/kubemlops/blob/master/docs/mlops-github.md

Kaniko example:

https://github.com/kerstinpittl/kubeflow-workshop/blob/master/notebooks/02_Fairing/02_04_fairing_kaniko_cloud_builder.ipynb

DVC

I also like the new CML they launched, so I will try to get a PoC going for this as well.

We're making progress; this should be done by the next sprint retrospective.

ca-scribner commented 4 years ago

Kaniko is for simplifying the container build process, right? But then we still have other parts of the system keeping track of artifact/data lineage, remembering which containers were used in a pipeline, etc.?

sylus commented 4 years ago

Yup, but that is what the data lineage tracks if you set it up with the KFP pipeline or have a custom watcher looking for certain annotations. I'm going to try to show this as a working example, and then it will make more sense.
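A minimal sketch of the annotation-filtering side of such a watcher. The annotation prefix is a made-up convention (nothing agreed on here), and the actual pod-watch loop needs a cluster, so it's left as a comment:

```python
def lineage_annotations(annotations, prefix="data-lineage/"):
    """Pull out the annotations a lineage watcher would care about.
    The `data-lineage/` prefix is hypothetical, not an agreed-on key."""
    annotations = annotations or {}
    return {
        key[len(prefix):]: value
        for key, value in annotations.items()
        if key.startswith(prefix)
    }

# Hypothetical watch loop using the official kubernetes Python client:
# from kubernetes import client, config, watch
# config.load_incluster_config()
# v1 = client.CoreV1Api()
# for event in watch.Watch().stream(v1.list_namespaced_pod, namespace="kubeflow"):
#     meta = event["object"].metadata
#     found = lineage_annotations(meta.annotations)
#     if found:
#         print(meta.name, found)  # e.g. record upstream artifact locations
```

The nice property is that pipeline authors only have to annotate; the watcher does the bookkeeping.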

ca-scribner commented 4 years ago

Yeah, that makes sense. Interested to see what you put together. Then without too much effort you can automatically spot upstream artifact locations (the watcher can see pipeline inputs) and component/code versions.

Re DVC: I'm going to play with that more next. They used to have their data lineage hard-wired to a git repo for your code, which made tracking deployment kind of tricky, but their new release in the last month is supposed to have made it more portable, so I'll experiment with it.
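For reference, the portability angle: newer DVC releases expose a Python API that can resolve data from any repo/revision, rather than only the repo you're working in. A sketch of what a portable lineage record might look like (the repo URL, commit, and path are placeholders, and the dvc calls need DVC installed, so they're commented out):

```python
from typing import NamedTuple

class DataRef(NamedTuple):
    """A portable pointer to one versioned artifact: which repo,
    which commit, which path. Our own convenience type, not DVC's."""
    repo: str
    rev: str
    path: str

    def label(self) -> str:
        """Short human-readable label, e.g. for a pipeline run's metadata."""
        return f"{self.path}@{self.rev[:7]} ({self.repo})"

ref = DataRef(
    repo="https://github.com/example/data-repo",  # placeholder repo
    rev="a1b2c3d4e5f6a7b8",                       # placeholder commit
    path="data/train.csv",                        # placeholder path
)

# With DVC installed, the same triple resolves to real data from outside
# the repo, which is what makes the lineage portable:
# import dvc.api
# url = dvc.api.get_url(ref.path, repo=ref.repo, rev=ref.rev)
# with dvc.api.open(ref.path, repo=ref.repo, rev=ref.rev) as f:
#     ...
```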


sylus commented 4 years ago

I think this is done as a PoC in https://github.com/statcan/kubeflow-mlops.