StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Investigate Data Lineage/Snapshotting Options #148

Closed ca-scribner closed 2 years ago

ca-scribner commented 4 years ago

Epic #57

Look into tools/workflows that allow for traceable data lineage. Some examples:

ca-scribner commented 4 years ago

After doing some research I feel a little uncertain about the user stories we're trying to address with these tools. @brendangadd @blairdrummond @chritter what are your thoughts?

Some possible examples:

What else?

ca-scribner commented 4 years ago

For version control of the code behind the artifacts, there are some good examples of CI-driven version control here. They don't feel lightweight (they take some setup, and users would need to learn how to configure them), but they do feel comprehensive (all containers/yamls/everything else are well controlled by CI). We could also adapt this to run from a script rather than on CI commit if that's preferable.

The downside is that these patterns rely on custom docker images and access to a cloud container builder, which could be a problem for us. We could probably use the same style of approach with lightweight containers, though. Maybe locally annotate the pipeline.py with the github SHA on launch, or even just accept the github SHA as a variable (and then put the SHA in the pipeline version info shown in the kfp UI).
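The "accept the github SHA as a variable" idea could be sketched roughly like this. The pipeline name and the commented-out upload step are illustrative only (uploading needs a live KFP cluster and the kfp 1.x SDK), not anything we have built:

```python
import re
import subprocess

def current_git_sha() -> str:
    """Short SHA of HEAD; assumes this script runs inside a git checkout."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def pipeline_version_name(pipeline: str, sha: str) -> str:
    """Build a version name like 'train-pipeline-a1b2c3d' so the SHA
    shows up in the KFP UI's pipeline version info."""
    if not re.fullmatch(r"[0-9a-f]{4,40}", sha):
        raise ValueError(f"not a git SHA: {sha!r}")
    return f"{pipeline}-{sha[:7]}"

# Hypothetical upload step (requires a KFP cluster; kfp 1.x SDK):
# import kfp
# client = kfp.Client()
# client.upload_pipeline_version(
#     "pipeline.yaml",
#     pipeline_version_name("train-pipeline", current_git_sha()),
#     pipeline_name="train-pipeline",
# )
```

With this, anyone looking at a run in the KFP UI can trace the pipeline version name straight back to a commit.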

sylus commented 4 years ago

I think we can make this simpler by using Kaniko and letting users develop their own images internally (I know we will need some guardrails).

@zachomedia and I are doing a proof-of-concept example covering data lineage and run comparison, and sending the model to MLflow.

We are basing our research on:

https://github.com/kaizentm/kubemlops
https://github.com/kaizentm/kubemlops/blob/master/docs/mlops-github.md

Kaniko example:

https://github.com/kerstinpittl/kubeflow-workshop/blob/master/notebooks/02_Fairing/02_04_fairing_kaniko_cloud_builder.ipynb

DVC

I also like the new CML they launched, so I will try to get a PoC going for this as well.

We're making progress; this should be done by the next sprint retrospective.

ca-scribner commented 4 years ago

Kaniko is for simplifying the container build process, right? But then we still have other parts of the system keeping track of artifact/data lineage, remembering which containers were used in a pipeline, etc.?

sylus commented 4 years ago

Yup, but that is what the data lineage tracks if you set it up with the KFP pipeline or have a custom watcher looking for certain annotations. I'm going to try to show this as a working example, and then it will make more sense.
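A minimal sketch of the annotation-filtering side of such a watcher. The annotation prefix is a made-up convention (nothing agreed on here), and the actual pod-watch loop needs a cluster, so it's left as a comment:

```python
def lineage_annotations(annotations, prefix="data-lineage/"):
    """Pull out the annotations a lineage watcher would care about.
    The `data-lineage/` prefix is hypothetical, not an agreed-on key."""
    annotations = annotations or {}
    return {
        key[len(prefix):]: value
        for key, value in annotations.items()
        if key.startswith(prefix)
    }

# Hypothetical watch loop using the official kubernetes Python client:
# from kubernetes import client, config, watch
# config.load_incluster_config()
# v1 = client.CoreV1Api()
# for event in watch.Watch().stream(v1.list_namespaced_pod, namespace="kubeflow"):
#     meta = event["object"].metadata
#     found = lineage_annotations(meta.annotations)
#     if found:
#         print(meta.name, found)  # e.g. record upstream artifact locations
```

The nice property is that pipeline authors only have to annotate; the watcher does the bookkeeping.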

ca-scribner commented 4 years ago

Yeah, that makes sense. Interested to see what you put together. Then without too much effort you can automatically spot upstream artifact locations (the watcher can see pipeline inputs) and component/code versions.

Re DVC: I'm going to play with that more next. They used to have their data lineage hard-wired to a git repo for your code, which made tracking deployment kind of tricky, but their new release in the last month is supposed to have made it more portable, so I'll experiment with it.
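For reference, the portability angle: newer DVC releases expose a Python API that can resolve data from any repo/revision, rather than only the repo you're working in. A sketch of what a portable lineage record might look like (the repo URL, commit, and path are placeholders, and the dvc calls need DVC installed, so they're commented out):

```python
from typing import NamedTuple

class DataRef(NamedTuple):
    """A portable pointer to one versioned artifact: which repo,
    which commit, which path. Our own convenience type, not DVC's."""
    repo: str
    rev: str
    path: str

    def label(self) -> str:
        """Short human-readable label, e.g. for a pipeline run's metadata."""
        return f"{self.path}@{self.rev[:7]} ({self.repo})"

ref = DataRef(
    repo="https://github.com/example/data-repo",  # placeholder repo
    rev="a1b2c3d4e5f6a7b8",                       # placeholder commit
    path="data/train.csv",                        # placeholder path
)

# With DVC installed, the same triple resolves to real data from outside
# the repo, which is what makes the lineage portable:
# import dvc.api
# url = dvc.api.get_url(ref.path, repo=ref.repo, rev=ref.rev)
# with dvc.api.open(ref.path, repo=ref.repo, rev=ref.rev) as f:
#     ...
```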


sylus commented 4 years ago

I think this is done as a PoC in https://github.com/statcan/kubeflow-mlops.