Kubeflow Pipelines seems like a viable option for fairly complex workflows, but it has a steep learning curve and struggles with some workflow shapes (such as map/reduce). Getting regular users to build with it successfully will take significant guidance from the dev team (best-practices manuals, lots of examples, and reusable components for common tasks).
Pros
Well integrated with Kubernetes. Lots of flexibility if you know what you're doing.
Handles fairly complex workflows well (easy to express any DAG you can think of, including conditionals).
Supports reusable components (eg: a get_data YAML component shared across N pipelines, rather than repeating a get_data() code block at the top of each pipeline), although this is non-obvious to a new user. We'd need to encourage the pattern.
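For concreteness, a reusable component is a small YAML spec that wraps a container. This sketch is illustrative only — the component name, image, and script are hypothetical, but the structure (inputs/outputs plus a container implementation with inputValue/outputPath placeholders) is the KFP component format:

```yaml
# Hypothetical get_data.yaml -- names and image are ours, not KFP's
name: Get data
description: Fetches a dataset once, so N pipelines don't each reimplement it
inputs:
  - {name: source_url, type: String}
outputs:
  - {name: data, type: String}
implementation:
  container:
    image: our-registry/get-data:latest   # assumption: an image we'd publish
    command: [python, get_data.py]
    args:
      - --source-url
      - {inputValue: source_url}
      - --output-path
      - {outputPath: data}
```

A pipeline author would then pull it in with kfp.components.load_component_from_file("get_data.yaml") instead of copy/pasting code — this is the pattern we'd want our examples to model.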
Easy to schedule pipelines. Can also trigger pipelines from the command line.
Cons
Steep learning curve (can mitigate with good examples, but you can't drop a beginner programmer into it and expect success).
The SDK feels like a project under heavy development that has changed course a few times. Would require lots of user support, especially for an early community.
Data exchange between pipeline steps feels harder than it needs to be. It is possible, but we'd need clear best-practice guidelines; most users would feel overwhelmed without them.
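The underlying mechanism is simple but unintuitive: each step writes its result to a file path the orchestrator injects, and downstream steps read from that path. A minimal plain-Python sketch of that handoff pattern (function names here are ours, not KFP's):

```python
import json
import tempfile
from pathlib import Path

# Sketch of the file-based handoff KFP uses between pipeline steps:
# a producer writes its result to an output path, and the orchestrator
# feeds that same path to the consumer as an input.

def producer(output_path: str) -> None:
    """Step 1: write a small JSON result to the agreed path."""
    Path(output_path).write_text(json.dumps({"rows": 1000}))

def consumer(input_path: str) -> int:
    """Step 2: read the upstream step's output back in."""
    return json.loads(Path(input_path).read_text())["rows"]

# The "orchestrator": wire one step's output file into the next step.
with tempfile.TemporaryDirectory() as tmp:
    handoff = str(Path(tmp) / "rows.json")
    producer(handoff)
    rows = consumer(handoff)

print(rows)  # 1000
```

Hiding this plumbing behind well-documented helpers (or reusable components) is exactly the kind of guidance users would need from us.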
Support for things like arbitrary parallelism is weak (but a known weakness with some discussion on the boards). Other complex workflows might be similar. For example:
Easy: ingesting a file of arbitrary length, splitting it into 5 equal chunks, and processing them in parallel on different pods
Hard: ingesting a file of arbitrary length, splitting it into N chunks of 1000 lines each, and processing them in parallel on different pods
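The chunking logic itself is trivial in plain Python; what makes the second case hard in KFP is that the number of chunks is only known at runtime, so the fan-out can't be expressed as a static DAG. A sketch of the chunking step (helper name is ours):

```python
def chunk_lines(lines, chunk_size=1000):
    """Split a list of lines into chunks of at most chunk_size lines.
    The chunk count depends on the input length, which is only known
    at runtime -- exactly what a statically-declared DAG can't express."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

chunks = chunk_lines([f"line-{i}" for i in range(2500)])
print(len(chunks))             # 3
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Dispatching each chunk to its own pod is then the part that needs KFP-specific machinery (and where the boards suggest support is weak).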
Questions / Todo
Are there triggers? Eg: run pipeline X with args A, B, C when event Z happens. Could maybe emulate this with the kfp CLI tools.
What is the model-serving support like? It exists, but it's not clear whether it would be useful to us.
Need to investigate artifact browsing/lineage tracking in kubeflow. This training pdf (search "artifact tracking" and "lineage tracking" - sorry, no slide numbers!) shows some interesting data lineage features. Maybe useful to us?
Epic: #134