Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

kedro-airflow support #44

Closed mwiewior closed 2 years ago

mwiewior commented 4 years ago

Hi - is anyone currently working on integration with kedro-airflow (or pipeline scheduling in general)? I've got it working, but the problem is that each task within a DAG is tracked under a separate run id, which of course does not make much sense here. I'm thinking of adding a feature to track the whole pipeline under the same run id when it is scheduled with airflow. Any comments or hints on how to approach that are more than welcome!

Galileo-Galilei commented 4 years ago

Hello @mwiewior ,

glad to see that you are trying the plugin out. What you describe is a common problem in how mlflow and airflow interoperate, and unfortunately it is hardly related to the kedro-mlflow plugin itself. We may have some way to work around it (for instance, create a child class that inherits from the KedroAirflowRunner), but it will likely not be straightforward and may have side effects that we should discuss in detail before we decide to support it. I will make a post with a more detailed analysis by the end of the week, but can you share your workflow in a detailed way so I can see how you use kedro, kedro-mlflow and kedro-airflow together? It will help a lot in finding a quick solution to your problem (even if it is not the most sustainable one in the long run).

szczeles commented 3 years ago

@Galileo-Galilei We faced the same issue while developing the kedro-kubeflow plugin (Kubeflow Pipelines is a scheduler, like Airflow, but it runs every node as a separate Kubernetes pod).

The solution we have is quite simple: one additional step that runs nothing else than mlflow.start_run and exposes the run id as the MLFLOW_RUN_ID environment variable. Then, if the run id in the kedro-mlflow config is set to None, the value from the environment is used, so all metrics/parameters/artifacts are logged within one run, as expected:

https://kedro-kubeflow.readthedocs.io/en/0.3.0/source/03_getting_started/03_mlflow.html
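The mechanism described above can be sketched as follows. This is a minimal stand-in using only the standard library, not the actual kedro-kubeflow code: `init_mlflow_run` and `resolve_run_id` are hypothetical names, and the uuid stands in for the run id that `mlflow.start_run()` would generate.

```python
import os
import uuid

def init_mlflow_run():
    """First task in the DAG: start a run once and expose its id.
    (Hypothetical sketch: the real kedro-kubeflow step calls
    mlflow.start_run() and exports the resulting run id.)"""
    run_id = uuid.uuid4().hex  # stand-in for the id mlflow would generate
    os.environ["MLFLOW_RUN_ID"] = run_id
    return run_id

def resolve_run_id(configured_run_id=None):
    """Any later task: when the run id in the kedro-mlflow config is None,
    fall back to the MLFLOW_RUN_ID environment variable, so every node
    logs to the same run."""
    return configured_run_id or os.environ.get("MLFLOW_RUN_ID")

first = init_mlflow_run()
print(resolve_run_id(None) == first)  # → True: all nodes share one run
```

The key design point is that mlflow itself honours MLFLOW_RUN_ID, so downstream tasks need no explicit coordination beyond inheriting the environment.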

Do you think this method could work for kedro-airflow as well, or could there be other issues?

Galileo-Galilei commented 3 years ago

Hi @szczeles, sorry for the late reply.

First of all, kudos to you guys for kedro-kubeflow! I have seen the development and I looked at how you handle mlflow configuration with this specific issue in mind ;)

Basically, you add a node which plays the role of the "before_pipeline_run" hook. I am not sure that it can work for Airflow, but keep in mind I have almost never used it, so it's hard to be really assertive.

There are a few things to consider:

I must confess that since I do not use airflow as a scheduler in my day-to-day life, this has not been of paramount importance to me. I plan to support airflow, but I have no timeline to provide right now; I'd say likely not before this summer. PRs are welcome if you really need it soon!

P.S: This might be a bit off topic and is more general than this specific issue, but I have seen that almost all kedro plugins or tutorials for deployment to an orchestrator (airflow, kubeflow, prefect, argo...) tend to simply convert the kedro pipeline into another pipeline in the target tool. I don't feel this is the right way to deploy ml applications in general, because kedro pipelines often contain a lot of nodes with very small operations where no data is persisted between nodes, especially for the ML pipelines. From an orchestration point of view, such a pipeline is likely a single node that must be executed once, possibly on dedicated infra (GPU...), while other pipelines (for heavy feature extraction or engineering) might need a different infra / orchestration timeline. In a nutshell, I don't think there is an exact mapping between kedro nodes (designed by a data scientist for code readability, easy debugging, partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). kedro nodes are much more low-level in my opinion than orchestrator nodes.

limdauto commented 3 years ago

@Galileo-Galilei I completely agree with this assessment.

In a nutshell, I don't think there is an exact mapping between kedro nodes (designed by a data scientist for code readability, easy debugging, partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). kedro nodes are much more low-level in my opinion than orchestrator nodes.

The guides are more of a tutorial than anything else on converting a Kedro pipeline to the target orchestration platform's primitives. The reason why we didn't make all of them into plugins was precisely because of what you say here: how you slice your pipeline and map the slices to the orchestrator's nodes is up to you. A good pattern is using tags: you can map a set of nodes tagged with data_engineering and deploy it differently from the set of nodes tagged with data_science. Or you can map them based on namespace if you construct your pipeline out of modular pipelines.
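As an illustration of the tag-based slicing idea, here is a toy sketch. It is not kedro's actual Pipeline class; the `Node` dataclass, the node names, and the tags are all invented, and `only_nodes_with_tags` merely mirrors the semantics of kedro's `Pipeline.only_nodes_with_tags` (keep nodes carrying at least one of the requested tags).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy stand-in for a kedro node; only the tags matter here."""
    name: str
    tags: set = field(default_factory=set)

def only_nodes_with_tags(nodes, *tags):
    """Keep nodes carrying at least one of the given tags,
    mirroring kedro's Pipeline.only_nodes_with_tags."""
    wanted = set(tags)
    return [n for n in nodes if n.tags & wanted]

pipeline = [
    Node("clean_raw_data", {"data_engineering"}),
    Node("join_tables", {"data_engineering"}),
    Node("train_model", {"data_science"}),
]

# Each tag-based slice could become one orchestrator node,
# deployed on infra suited to its workload.
de_slice = [n.name for n in only_nodes_with_tags(pipeline, "data_engineering")]
print(de_slice)  # → ['clean_raw_data', 'join_tables']
```

The point is that the granularity of the orchestrator DAG is chosen by the deployer, not dictated by the kedro node graph.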

Regarding Airflow & Mlflow, let me take a stab at it this weekend. We are actually in active discussion with the Airflow team. I would love to showcase Kedro x Mlflow on Airflow, maybe using your excellent plugin.

Galileo-Galilei commented 3 years ago

Thank you for your instructive comment (and the nice feedback on the plugin!) sharing the Kedro team's point of view on this. I had this intuition when I saw that you chose tutorials rather than plugins, as you explain.

This is completely in line with how I envision ML deployment: you can read in the README of this example project how I suggest an organisation in 3 "apps", each containing several Kedro pipelines which are constructed by filtering out a bigger pipeline with tags (but the same applies with namespaces). This is not explicit in the example, but in my mind the objects which are going to be turned into orchestrator nodes are these smaller pipelines.

If you want to read more on this, we have a documentation PR open which describes how one can use kedro-mlflow as a mlops framework. It is quite theoretical (and maybe more suited to a blog post than to technical documentation) and focuses on the fact that we need to synchronize training and inference development, which is a big issue in ML (but not our point here); however, the underlying proposed architecture is always described with a deployment to an orchestrator in mind. For the record, this is very close to how my team deploys its kedro projects in the real world, at least for the underlying principles.

I plan to open an issue on Kedro's repo to give feedback on deployment strategies and suggest some doc design for deployment, but I need to think it through and design it carefully, so it can take weeks (months?).

To get back to this topic, I'd be glad to come up with a solution for the interaction between the 3 tools, since it seems to be a "hot topic" for some users. It's even better if this solution is supported by Kedro's & Airflow's core teams! I'd love to see what you come up with and I'll support it as much as I can, so I'll wait for your feedback.

Galileo-Galilei commented 2 years ago

Hi everyone,

given all the discussions above and after much thought:

P.S: @limdauto if you come up with something you want to share after your discussions with the Airflow team, feel free to reopen the issue