Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

kedro-airflow support #44

Closed mwiewior closed 2 years ago

mwiewior commented 4 years ago

Hi - is anyone currently working on integration with kedro-airflow (or pipeline scheduling in general)? I've got it working, but the problem is that each task within a DAG is tracked under a separate run id, which of course does not make much sense here. I'm thinking of adding a feature to track the whole pipeline under the same run id when it is scheduled with airflow. Any comments or hints on how to approach that are more than welcome!

Galileo-Galilei commented 4 years ago

Hello @mwiewior ,

glad to see that you are trying the plugin out. What you describe is a common problem in how mlflow and airflow interoperate, and unfortunately it is hardly related to the kedro-mlflow plugin itself. We may have some way to work around it (for instance, create a child class that inherits from the KedroAirflowRunner), but it will likely not be straightforward and may have side effects that we should discuss in detail before we decide to support it. I will make a post with a more detailed analysis by the end of the week, but can you share your workflow in a detailed way so I can see how you use kedro, kedro-mlflow and kedro-airflow together? It will help a lot in finding a quick solution to your problem (even if it is not the most sustainable one in the long run).

szczeles commented 3 years ago

@Galileo-Galilei We faced the same issue while developing the kedro-kubeflow plugin (Kubeflow Pipelines is a scheduler, like Airflow, but it runs every node as a separate Kubernetes pod).

The solution we have is quite simple: one additional step that runs nothing else than mlflow.start_run and exposes the run id as the MLFLOW_RUN_ID environment variable. Then, if the run id in the kedro-mlflow config is set to None, the value from the environment is used, so all metrics/parameters/artifacts are logged within one run, as expected:

https://kedro-kubeflow.readthedocs.io/en/0.3.0/source/03_getting_started/03_mlflow.html
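The mechanism described above can be sketched as follows. This is a minimal stand-in using only the standard library, not the actual kedro-kubeflow code: `init_mlflow_run` and `resolve_run_id` are hypothetical names, and the uuid stands in for the run id that `mlflow.start_run()` would generate.

```python
import os
import uuid

def init_mlflow_run():
    """First task in the DAG: start a run once and expose its id.
    (Hypothetical sketch: the real kedro-kubeflow step calls
    mlflow.start_run() and exports the resulting run id.)"""
    run_id = uuid.uuid4().hex  # stand-in for the id mlflow would generate
    os.environ["MLFLOW_RUN_ID"] = run_id
    return run_id

def resolve_run_id(configured_run_id=None):
    """Any later task: when the run id in the kedro-mlflow config is None,
    fall back to the MLFLOW_RUN_ID environment variable, so every node
    logs to the same run."""
    return configured_run_id or os.environ.get("MLFLOW_RUN_ID")

first = init_mlflow_run()
print(resolve_run_id(None) == first)  # → True: all nodes share one run
```

The key design point is that mlflow itself honours MLFLOW_RUN_ID, so downstream tasks need no explicit coordination beyond inheriting the environment.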

Do you think this method could work for kedro-airflow as well, or could there be other issues?

Galileo-Galilei commented 3 years ago

Hi @szczeles, sorry for the late reply.

First of all, kudos to you guys for kedro-kubeflow! I have seen the development and I looked at how you handle mlflow configuration with this specific issue in mind ;)

Basically, you add a node which plays the role of the "before_pipeline_run" hook. I am not sure that it can work for Airflow, but keep in mind I have almost never used it, so it's hard to be really assertive.

There are a few things to consider:

I must confess that since I do not use airflow as a scheduler in my day-to-day life, this has not been of paramount importance to me. I plan to support airflow, but I have no timeline to provide right now; I'd say likely not before this summer. PRs are welcome if you really need it soon!

P.S: This might be a bit off topic and is more general than this specific issue, but I have seen that almost all kedro plugins or tutorials for deployment to an orchestrator (airflow, kubeflow, prefect, argo...) tend to simply convert the kedro pipeline into another pipeline in the target tool. I don't feel this is the right way to deploy ml applications in general, because kedro pipelines often contain a lot of nodes with very small operations where no data is persisted between nodes, especially for the ML pipelines. From an orchestration point of view, such a pipeline is likely a single node that must be executed once, possibly on dedicated infra (GPU...), while other pipelines (for heavy feature extraction or engineering) might need a different infra / orchestration timeline. In a nutshell, I don't think there is an exact mapping between kedro nodes (designed by a data scientist for code readability, easy debugging, partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). kedro nodes are much more low-level in my opinion than orchestrator nodes.

limdauto commented 3 years ago

@Galileo-Galilei I completely agree with this assessment.

In a nutshell, I don't think there is an exact mapping between kedro nodes (designed by a data scientist for code readability, easy debugging, partial execution...) and orchestrator nodes (designed for system robustness, ease of asynchronous execution, retry strategies, efficient compute...). kedro nodes are much more low-level in my opinion than orchestrator nodes.

The guides are more of a tutorial than anything else on converting a Kedro pipeline to the target orchestration platform's primitives. The reason why we didn't make all of them into plugins was precisely because of what you say here: how you slice your pipeline and map the slices to the orchestrator's nodes is up to you. A good pattern is using tags: you can map a set of nodes tagged with data_engineering and deploy it differently from the set of nodes tagged with data_science. Or you can map them based on namespace if you construct your pipeline out of modular pipelines.
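As an illustration of the tag-based slicing idea, here is a toy sketch. It is not kedro's actual Pipeline class; the `Node` dataclass, the node names, and the tags are all invented, and `only_nodes_with_tags` merely mirrors the semantics of kedro's `Pipeline.only_nodes_with_tags` (keep nodes carrying at least one of the requested tags).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy stand-in for a kedro node; only the tags matter here."""
    name: str
    tags: set = field(default_factory=set)

def only_nodes_with_tags(nodes, *tags):
    """Keep nodes carrying at least one of the given tags,
    mirroring kedro's Pipeline.only_nodes_with_tags."""
    wanted = set(tags)
    return [n for n in nodes if n.tags & wanted]

pipeline = [
    Node("clean_raw_data", {"data_engineering"}),
    Node("join_tables", {"data_engineering"}),
    Node("train_model", {"data_science"}),
]

# Each tag-based slice could become one orchestrator node,
# deployed on infra suited to its workload.
de_slice = [n.name for n in only_nodes_with_tags(pipeline, "data_engineering")]
print(de_slice)  # → ['clean_raw_data', 'join_tables']
```

The point is that the granularity of the orchestrator DAG is chosen by the deployer, not dictated by the kedro node graph.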

Regarding Airflow & Mlflow, let me take a stab at it this weekend. We are actually in active discussion with the Airflow team. I would love to showcase Kedro x Mlflow on Airflow, maybe using your excellent plugin.

Galileo-Galilei commented 3 years ago

Thank you for your instructive comment (and the nice feedback on the plugin!) sharing the Kedro team's point of view on this. I had this intuition when I saw that you chose tutorials rather than plugins, as you explain.

This is completely in line with how I envision ML deployment: you can read in the README of this example project how I suggest an organisation in 3 "apps", each containing several Kedro pipelines which are constructed by filtering out a bigger pipeline with tags (but the same applies with namespaces). This is not explicit in the example, but in my mind the objects which are going to be turned into orchestrator nodes are these smaller pipelines.

If you want to read more on this, we have a documentation PR open which describes how one can use kedro-mlflow as a mlops framework. It is quite theoretical (and maybe more suited to a blog post than to technical documentation) and focuses on the fact that we need to synchronize training and inference development, which is a big issue in ML (but not our point here); however, the underlying proposed architecture is always described with a deployment to an orchestrator in mind. For the record, this is very close to how my team deploys its kedro projects in the real world, at least for the underlying principles.

I plan to open an issue on Kedro's repo to give feedback on deployment strategies and suggest some doc design for deployment, but I need to think it through and design it carefully, so it can take weeks (months?).

To get back to this topic, I'd be glad to come up with a solution for the interaction between the 3 tools, since it seems to be a "hot topic" for some users. It's even better if this solution is supported by Kedro's & Airflow's core teams! I'd love to see what you come up with and I'll support it as much as I can, so I'll wait for your feedback.

Galileo-Galilei commented 2 years ago

Hi everyone,

given all the discussions above and after much thought:

P.S: @limdauto if you come up with something you want to share after your discussions with the Airflow team, feel free to reopen the issue