allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.61k stars 651 forks

How to combine ClearML with Kedro #716

Open Make42 opened 2 years ago

Make42 commented 2 years ago

We are currently evaluating products for our MLOps infrastructure, and ClearML, Kedro, and MLRun are on our shortlist. We are considering combining ClearML with Kedro. They are similar in purpose but differ in their features once you look at the details. For example, Kedro has hooks for tasks that implement cross-cutting concerns. Someone on Reddit wrote:

"In case we use both Kedro and ClearML, we'll have to figure out how to integrate its pipelines with ClearML tasks. But in the slack channel of ClearML there are other teams doing the same, so at least it's possible."

So I would like to know how this would be done on both levels:

  1. Conceptually
  2. Pragmatically (meaning programming)

thepycoder commented 2 years ago

This heavily depends on what features from Kedro you want to use and which features from ClearML, but in general ClearML tools are pretty modular.

E.g. You could keep your feature store as a clearml-data versioned dataset and then use the clearml SDK to access it from within a Kedro node/pipeline.
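
A minimal sketch of that first idea, assuming a dataset has already been registered with clearml-data. The project name, dataset name, and function name are hypothetical placeholders, and clearml is imported inside the function so the module stays importable where ClearML isn't installed:

```python
from pathlib import Path


def load_feature_store(dataset_project="my_project", dataset_name="feature_store"):
    """Kedro node function: fetch a ClearML-versioned dataset and list its files.

    The project and dataset names are hypothetical placeholders.
    """
    # Imported lazily so this module can be imported without clearml installed.
    from clearml import Dataset

    # Look the dataset up on the ClearML server by project and name.
    dataset = Dataset.get(dataset_project=dataset_project, dataset_name=dataset_name)

    # Download (or reuse a cached) read-only copy and get the local folder path.
    local_dir = Path(dataset.get_local_copy())

    # Return the relative paths of all files so downstream nodes can pick what they need.
    return sorted(
        str(path.relative_to(local_dir))
        for path in local_dir.rglob("*")
        if path.is_file()
    )
```

In a Kedro pipeline this could be wired in with something like node(load_feature_store, inputs=None, outputs="feature_files"), where the output name is again just an illustration.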

or

You could track the version history of each Kedro node by simply tracking the node's code using the clearml experiment manager.

Heck, you could probably even run Kedro nodes as clearml tasks that are then remotely executed by clearml agents, simply by adding

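from clearml import Task
# Task.init is a class method and returns the task object
task = Task.init(project_name="my_project", task_name="my_task")
# the queue is passed as queue_name
task.execute_remotely(queue_name="default")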

to the python function that will be turned into a kedro node. When running the kedro pipeline, it will run the underlying python function, which in turn will register itself as a clearml task to be added to a clearml queue and executed by a clearml agent.
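
To make that concrete, here is an untested sketch of what such a node function might look like. The function name, the project and task names, and the trivial "training" body are made up for illustration. One thing to keep in mind: as far as I understand the ClearML docs, execute_remotely by default exits the local process after enqueuing the task, and is a no-op when the code is already running on an agent.

```python
def train_model(features):
    """Hypothetical Kedro node function that hands itself over to a ClearML agent."""
    # Imported lazily so this module can be imported without clearml installed.
    from clearml import Task

    # Register this function's run as a ClearML task.
    task = Task.init(project_name="my_project", task_name="train_model")

    # On a local run: enqueue the task on the "default" queue and, by default,
    # stop the local process. On the agent: does nothing, execution continues below.
    task.execute_remotely(queue_name="default")

    # Placeholder for the real work of the node.
    return {"n_rows": len(features)}
```

On the Kedro side the function would be registered as usual, e.g. node(train_model, inputs="features", outputs="model_summary"), with both dataset names being placeholders.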

The point is that both tools are open-source pip packages. ClearML in particular does not force you to change any code or structure it in a particular way, so you should be able to add ClearML features wherever you want them just by using the pip package.

That said, I have not tested any of these things! Based on my knowledge of clearml they should be possible, but I'm not very familiar with the inner workings of a Kedro pipeline, so take this with a grain of salt :)

noklam commented 2 years ago

It depends on what kind of features you are looking into.

Remote execution of a Kedro pipeline shouldn't be a problem. Experiment tracking should also work pretty well: you can register these Tasks with before_pipeline_xxxx hooks, etc. Kedro is a CLI-first library, and most users start by running their pipeline with kedro run, but you can just as easily run the pipeline through the Python API by creating a session. This is well supported, since it is the recommended way to run a pipeline in a notebook environment.
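
For reference, running a pipeline through a session looks roughly like the sketch below. The wrapper function name is made up, and the exact KedroSession.create arguments have changed between Kedro versions, so treat this as an assumption to check against the version you use:

```python
from pathlib import Path


def run_kedro_pipeline(project_path, pipeline_name=None):
    """Run a Kedro pipeline from Python, roughly equivalent to `kedro run`."""
    # Imported lazily so this module can be imported without Kedro installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path)

    # Make the project's settings and pipelines importable.
    bootstrap_project(project_path)

    # pipeline_name=None runs the project's default pipeline.
    with KedroSession.create(project_path=project_path) as session:
        return session.run(pipeline_name=pipeline_name)
```

Because this is plain Python, a ClearML Task.init call could be placed right before the function is invoked, although hooks are likely the tidier place for it.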

I would start with the CLI and use hooks for the clearml features you need, and only use a session if it's necessary.
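
A rough sketch of the hook approach, using Kedro's before_pipeline_run and after_pipeline_run hook specs. The class name, project name, and task naming scheme are hypothetical, and the try/except fallback only exists so the snippet can be loaded without Kedro installed:

```python
try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so this sketch can be loaded where Kedro isn't installed.
    def hook_impl(func):
        return func


class ClearMLHooks:
    """Hypothetical hook class: one ClearML task per Kedro pipeline run."""

    def __init__(self, project_name="my_project"):
        self.project_name = project_name
        self.task = None

    @hook_impl
    def before_pipeline_run(self, run_params):
        # Imported lazily so this module can be imported without clearml installed.
        from clearml import Task

        # Kedro passes None as the pipeline name when the default pipeline runs.
        pipeline_name = run_params.get("pipeline_name") or "__default__"
        self.task = Task.init(
            project_name=self.project_name,
            task_name=f"kedro-{pipeline_name}",
        )

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Close the task once the pipeline has finished successfully.
        if self.task is not None:
            self.task.close()
            self.task = None
```

The class would then be registered in the project's settings.py with HOOKS = (ClearMLHooks(),), after which a plain kedro run picks it up.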

Data versioning may be the trickier part. I am not too familiar with how clearml does this, and Kedro also comes with its own data versioning feature. It may make sense to not enable this in Kedro and simply delegate it to clearml.
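
If versioning is delegated to clearml, the Kedro catalog entries would simply be left unversioned and a node (or a hook) could publish the output folder as a new clearml-data version. An untested sketch, with a made-up function name and placeholder project and dataset names:

```python
def publish_dataset_version(folder, dataset_project="my_project", dataset_name="feature_store"):
    """Publish the contents of `folder` as a new ClearML dataset version.

    The project and dataset names are hypothetical placeholders.
    """
    # Imported lazily so this module can be imported without clearml installed.
    from clearml import Dataset

    # Create a new dataset entry on the ClearML server.
    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)

    # Register the files, upload them, then lock this version against further changes.
    dataset.add_files(path=str(folder))
    dataset.upload()
    dataset.finalize()

    # The id can be stored or logged so later runs can fetch this exact version.
    return dataset.id
```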