allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

Consider clarifying, is this an alternative for Kubeflow? #32

Open austinkeller opened 3 years ago

austinkeller commented 3 years ago

As a dummy who is evaluating different options for ML Ops, I don't have a full picture of how Kubeflow works. Does trains-agent integrate with Kubeflow? Or is it a more R&D-friendly replacement?

bmartinn commented 3 years ago

Hi @austinkeller

> Or is it a more R&D-friendly replacement?

Kind of, but it also integrates with Kubeflow :)

Specifically, Kubeflow assumes all steps are self-contained containers, and that data can be volume mounted, etc. In this respect, trains-agent solves the containerization problem and adds logging to the process.

To understand how trains works, the usual development steps are:

  1. Write code on a "local" machine. With trains, all the code/environment/arguments are logged (along with a few other things that are less relevant to our case).
  2. Clone the experiment in the UI (or from code / automation).
  3. Put the experiment into an execution queue (the trains scheduler, which also includes priorities etc.; its UI is part of the system UI, see trains-server).
  4. trains-agent, running on a remote machine in a daemon setup, pulls the experiment from the execution queue, sets up the environment accordingly, and launches and monitors the process.
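
To make the flow in steps 1-4 concrete, here is a toy model in plain Python. It does not use the trains API at all: `Experiment`, `ExecutionQueue`, and `agent_step` are hypothetical names invented for illustration, and the repository URL and arguments are placeholders. It only shows the shape of the idea: a recorded experiment is cloned with new arguments, queued with a priority, and pulled by an agent.

```python
import heapq
import itertools
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record of what step 1 logs: code, environment, arguments."""
    name: str
    repo: str
    requirements: tuple
    args: dict = field(default_factory=dict)


class ExecutionQueue:
    """Toy priority queue; a lower number means a higher priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def enqueue(self, experiment, priority=0):
        heapq.heappush(self._heap, (priority, next(self._order), experiment))

    def pull(self):
        """Return the next experiment, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def agent_step(queue):
    """One iteration of a hypothetical agent loop (step 4)."""
    experiment = queue.pull()
    if experiment is None:
        return None
    # A real agent would set up the environment and launch the process here.
    return f"running {experiment.name} with {experiment.args}"


# Step 1: the experiment as recorded from a local run (placeholder values).
original = Experiment("train-v1", "git@example.com:team/model.git",
                      ("torch",), {"lr": 0.1})
# Step 2: clone it, changing only the arguments.
clone = replace(original, name="train-v1-clone", args={"lr": 0.01})
# Step 3: enqueue both; the clone gets the higher priority.
queue = ExecutionQueue()
queue.enqueue(original, priority=1)
queue.enqueue(clone, priority=0)
# Step 4: the agent pulls the highest-priority experiment first.
first = agent_step(queue)
second = agent_step(queue)
```

The point of the model is that the agent never needs the original developer machine: everything it needs to reproduce the run travels with the queued experiment.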

Back to Kubeflow: since creating the experiment is done automatically (see step 1, trains records the environment and creates the experiment at runtime), trains-agent can build a docker container for the experiment, to be used later by Kubeflow. This makes the packaging a lot easier (see trains-agent build --docker). You can make it even lighter and use trains-agent to set up and launch an experiment without packaging it at all, by using a base container and letting trains-agent set up everything inside the container (see trains-agent execute).
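
As a rough sketch of the two options above, the helpers below only assemble the command lines as argument lists; nothing is executed. The thread itself only names `trains-agent build --docker` and `trains-agent execute`. The `--id` and `--target` flags, the `TASK_ID` placeholder, and the image name are assumptions here, so check them against `trains-agent build --help` and `trains-agent execute --help` before relying on them.

```python
import shlex


def build_docker_command(task_id, image_name):
    """Command to package an experiment into a docker image.

    --id and --target are assumed flags; verify with `trains-agent build --help`.
    """
    return ["trains-agent", "build", "--id", task_id,
            "--docker", "--target", image_name]


def execute_command(task_id):
    """Command to set up and run an experiment in place, e.g. inside a base container.

    --id is an assumed flag; verify with `trains-agent execute --help`.
    """
    return ["trains-agent", "execute", "--id", task_id]


# Placeholder values: substitute a real experiment ID and image name.
build_line = shlex.join(build_docker_command("TASK_ID", "my-experiment-image"))
execute_line = shlex.join(execute_command("TASK_ID"))
print(build_line)
print(execute_line)
```

Either list can be passed to `subprocess.run` on a machine where trains-agent is installed and configured. The first produces an image that a Kubeflow step could reference; the second would be the entrypoint of a step that starts from a generic base image.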

Does that remove a bit of the mystery? What exactly is your use case? (Is it more development oriented, or at the productization stage?)