allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.42k stars 643 forks

Support Megatron-LM training job on k8s cluster #1287

Closed. void-main closed this issue 1 week ago

void-main commented 1 week ago

Proposal Summary

Please add support for Megatron-LM integration.

Motivation

We want to train LLMs with Megatron-LM. Normally we launch the training jobs by hand on our k8s cluster.

But we would also like to use many of ClearML's features, for example pipelines.

So I wonder if it's possible to launch a Megatron training job from ClearML? If so, is there any documentation on that?

Related Discussion

None

void-main commented 1 week ago

I searched the docs a little, and it looks like I should use the k8s glue.

The workflow should look like the following:

Could anyone please tell me if my understanding is correct? Thanks

@allegro-ai @bmartinn @jkhenning

void-main commented 1 week ago

If the above understanding is correct, may I ask how ClearML manages k8s resource conflicts?

For example, what happens when I do the following:

  1. let's say we have a cluster of 3 nodes;
  2. start a task with custom k8s glue code, and it takes up 2 nodes;
  3. start a non-k8s-glue task and configure the autoscaler for the k8s cluster, so that the job could use up to 3 nodes.

Will the ClearML scheduler actually run 3 jobs for the non-k8s-glue task? Or will the clearml-agent detect the k8s glue job and schedule only a single-node job?

jkhenning commented 1 week ago

Hi @void-main,

Your plan seems correct to me. As for the conflict question, there is no conflict: the ClearML k8s glue agent does not take up any node. It simply runs as a control-plane pod and uses k8s to schedule a new pod for every task that it finds in the queue. It's up to k8s to provision the resources and start the task pod (according to the spec/template created by the glue agent).
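
To make the "one pod per queued task" idea concrete, here is a minimal sketch in plain Python. It is not the actual clearml-agent glue code and does not call the ClearML or Kubernetes APIs: `BASE_POD_TEMPLATE`, `render_pod`, `drain_queue`, the image name, and the GPU limit are all hypothetical names and values chosen for illustration. It only shows the shape of the loop: take a task from the queue, fill in a pod template, hand the manifest to the cluster, and reserve nothing yourself.

```python
import copy
from collections import deque

# Hypothetical pod template; the real glue agent builds its own spec/template.
BASE_POD_TEMPLATE = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "", "labels": {}},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "task",
                "image": "example/training:latest",  # placeholder image name
                "resources": {"limits": {"nvidia.com/gpu": 8}},  # assumed value
            }
        ],
    },
}


def render_pod(task_id, template=BASE_POD_TEMPLATE):
    """Return a pod manifest for one queued task (illustrative only)."""
    pod = copy.deepcopy(template)  # never mutate the shared template
    pod["metadata"]["name"] = f"task-{task_id}"
    pod["metadata"]["labels"]["task-id"] = task_id
    return pod


def drain_queue(queue, submit):
    """Pop every queued task and pass one pod manifest per task to `submit`.

    `submit` stands in for a Kubernetes API call. The loop itself holds no
    nodes; where and when each pod runs is left to the Kubernetes scheduler.
    """
    submitted = []
    while queue:
        task_id = queue.popleft()
        pod = render_pod(task_id)
        submit(pod)
        submitted.append(pod["metadata"]["name"])
    return submitted


if __name__ == "__main__":
    pending = deque(["a1", "b2", "c3"])  # hypothetical task IDs
    print(drain_queue(pending, submit=lambda pod: None))
```

In this picture, the 3-node scenario above is decided entirely on the Kubernetes side: each submitted pod either fits the free resources and starts, or waits as pending until resources are available.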

void-main commented 1 week ago

Thanks for the explanation @jkhenning !