Closed: void-main closed this 1 week ago
I searched the docs for a bit, and it looks like I should use the k8s glue.
The workflow should be like the following:
Could anyone please tell me if this understanding is correct? Thanks
@allegro-ai @bmartinn @jkhenning
If the above understanding is correct, may I ask how ClearML manages k8s resource conflicts?
For example, what happens when I do the following operations:
Will the ClearML scheduler actually run 3 jobs for the non-k8s-glue task? Or will the clearml-agent sense the k8s glue job and only schedule a single-node job?
Hi @void-main,
Your plan seems correct to me. As for the conflict question, there is no conflict: the ClearML k8s glue agent does not take any node. It simply runs as a control-plane pod and uses k8s to schedule a new pod for every task that it finds in the queue. It's up to k8s to provision the resources and start the task pod (according to the spec/template created by the glue agent).
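To make the division of labor described above concrete, here is a toy sketch in plain Python. It is not ClearML or Kubernetes code: `POD_TEMPLATE`, `glue_poll`, `toy_schedule`, the `task-pod-` name prefix, and the GPU counts are all hypothetical names and values made up for illustration. It only models the idea that the glue turns each queued task into a pod spec, while a separate scheduler decides what actually runs.

```python
from collections import deque
from copy import deepcopy

# Hypothetical pod template; the fields loosely mimic a pod spec but are illustrative only.
POD_TEMPLATE = {"kind": "Pod", "metadata": {"name": None}, "resources": {"gpus": 1}}


def glue_poll(queue, template):
    """Toy glue agent: emit one pod spec per queued task. It reserves no node itself."""
    pods = []
    while queue:
        task_id = queue.popleft()
        pod = deepcopy(template)  # copy so tasks never share a mutable spec
        pod["metadata"]["name"] = f"task-pod-{task_id}"
        pods.append(pod)
    return pods


def toy_schedule(pods, free_gpus):
    """Stand-in for the k8s scheduler: run pods while capacity remains, else leave them pending."""
    running, pending = [], []
    for pod in pods:
        need = pod["resources"]["gpus"]
        if need <= free_gpus:
            free_gpus -= need
            running.append(pod["metadata"]["name"])
        else:
            pending.append(pod["metadata"]["name"])
    return running, pending


queue = deque(["a1", "b2", "c3"])  # three tasks waiting in the queue
pods = glue_poll(queue, POD_TEMPLATE)
running, pending = toy_schedule(pods, free_gpus=2)
print(running, pending)
```

In this toy model the third pod stays pending because capacity runs out, which is a decision made entirely by the scheduler stand-in and not by the glue function.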
Thanks for the explanation @jkhenning !
Proposal Summary
Please add support for Megatron-LM integration.
Motivation
We want to train LLMs with Megatron-LM. Normally we launch tasks by hand on our k8s cluster.
But we want many cool features from clearml, for example, pipelines.
So I wonder if it's possible to launch a Megatron training job from clearml? If so, is there any documentation on that?
Related Discussion
None