allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

RAM / CPU cores partitioning for multiple agents on the same machine #176

Open mkurczew opened 1 year ago

mkurczew commented 1 year ago

Hi, I have several workstations with many CPU cores and a lot of RAM. All agents run Ubuntu. I would like to run multiple ClearML agents (barebones, no k8s) on each of the workstations.

Can I somehow prevent one agent running a job from hoarding all of the resources (e.g. CPU cores) and guarantee each agent a minimum quota (preferably dynamic, e.g. at least 8 cores and 1/3 of the RAM, but more if available)?

Or, alternatively, can I prevent the accidental scheduling of another job to the machine which is bogged down by other tasks?

Is there any way to achieve that with ClearML? Documentation suggests that the only resource agents "manage" is GPUs.

EDIT: Just started to wonder, should I file this here or under the ClearML project?

ainoam commented 1 year ago

This is probably the right place @mkurczew :)

To limit the resources available to an agent, you can use extra_docker_arguments, for example:

extra_docker_arguments=["--memory=1g", "--cpus=2"]

(see https://docs.docker.com/config/containers/resource_constraints/)
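
For an even split across several agents on one workstation, the per-agent values can be derived from the machine totals. Below is a minimal sketch, assuming the agents run in docker mode and that an equal split is acceptable; `docker_resource_args` is a hypothetical helper, not part of ClearML. Note that `--cpus` and `--memory` are hard caps, so this does not give the "more if available" behaviour asked about above.

```python
import json


def docker_resource_args(n_agents, total_cpus, total_mem_gib):
    """Build Docker resource flags for one of n_agents equal shares.

    Hypothetical helper (not a ClearML API): it only formats the
    standard Docker flags --cpus and --memory.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    cpus = total_cpus / n_agents          # fractional CPUs are allowed by --cpus
    mem_gib = total_mem_gib // n_agents   # round down to whole GiB
    if mem_gib < 1:
        raise ValueError("less than 1 GiB of RAM per agent")
    return [f"--cpus={cpus:g}", f"--memory={mem_gib}g"]


# Example: 3 agents on a workstation with 24 cores and 96 GiB of RAM.
args = docker_resource_args(n_agents=3, total_cpus=24, total_mem_gib=96)
print("extra_docker_arguments=" + json.dumps(args))
```

The printed line can be copied into each agent's configuration in the same form as the example above.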