Simon-Harris-IBM / ObjectNetChallenge-Workflows

Workflows for ObjectNet Challenge

Tune memory/cpu/disk allocated to docker #4

Closed · Simon-Harris-IBM closed this issue 4 years ago

Simon-Harris-IBM commented 4 years ago

We need to tune the CPU/memory/disk/GPUs allocated to run the submitted Docker container.

Code in run_docker.py:
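The snippet from run_docker.py is not preserved in this thread. As a hedged sketch only (the helper name and default values below are assumptions, not the actual code), docker-py exposes resource limits as keyword arguments to `client.containers.run`:

```python
# Illustrative sketch -- the real run_docker.py code is not shown in this thread.
# docker-py passes resource limits as kwargs to client.containers.run():
#   mem_limit, shm_size, nano_cpus, and device_requests (for GPU access).

def build_run_kwargs(mem_gb: int = 32, shm_gb: int = 12, cpus: int = 7) -> dict:
    """Assumed helper: build resource-limit kwargs for containers.run()."""
    return {
        "mem_limit": f"{mem_gb}g",   # hard RAM cap for the container
        "shm_size": f"{shm_gb}g",    # shared memory (PyTorch DataLoader workers use /dev/shm)
        "nano_cpus": cpus * 10**9,   # CPU quota: 1_000_000_000 nano-CPUs == 1 CPU
    }

# Usage against a live Docker daemon (comment only, so this sketch stays runnable):
# import docker
# client = docker.from_env()
# client.containers.run(
#     "submitted-image",
#     device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
#     **build_run_kwargs(),
# )
```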

mummert-ibm commented 4 years ago

The current plan is to run only a single GPU-enabled container at a time on the backend nodes. We will allow it to use all the GPUs, and should limit CPU and RAM so that the infrastructure components (orchestrator et al.) are not starved of resources (their needs are relatively minimal).

Tentatively:

- allowed RAM = total RAM - 8GB
- total disk = unlimited; if we leave this unlimited it will be bounded by the filesystem on which Docker's /var/lib/docker lives. Which may be fine... we need to investigate how to clean up / garbage collect after an image is removed.
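The RAM headroom rule above (reserve 8GB for the infrastructure components) can be sketched as a small helper; the function name and the guard are illustrative assumptions, not code from the repo:

```python
def allowed_ram_gb(total_ram_gb: int, reserve_gb: int = 8) -> int:
    """Allowed container RAM = total RAM minus headroom reserved for the
    orchestrator and other infrastructure components (illustrative helper)."""
    if total_ram_gb <= reserve_gb:
        raise ValueError("host too small to reserve infrastructure headroom")
    return total_ram_gb - reserve_gb

# e.g. on a 32GB node the submitted container would be capped at 24GB
```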

Simon-Harris-IBM commented 4 years ago

The following settings were tested by Adam - runtime in the table is the model runtime directly on the machine (i.e. no Synapse):

| VM | Memory (G) | CPUs | Shared memory (G) | Run time |
|-------|----|---|----|------|
| Todd  | 32 | 7 | 16 | 6m54 |
| Todd  | 32 | 7 | 12 | 6m50 |
| Todd  | 32 | 7 | 8  | 6m36 |
| Todd  | 48 | 7 | 12 | 6m41 |
| Todd  | 32 | 6 | 12 | 8m15 |
| Simon | 32 | 7 | 12 | 7m1  |
| Simon | 32 | 7 | 12 | 6m50 |

Based on this, we've decided to go with the following settings:

Attempts to limit CPU consumption to 87% using the arguments documented in the docker-py API docs (https://docker-py.readthedocs.io/en/stable/containers.html) have not worked. I'm not too concerned about this, as CPU only spikes to 100% infrequently during a run.
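For reference, the docker-py argument that caps absolute CPU usage is `nano_cpus` (1,000,000,000 units per CPU; `cpu_period`/`cpu_quota` are the lower-level equivalents). A hedged sketch of what an 87% cap would translate to, with an illustrative helper and an assumed 8-CPU host:

```python
def nano_cpus_for_fraction(num_cpus: int, fraction: float) -> int:
    """nano_cpus value capping a container at `fraction` of `num_cpus` CPUs.
    docker-py convention: 1 CPU == 1_000_000_000 nano-CPUs (illustrative helper)."""
    return int(num_cpus * fraction * 1_000_000_000)

# 87% of an assumed 8-CPU host -> 6_960_000_000 nano-CPUs (~6.96 CPUs),
# which would be passed as containers.run(..., nano_cpus=...)
```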

With the above settings, inference using the example PyTorch model takes approximately 10m30s from submission to completion via Synapse.