MMelQin opened 3 years ago
@bhatt-piyush
Some applications require large shared memory.
For example, the Triton docker image is recommended to be run with the following options:
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
E.g., nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/cuda:11.1-runtime-ubuntu20.04
We would need a way to specify shm-size (maybe with a large default size?) in https://github.com/Project-MONAI/monai-app-sdk/blob/9888a9311013f9bb7a2168670e07a3b05ceea6f1/monai/deploy/runner/runner.py#L98 .
Large shared memory is needed when the PyTorch DataLoader class is used in the app.
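To illustrate why this matters: PyTorch DataLoader workers exchange tensors through shared memory (/dev/shm on Linux), and Docker's default --shm-size is only 64 MB, so workers can crash with "bus error" or "unable to open shared memory" inside an unconfigured container. The snippet below is a minimal, stdlib-only sketch (not part of the SDK) for checking the shared-memory mount size from inside a container:

```python
import shutil

def shm_size_gib(path="/dev/shm"):
    """Return the total size of the shared-memory mount in GiB."""
    total, _used, _free = shutil.disk_usage(path)
    return total / (1024 ** 3)

if __name__ == "__main__":
    size = shm_size_gib()
    print(f"/dev/shm: {size:.2f} GiB")
    # Docker's default of 64 MB is far below the 1g the Triton docs recommend.
    if size < 1.0:
        print("Warning: shared memory below 1 GiB; DataLoader workers may fail.")
```

Running this inside a container started without --shm-size typically reports about 0.06 GiB, which makes the failure mode easy to confirm.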
I don't see the "--rm" option in the statement; maybe it is somewhere else.
@bhatt-piyush
We may also need to consider executing nvidia-docker instead of docker when the app requires a GPU, to make sure that the NVIDIA runtime is used and a GPU is available inside the docker container.
Doesn't Docker already incorporate nvidia-docker, or doesn't the latter register itself with Docker when installed? Anyway, the model I am testing was trained with a GPU and scripted with that dependency.
AFAIK, the NVIDIA runtime is used by default only when it is set as the default runtime in the /etc/docker/daemon.json file.
Otherwise, the '--gpus' argument is needed when executing 'docker run'.
nvidia-docker2 overwrites the default daemon.json file on installation, so yes, the daemon.json file needs to change, but that is done automatically. In any case, the requirement on nvidia-docker2 then needs to be clearly documented. The sample ai_spleen_seg_app has a GPU request via @Resources, but the App Runner does not validate it.
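For reference, a daemon.json along the lines of what the nvidia-docker2 package installs (exact contents may vary by version) registers the NVIDIA runtime and makes it the default, so plain docker run gets GPU access without --gpus:

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Note that the package registers the "nvidia" runtime but does not necessarily set "default-runtime"; that line may need to be added by hand if relying on the default runtime rather than --gpus.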
Following the conversation above, does this sound right? The correct App Runner behavior would be:
- Support --shm-size as an argument in the monai-deploy run ... command.
- Support --ulimit?
- Pass the --gpus all flag while launching the container, or alternatively use nvidia-docker2 run ...
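The behavior above could be sketched as a helper that assembles the docker run argument list. This is purely illustrative; the function name and parameters are assumptions, not the actual runner.py API in monai-deploy-app-sdk:

```python
def build_docker_cmd(image, shm_size="1g", ulimits=None, use_gpus=True):
    """Assemble a 'docker run' argument list with the options discussed above.

    Hypothetical sketch; the real runner.py may structure this differently.
    """
    cmd = ["docker", "run", "--rm"]
    if use_gpus:
        # Requires the NVIDIA container runtime to be installed.
        cmd += ["--gpus", "all"]
    if shm_size:
        cmd.append(f"--shm-size={shm_size}")
    # e.g. ulimits=["memlock=-1", "stack=67108864"]
    for limit in (ulimits or []):
        cmd += ["--ulimit", limit]
    cmd.append(image)
    return cmd

print(" ".join(build_docker_cmd("my-app:latest",
                                ulimits=["memlock=-1", "stack=67108864"])))
```

A single pass-through option on the Runner CLI (as suggested below) would simply splice user-provided strings into this list before the image name.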
I am actually thinking bigger: instead of exposing docker's command line options individually, maybe just have the Runner support a single CLI option that takes them all. Anyway, so far defining just --shm-size for my app image is sufficient, though other apps may need the stack size configured with --ulimit.
The bigger issue: how do the Packager and App Server know the app needs a specific shm-size? And how am I supposed to pass --shm-size to the Packager, as there is no support for it anyway?
My local runner script for 0.8.1 has --shm-size (for in-proc Torch inference in an app), but for production, only remote inference is used, so the app does not need --shm-size defined, avoiding the issue.
Assignees and todo:
- shm-size support: include it in the new MAP manifest.
- MAR to specify the shm-size present in the MAP manifest when launching jobs. WIP PR to resolve: https://github.com/Project-MONAI/monai-deploy-app-sdk/pull/145 (blocked on @gigony's and @KavinKrishnan's changes)
Observed: AI inference apps often require more shared memory, and it has been observed that the example ai_spleen_seg_app fails when run by the App Runner because shm-size is not set when spinning up the app docker container.
Expected: The App Package can be run.
How to reproduce:
- Get the named example app.
- Package it into an App Package.
- Run the package with monai-deploy run, see below.