Project-MONAI / monai-deploy-app-sdk

MONAI Deploy App SDK offers a framework and associated tools to design, develop and verify AI-driven applications in the healthcare imaging domain.
Apache License 2.0

The App Runner needs to support setting shm when running App package #66

Open MMelQin opened 3 years ago

MMelQin commented 3 years ago

Observed: Oftentimes AI inference apps require more shared memory, and it has been observed that the example ai_spleen_seg_app fails when run by the App Runner because shm is not set when the app's Docker container is spun up.

Expected: The App Package can be run successfully.

How to reproduce:

1. Get the named example app.
2. Package it into an App Package.
3. Run the package with monai-deploy run, see below.

mqin@mingq-dt:~/src/monai-app-sdk/examples/apps$ monai-deploy run spleen_seg_app:0.1.1 ai_spleen_seg_app/input output
Checking dependencies...
--> Verifying if "docker" is installed...

--> Verifying if "spleen_seg_app:0.1.1" is available...

Checking for MAP "spleen_seg_app:0.1.1" locally
"spleen_seg_app:0.1.1" found.

Reading MONAI App Package manifest...
 > export '/var/run/monai/export/' detected
DEBUG:__main__.AISpleenSegApp:Begin compose
DEBUG:__main__.AISpleenSegApp:End compose
DEBUG:__main__.AISpleenSegApp:App Path: /opt/monai/app/app.py,             Input: /opt/monai/app/input,             Output: /opt/monai/app/output,            Models: /opt/monai/models
DEBUG:__main__.AISpleenSegApp:Begin run
Going to initiate execution of operator DICOMDataLoaderOperator
Executing operator DICOMDataLoaderOperator (Process ID 1)
Done performing execution of operator DICOMDataLoaderOperator

Going to initiate execution of operator DICOMSeriesSelectorOperator
Executing operator DICOMSeriesSelectorOperator (Process ID 1)
Done performing execution of operator DICOMSeriesSelectorOperator

Going to initiate execution of operator DICOMSeriesToVolumeOperator
Executing operator DICOMSeriesToVolumeOperator (Process ID 1)
Done performing execution of operator DICOMSeriesToVolumeOperator

Going to initiate execution of operator SpleenSegOperator
Executing operator SpleenSegOperator (Process ID 1)
pre-transform:
{'transforms': (<monai.transforms.io.dictionary.LoadImaged object at 0x7f90d4632f40>, <monai.transforms.utility.dictionary.EnsureChannelFirstd object at 0x7f90d4632f70>, <monai.transforms.spatial.dictionary.Spacingd object at 0x7f90d465e310>, <monai.transforms.intensity.dictionary.ScaleIntensityRanged object at 0x7f90d465e340>, <monai.transforms.croppad.dictionary.CropForegroundd object at 0x7f90d465e580>, <monai.transforms.utility.dictionary.ToTensord object at 0x7f90d465e610>), 'map_items': True, 'unpack_items': False, 'R': RandomState(MT19937) at 0x7F90E41DD140}
post-transform:
{'transforms': (<monai.transforms.post.dictionary.Activationsd object at 0x7f90d465e730>, <monai.transforms.post.dictionary.AsDiscreted object at 0x7f90d465e8e0>, <monai.transforms.post.dictionary.Invertd object at 0x7f90d465eac0>, <monai.transforms.io.dictionary.SaveImaged object at 0x7f90d465eaf0>), 'map_items': True, 'unpack_items': False, 'R': RandomState(MT19937) at 0x7F90E41DD140}
Model path: /opt/monai/models/model/model.ts
Model name (expected None): model
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 41) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/monai/app/app.py", line 72, in <module>
    app_instance.run()
  File "/opt/monai/app/app.py", line 39, in run
    super().run()
  File "/home/monai/.local/lib/python3.8/site-packages/monai/deploy/core/application.py", line 316, in run
    executor.run()
  File "/home/monai/.local/lib/python3.8/site-packages/monai/deploy/core/executors/single_process_executor.py", line 73, in run
    op.compute(op_exec_context.input, op_exec_context.output, op_exec_context)
  File "/opt/monai/app/spleen_seg_operator.py", line 87, in compute
    infer_operator.compute(input, output, context)
  File "/home/monai/.local/lib/python3.8/site-packages/monai/deploy/operators/monai_seg_inference_operator.py", line 171, in compute
    for d in dataloader:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 41) exited unexpectedly

ERROR: MONAI Application "spleen_seg_app:0.1.1" failed.
gigony commented 3 years ago

@bhatt-piyush Some applications require large shared memory. For example, the Triton Docker image is recommended to be run with the options --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864, e.g., nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/cuda:11.1-runtime-ubuntu20.04

We would need a way to specify shm-size (maybe with a large default size?) in https://github.com/Project-MONAI/monai-app-sdk/blob/9888a9311013f9bb7a2168670e07a3b05ceea6f1/monai/deploy/runner/runner.py#L98 .

Large shared memory is needed when the PyTorch DataLoader class is used in the app.
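To make that concrete, here is a minimal sketch, in the spirit of the referenced runner.py rather than its actual code, of a docker invocation that forwards an shm-size value; the function name, the "1g" default, and the mount paths are assumptions for illustration:

```python
import subprocess


def run_map(map_name: str, input_dir: str, output_dir: str, shm_size: str = "1g"):
    """Launch a MAP with 'docker run', forwarding --shm-size.

    The shm_size parameter, its "1g" default, and the mount paths are
    illustrative assumptions, not the SDK's actual runner.py behavior.
    """
    cmd = [
        "docker", "run", "--rm",
        "--shm-size", shm_size,  # enlarge /dev/shm so PyTorch DataLoader workers avoid bus errors
        "-v", f"{input_dir}:/var/run/monai/input",
        "-v", f"{output_dir}:/var/run/monai/output",
        map_name,
    ]
    subprocess.run(cmd, check=True)
```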

MMelQin commented 3 years ago

I don't see the "--rm" option in the statement, maybe it is somewhere else.

gigony commented 3 years ago

> I don't see the "--rm" option in the statement, maybe it is somewhere else.

It's in https://github.com/Project-MONAI/monai-app-sdk/blob/9888a9311013f9bb7a2168670e07a3b05ceea6f1/monai/deploy/runner/runner.py#L75

gigony commented 3 years ago

@bhatt-piyush

We may also need to consider executing nvidia-docker instead of docker when the app requires a GPU, to make sure the NVIDIA runtime is used and the GPU is available inside the Docker container.

MMelQin commented 3 years ago

Doesn't Docker already incorporate nvidia-docker, or doesn't the latter register itself with Docker when installed? In any case, the model I am testing was trained with a GPU and scripted with that dependency.

gigony commented 3 years ago

AFAIK, nvidia-docker would be used by default only when the NVIDIA runtime is specified as the default runtime in the /etc/docker/daemon.json file.

Otherwise, the '--gpus' argument is needed when executing 'docker run'.

MMelQin commented 3 years ago

nvidia-docker2 overwrites the default daemon.json file on installation, so yes, the daemon.json file needs to change, but that is done automatically. In any case, the requirement on nvidia-docker2 then needs to be clearly documented. The sample ai_spleen_seg_app requests a GPU via @Resources, but the App Runner does not validate it.
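As an illustration of the validation being discussed, here is a hedged sketch of how a runner could inspect a GPU request and add '--gpus all' to the docker command; the 'resources.gpu' manifest field used here is a hypothetical example, not the actual MAP manifest schema:

```python
import json
import shutil


def gpu_docker_args(pkg_manifest_path: str) -> list:
    """Return extra 'docker run' args when the app manifest requests a GPU.

    The 'resources.gpu' field name is a hypothetical example, not the
    actual MAP manifest schema.
    """
    with open(pkg_manifest_path) as f:
        manifest = json.load(f)

    gpu_count = manifest.get("resources", {}).get("gpu", 0)
    if gpu_count <= 0:
        return []

    # A working 'nvidia-smi' on the host is a rough proxy for an NVIDIA driver/runtime.
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("App requests a GPU but no NVIDIA driver/runtime was found on the host.")

    return ["--gpus", "all"]
```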

bhatt-piyush commented 3 years ago

Following the conversation above, does this sound right?

The correct App Runner behavior would be:

MMelQin commented 3 years ago

I am actually thinking bigger. Instead of exposing the individual docker command-line options, maybe just have the Runner support a single CLI option that takes them all in. Anyway, so far defining just --shm-size for my App image is sufficient, though other apps may need the stack size configured with --ulimit.

The bigger issue: how do the Packager and App Server know that the App needs shm-size defined? How am I supposed to pass --shm-size to the Packager, given there is no support for it anyway?

My local runner script for 0.8.1 has --shm-size (for in-process Torch inference in an app), but in production only remote inference is used, so the app does not need --shm-size defined, which avoids the issue.
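One possible shape for the single pass-through option suggested above: a hypothetical --docker-arg flag on the Runner CLI whose value is forwarded verbatim to 'docker run'. The flag name and behavior are assumptions, not an existing SDK option:

```python
import argparse
import shlex

# Hypothetical extension of the 'monai-deploy run' argument parser;
# '--docker-arg' is an assumed flag name, not an existing option.
parser = argparse.ArgumentParser(prog="monai-deploy run")
parser.add_argument("map", help="MAP image name, e.g. spleen_seg_app:0.1.1")
parser.add_argument("input", help="input directory")
parser.add_argument("output", help="output directory")
parser.add_argument(
    "--docker-arg",
    default="",
    help='extra options forwarded verbatim to docker run, '
         'e.g. --docker-arg="--shm-size=1g --ulimit stack=67108864"',
)

args = parser.parse_args(
    ["spleen_seg_app:0.1.1", "input", "output", "--docker-arg=--shm-size=1g"]
)
extra_docker_args = shlex.split(args.docker_arg)  # -> ['--shm-size=1g']
print(extra_docker_args)
```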

bhatt-piyush commented 3 years ago

Assignees and todo:

bhatt-piyush commented 3 years ago

WIP PR to resolve: MAR to specify the shm-size present in the MAP manifest when launching jobs. (Blocked on @gigony's and @KavinKrishnan's changes.) https://github.com/Project-MONAI/monai-deploy-app-sdk/pull/145
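For reference, a rough sketch of what the Runner side of such a change could look like: read a shared-memory size from the MAP manifest and turn it into a docker argument. The 'shmSize' key is assumed here for illustration; the actual manifest field and the PR's implementation may differ:

```python
import json


def shm_args_from_manifest(app_manifest: dict) -> list:
    """Build the --shm-size docker argument from a MAP manifest.

    'shmSize' is an assumed key; see the PR above for the real manifest field.
    """
    shm_size = app_manifest.get("shmSize")
    return ["--shm-size", str(shm_size)] if shm_size else []


# Example usage with a hypothetical manifest snippet:
manifest = json.loads('{"command": "python3 -u /opt/monai/app/app.py", "shmSize": "1g"}')
print(shm_args_from_manifest(manifest))  # -> ['--shm-size', '1g']
```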