Project-MONAI / MONAILabel

MONAI Label is an intelligent open source image labeling and learning tool.
https://docs.monai.io/projects/label
Apache License 2.0
610 stars 196 forks source link

Running inference with DiNTS Model Zoo bundle takes minutes (pancreas_ct_dints_segmentation) #1168

Closed nvahmadi closed 1 year ago

nvahmadi commented 1 year ago

Describe the bug Running the pancreas_ct_dints_segmentation model bundle from Model Zoo in MONAI-Label works in principle, but inference takes minutes, which makes the model not really useful in an interactive setting like MONAI-Label (not sure where to place this issue - MONAI Core vs Label vs Model Zoo repos?). I almost seems as if inference was running on CPU, but it seems that the GPU is being used. Haven't even gotten to trying a training run. I am working with a client who reported this issue, and I was able to reproduce it.

To Reproduce Run pancreas_ct_dints_segmentation pretrained model from the zoo, on any abdominal CT volume (e.g. from the MSD dataset which it was trained on, leakage is not important in this context)

Expected behavior Inference should run in seconds, but it takes minutes.

Environment MONAI-Label docker container v0.5.2, running default monaibundle app. (HW: DGX Station V100 4x32GB).

SachidanandAlle commented 1 year ago

Can you paste the monailabel server logs when you ran the inference

nvahmadi commented 1 year ago

Sure, this is the log for inference of a single vol:

[2022-12-02 12:06:35,657] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.app:38) - Initializing App from: /data/Projects/MONAI_label/apps/v0.5.2/monaibundle; studies: /data/Projects/MONAI_label/datasets/Pancreas_MSD_10samples; conf: {'models': 'pancreas_ct_dints_segmentation_v0.3.5', 'multi_gpu': 'true'}
[2022-12-02 12:06:35,916] [1322] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for MONAILabelApp Found: <class 'main.MyApp'>
[2022-12-02 12:06:36,315] [1322] [MainThread] [INFO] (main:78) - +++ Adding Bundle from Zoo: pancreas_ct_dints_segmentation_v0.3.5 => {'checksum': '992028f4dda21035a1bed264a8abbc5cea704020', 'source': 'https://github.com/Project-MONAI/model-zoo/releases/download/hosting_storage_v1/pancreas_ct_dints_segmentation_v0.3.5.zip'} => /data/Projects/MONAI_label/apps/v0.5.2/monaibundle/model/pancreas_ct_dints_segmentation_v0.3.5
[2022-12-02 12:06:36,317] [1322] [MainThread] [INFO] (main:88) - +++ Using Models: ['pancreas_ct_dints_segmentation_v0.3.5']
[2022-12-02 12:06:36,318] [1322] [MainThread] [INFO] (monailabel.interfaces.app:129) - Init Datastore for: /data/Projects/MONAI_label/datasets/Pancreas_MSD_10samples
[2022-12-02 12:06:36,324] [1322] [MainThread] [INFO] (monailabel.datastore.local:128) - Auto Reload: True; Extensions: ['*.nii.gz', '*.nii', '*.nrrd', '*.jpg', '*.png', '*.tif', '*.svs', '*.xml']
[2022-12-02 12:06:36,668] [1322] [MainThread] [INFO] (monailabel.datastore.local:574) - Invalidate count: 0
[2022-12-02 12:06:36,668] [1322] [MainThread] [INFO] (monailabel.datastore.local:148) - Start observing external modifications on datastore (AUTO RELOAD)
[2022-12-02 12:06:40,988] [1322] [MainThread] [INFO] (main:106) - +++ Adding Inferer:: pancreas_ct_dints_segmentation_v0.3.5 => <monailabel.tasks.infer.bundle.BundleInferTask object at 0x7ff86cc80ee0>
[2022-12-02 12:06:41,318] [1322] [MainThread] [INFO] (main:120) - +++ Adding Trainer:: pancreas_ct_dints_segmentation_v0.3.5 => <monailabel.tasks.train.bundle.BundleTrainTask object at 0x7ff86cc80df0>
[2022-12-02 12:06:41,318] [1322] [MainThread] [INFO] (main:130) - Active Learning Strategies:: ['random', 'first']
[2022-12-02 12:06:41,324] [1322] [MainThread] [INFO] (monailabel.utils.sessions:51) - Session Path: /root/.cache/monailabel/sessions
[2022-12-02 12:06:41,324] [1322] [MainThread] [INFO] (monailabel.utils.sessions:52) - Session Expiry (max): 3600
[2022-12-02 12:06:41,325] [1322] [MainThread] [INFO] (monailabel.interfaces.app:467) - App Init - completed
[2022-12-02 12:06:41,325] [1322] [MainThread] [INFO] (timeloop:60) - Starting Timeloop..
[2022-12-02 12:06:41,326] [1322] [MainThread] [INFO] (timeloop:42) - Registered job <function MONAILabelApp.on_init_complete.<locals>.run_scheduler at 0x7ff86cc75c10>
[2022-12-02 12:06:41,326] [1322] [MainThread] [INFO] (timeloop:63) - Timeloop now started. Jobs will run based on the interval set
[2022-12-02 12:07:16,516] [1322] [MainThread] [INFO] (monailabel.endpoints.activelearning:43) - Active Learning Request: {'strategy': 'random', 'client_id': 'user-xyz'}
[2022-12-02 12:07:16,578] [1322] [MainThread] [INFO] (monailabel.tasks.activelearning.random:45) - Random: Selected Image: pancreas_421; Weight: 842803
[2022-12-02 12:07:16,614] [1322] [MainThread] [INFO] (monailabel.endpoints.activelearning:59) - Next sample: {'id': 'pancreas_421', 'weight': 842803, 'path': '/data/Projects/MONAI_label/datasets/Pancreas_MSD_10samples/pancreas_421.nii.gz', 'ts': 1669139452, 'name': 'pancreas_421.nii.gz'}
[2022-12-02 12:07:25,195] [1322] [MainThread] [INFO] (monailabel.endpoints.infer:160) - Infer Request: {'model': 'pancreas_ct_dints_segmentation_v0.3.5', 'image': 'pancreas_421', 'device': 'cuda', 'result_extension': '.nrrd', 'result_dtype': 'uint8', 'client_id': 'user-xyz'}
[2022-12-02 12:07:25,207] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:258) - Infer Request (final): {'device': 'cuda', 'model': 'pancreas_ct_dints_segmentation_v0.3.5', 'image': '/data/Projects/MONAI_label/datasets/Pancreas_MSD_10samples/pancreas_421.nii.gz', 'result_extension': '.nrrd', 'result_dtype': 'uint8', 'client_id': 'user-xyz', 'description': 'Searched architectures for volumetric (3D) segmentation of the pancreas from CT image'}
[2022-12-02 12:07:25,225] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:63) - PRE - Run Transform(s)
[2022-12-02 12:07:25,226] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:64) - PRE - Input Keys: ['device', 'model', 'image', 'result_extension', 'result_dtype', 'client_id', 'description', 'image_path']
[2022-12-02 12:07:29,305] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (LoadImageTensord): Time: 4.0786; image: (512, 512, 131)(torch.float32)
[2022-12-02 12:07:29,307] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (EnsureChannelFirstd): Time: 0.0007; image: (1, 512, 512, 131)(torch.float32)
[2022-12-02 12:07:29,336] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (Orientationd): Time: 0.0284; image: (1, 512, 512, 131)(torch.float32)
[2022-12-02 12:07:35,167] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (Spacingd): Time: 5.831; image: (1, 498, 498, 261)(torch.float32)
[2022-12-02 12:07:35,437] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (ScaleIntensityRanged): Time: 0.2687; image: (1, 498, 498, 261)(torch.float32)
[2022-12-02 12:07:35,438] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - PRE - Transform (EnsureTyped): Time: 0.0002; image: (1, 498, 498, 261)(torch.float32)
[2022-12-02 12:07:35,444] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:426) - Inferer:: cuda => SlidingWindowInferer => {'roi_size': [96, 96, 96], 'sw_batch_size': 4, 'overlap': 0.625, 'mode': constant, 'sigma_scale': 0.125, 'padding_mode': constant, 'cval': 0.0, 'sw_device': None, 'device': None, 'progress': False, 'cpu_thresh': None, 'roi_weight_map': None}
[2022-12-02 12:07:35,447] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:375) - Infer model path: /data/Projects/MONAI_label/apps/v0.5.2/monaibundle/model/pancreas_ct_dints_segmentation_v0.3.5/models/model.pt
[2022-12-02 12:11:38,625] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:63) - POST - Run Transform(s)
[2022-12-02 12:11:38,627] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:64) - POST - Input Keys: ['device', 'model', 'image', 'result_extension', 'result_dtype', 'client_id', 'description', 'image_path', 'image_meta_dict', 'latencies', 'pred']
[2022-12-02 12:11:38,688] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - POST - Transform (Activationsd): Time: 0.0604; image: (1, 498, 498, 261)(torch.float32); pred: (3, 498, 498, 261)(torch.float32)
[2022-12-02 12:11:39,674] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - POST - Transform (Invertd): Time: 0.9802; image: (1, 498, 498, 261)(torch.float32); pred: (3, 512, 512, 131)(torch.float32)
[2022-12-02 12:11:46,445] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - POST - Transform (AsDiscreted): Time: 6.7707; image: (1, 498, 498, 261)(torch.float32); pred: (1, 512, 512, 131)(torch.float32)
[2022-12-02 12:11:46,447] [1322] [MainThread] [INFO] (monailabel.interfaces.utils.transform:109) - POST - Transform (Restored): Time: 0.0007; image: (1, 498, 498, 261)(torch.float32); pred: (512, 512, 131)(torch.float32)
[2022-12-02 12:11:46,448] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:458) - Writing Result...
[2022-12-02 12:11:46,454] [1322] [MainThread] [INFO] (monailabel.transform.writer:188) - Result ext: .nrrd; write_to_file: True; dtype: uint8
[2022-12-02 12:11:50,454] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:290) - ++ Latencies => Total: 265.2478; Pre: 10.2310; Inferer: 243.1687; Invert: 0.0000; Post: 7.8406; Write: 4.0058
[2022-12-02 12:11:50,454] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:314) - Result File: /tmp/tmpybjkvadd.nrrd
[2022-12-02 12:11:50,454] [1322] [MainThread] [INFO] (monailabel.tasks.infer.basic_infer:315) - Result Json Keys: ['label_names', 'latencies']
SachidanandAlle commented 1 year ago

You can see from logs.. inference itself took 4 mins... That's very huge.. how about 2nd infer request (after model is cached into GPU).. I think latency is coming the network. If you have GPU memory a bit more.. you can increase the roi size for sliding window inferer.. reduce overlap size..

but still something is not correct in the network side.. I doubt some operations are not fast or not running on GPU

Latencies => Total: 265.2478; Pre: 10.2310; Inferer: 243.1687; Invert: 0.0000; Post: 7.8406; Write: 4.0058

You can see latency for each transform and every stage... This should help you to dissect the problem.

tangy5 commented 1 year ago

I see similar when using this bundle, the bundle itself might have some high overlap with slidingwindow inference. The pre-, post- transform works normal. See default overlap is 0.625, it's high and inference will be slow if a larger CT is fed. This is a trade-off regarding efficiency and performance. Do we need to do something to limit the overlap for bundles (to guarantee the user experience in MONAI Label)? @SachidanandAlle

SachidanandAlle commented 1 year ago

No... That goes as part of bundle or user of those bundle to take care…. Bundle and insights of bundle are abstract to label

SachidanandAlle commented 1 year ago

closing this issue.. if you are looking to fix the default window size, overlap etc.. feel free to raise a similar issue on model-zoo