Issues with heuristic_planner on a multiple GPU setup

saileshsidhwani commented 1 year ago

Describe the bug: When 'heuristic_planner' is enabled with the sample radiology app, the monailabel server fails at startup on my system with the following error:

  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\utils\others\planner.py", line 97, in _get_target_img_size
    width_base_2 = int(2 ** np.round(np.log2(width)))
ValueError: cannot convert float NaN to integer

My machine has 2 different GPUs, the first one is a 2GB Quadro P600 and the second one is a 24GB NVIDIA TITAN RTX.

Digging into this, I noticed that the _get_target_img_size method in the monailabel.utils.others.planner module always uses the first GPU reported on the system to compute the available GPU memory for the heuristic.

def _get_target_img_size(target_img_size):
    # This should return an image according to the free gpu memory available
    # Equation obtained from curve fitting using table:
    # https://tinyurl.com/tableGPUMemory
    gpu_mem = gpu_memory_map()[0]

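This is not ideal, as the first GPU on my system is only 2GB and had just 177 MB free at startup (see the server logs below). With so little memory, the math in the above function breaks: the value printed just before the log2 warning is -28.484375, and taking log2 of a negative number gives NaN. For training/inference, I always set the cuda device to make sure the right GPU is used instead (RTX in my case).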
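A minimal sketch of how the rounding step could be guarded so that a non-positive width does not turn into NaN. This is not the MONAI Label implementation: the function name `nearest_power_of_two` and the fallback value `minimum=16` are assumptions made for illustration, and it uses the standard library `math` module rather than NumPy.

```python
import math


def nearest_power_of_two(width: float, minimum: int = 16) -> int:
    """Round width to the nearest power of two (rounding in log2 space).

    Hypothetical helper: `minimum` is an assumed fallback, not a value
    taken from MONAI Label.
    """
    # A NaN, infinite, zero, or negative width has no usable log2,
    # so fall back to the assumed minimum instead of raising later.
    if not math.isfinite(width) or width <= 0:
        return minimum
    # Same idea as int(2 ** round(log2(width))) in the traceback,
    # clamped from below so tiny widths do not collapse to 0.
    return max(minimum, int(2 ** round(math.log2(width))))


if __name__ == "__main__":
    print(nearest_power_of_two(-28.484375))  # value seen in the server log
    print(nearest_power_of_two(100.0))
```

Whether a silent fallback or an explicit error is preferable depends on the planner's design; raising a clear message such as "not enough free GPU memory" would also avoid the opaque NaN conversion error.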

If I change the GPU index in the code above, replacing gpu_mem = gpu_memory_map()[0] with gpu_mem = gpu_memory_map()[1], everything works as expected.

This leads me to believe that it would be best if I could tell the planner which GPU to use when computing the heuristic.
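One way this could look, as a sketch only: a helper that takes the index-to-free-memory mapping (the same shape as the `{0: 177, 1: 23853}` dictionary shown in the server logs) plus an optional GPU index. The name `select_gpu_memory` and the `gpu_index` parameter are hypothetical and are not part of MONAI Label; defaulting to the GPU with the most free memory is likewise an assumption.

```python
from typing import Dict, Optional


def select_gpu_memory(memory_map: Dict[int, int], gpu_index: Optional[int] = None) -> int:
    """Return the free memory (MB) of the GPU the planner should plan for.

    Hypothetical helper: `gpu_index` stands in for a user-facing option
    that does not exist in the library as reported in this issue.
    """
    if not memory_map:
        raise ValueError("No GPUs were reported")
    if gpu_index is None:
        # Assumed default: plan for the GPU with the most free memory.
        return max(memory_map.values())
    if gpu_index not in memory_map:
        raise ValueError(f"GPU index {gpu_index} not in {sorted(memory_map)}")
    return memory_map[gpu_index]


if __name__ == "__main__":
    reported = {0: 177, 1: 23853}  # values from the server log, in MB
    print(select_gpu_memory(reported))      # most free memory
    print(select_gpu_memory(reported, 0))   # explicit index
```

With an explicit index, the user stays in control when the training device is chosen elsewhere; with the default, a small display GPU at index 0 no longer decides the plan.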

Server logs

[2023-03-21 16:00:44,290] [23536] [MainThread] [INFO] (uvicorn.error:75) - Started server process [23536]
[2023-03-21 16:00:44,291] [23536] [MainThread] [INFO] (uvicorn.error:45) - Waiting for application startup.
[2023-03-21 16:00:44,291] [23536] [MainThread] [INFO] (monailabel.interfaces.utils.app:38) - Initializing App from: \\batfs-sb09-cifs\vmgr\sb09\ssidhwan\VirtualEnvs\Windows\Cuda11_3_Torch1_12\monailabel\sample-apps\radiology; studies: C:\Users\ssidhwan\Desktop\imagesTr; conf: {'models': 'segmentation_spleen', 'heuristic_planner': 'true'}
[2023-03-21 16:00:45,795] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for MONAILabelApp Found: <class 'main.MyApp'>
[2023-03-21 16:00:46,019] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.deepedit.DeepEdit'>
[2023-03-21 16:00:46,032] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.deepgrow_2d.Deepgrow2D'>
[2023-03-21 16:00:46,064] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.deepgrow_3d.Deepgrow3D'>
[2023-03-21 16:00:46,078] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.localization_spine.LocalizationSpine'>
[2023-03-21 16:00:46,100] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.localization_vertebra.LocalizationVertebra'>
[2023-03-21 16:00:46,112] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.segmentation.Segmentation'>
[2023-03-21 16:00:46,133] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.segmentation_spleen.SegmentationSpleen'>
[2023-03-21 16:00:46,143] [23536] [MainThread] [INFO] (monailabel.utils.others.class_utils:37) - Subclass for TaskConfig Found: <class 'lib.configs.segmentation_vertebra.SegmentationVertebra'>
[2023-03-21 16:00:46,160] [23536] [MainThread] [INFO] (main:93) - +++ Adding Model: segmentation_spleen => lib.configs.segmentation_spleen.SegmentationSpleen
[2023-03-21 16:00:46,325] [23536] [MainThread] [INFO] (lib.configs.segmentation_spleen:75) - EPISTEMIC Enabled: False; Samples: 5
[2023-03-21 16:00:46,330] [23536] [MainThread] [INFO] (main:96) - +++ Using Models: ['segmentation_spleen']
[2023-03-21 16:00:46,335] [23536] [MainThread] [INFO] (monailabel.interfaces.app:135) - Init Datastore for: C:\Users\ssidhwan\Desktop\imagesTr
[2023-03-21 16:00:46,339] [23536] [MainThread] [INFO] (monailabel.datastore.local:129) - Auto Reload: True; Extensions: ['*.nii.gz', '*.nii', '*.nrrd', '*.jpg', '*.png', '*.tif', '*.svs', '*.xml']
[2023-03-21 16:00:46,385] [23536] [MainThread] [INFO] (monailabel.datastore.local:576) - Invalidate count: 0
[2023-03-21 16:00:46,399] [23536] [MainThread] [INFO] (monailabel.datastore.local:150) - Start observing external modifications on datastore (AUTO RELOAD)
[2023-03-21 16:00:46,406] [23536] [MainThread] [INFO] (monailabel.utils.others.planner:36) - Reading datastore metadata for heuristic planner...
100%|██████████| 10/10 [00:08<00:00,  1.12it/s]
[2023-03-21 16:00:55,324] [23536] [MainThread] [INFO] (monailabel.utils.others.generic:161) - Using nvidia-smi command
[2023-03-21 16:00:55,461] [23536] [MainThread] [INFO] (monailabel.utils.others.planner:72) - Available GPU memory: {0: 177, 1: 23853} in MB
[2023-03-21 16:00:55,466] [23536] [MainThread] [INFO] (monailabel.utils.others.generic:161) - Using nvidia-smi command
-28.484375
invalid value encountered in log2
[2023-03-21 16:00:55,696] [23536] [MainThread] [ERROR] (uvicorn.error:119) - Traceback (most recent call last):
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\starlette\routing.py", line 635, in lifespan
    async with self.lifespan_context(app):
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\starlette\routing.py", line 530, in __aenter__
    await self._router.startup()
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\starlette\routing.py", line 612, in startup
    await handler()
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\app.py", line 106, in startup_event
    instance = app_instance()
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\interfaces\utils\app.py", line 51, in app_instance
    app = c(app_dir=app_dir, studies=studies, conf=conf)
  File "\\batfs-sb09-cifs\vmgr\sb09\ssidhwan\VirtualEnvs\Windows\Cuda11_3_Torch1_12\monailabel\sample-apps\radiology\main.py", line 101, in __init__
    super().__init__(
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\interfaces\app.py", line 99, in __init__
    self._datastore: Datastore = self.init_datastore()
  File "\\batfs-sb09-cifs\vmgr\sb09\ssidhwan\VirtualEnvs\Windows\Cuda11_3_Torch1_12\monailabel\sample-apps\radiology\main.py", line 113, in init_datastore
    self.planner.run(datastore)
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\utils\others\planner.py", line 75, in run
    self.spatial_size = self._get_target_img_size(np.mean(img_sizes, 0, np.int64))
  File "S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monailabel\utils\others\planner.py", line 97, in _get_target_img_size
    width_base_2 = int(2 ** np.round(np.log2(width)))
ValueError: cannot convert float NaN to integer

[2023-03-21 16:00:55,697] [23536] [MainThread] [ERROR] (uvicorn.error:56) - Application startup failed. Exiting.

To Reproduce Steps to reproduce the behavior:

On Windows 10

Install

$ python -m pip install --upgrade pip setuptools wheel
$ pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
$ pip install monailabel

Run commands monailabel start_server --app .\monailabel\sample-apps\radiology\ --studies 'C:\Users\ssidhwan\Desktop\imagesTr\' --conf models segmentation_spleen --conf heuristic_planner true

Expected behavior A way to tell the heuristic planner which GPU(s) to use instead of always using the first GPU.

Environment

Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.23.5
Pytorch version: 1.12.1+cu113
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12\lib\site-packages\monai\__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 5.0.1
scikit-image version: 0.20.0
Pillow version: 9.4.0
Tensorboard version: 2.12.0
gdown version: 4.6.4
TorchVision version: 0.13.1+cu113
tqdm version: 4.65.0
lmdb version: 1.4.0
psutil version: 5.9.4
pandas version: 1.5.3
einops version: 0.6.0
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: 2.2.1
pynrrd version: 0.4.3

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
System: Windows
Win32 version: ('10', '10.0.19045', 'SP0', 'Multiprocessor Free')
Win32 edition: Enterprise
Platform: Windows-10-10.0.19045-SP0
Processor: Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
Machine: AMD64
Python version: 3.9.13
Process name: python.exe
Command: ['C:\\Users\\ssidhwan\\AppData\\Local\\Programs\\Python\\Python39\\python.exe', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='C:\\Windows\\System32\\en-US\\kernel32.dll.mui', fd=-1), popenfile(path='C:\\Windows\\System32\\en-US\\tzres.dll.mui', fd=-1), popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1)]
Num physical CPUs: 6
Num logical CPUs: 12
Num usable CPUs: 12
CPU usage (%): [47.1, 8.5, 23.4, 10.2, 24.1, 16.2, 21.3, 26.9, 19.8, 10.4, 22.5, 28.5]
CPU freq. (MHz): 3600
Load avg. in last 1, 5, 15 mins (%): [0.0, 0.0, 0.0]
Disk usage (%): 30.9
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 63.7
Available memory (GB): 34.0
Used memory (GB): 29.6

================================
Printing GPU config...
================================
Num GPUs: 2
Has CUDA: True
CUDA version: 11.3
cuDNN enabled: True
cuDNN version: 8302
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37']
GPU 0 Name: NVIDIA TITAN RTX
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 24.0
GPU 0 CUDA capability (maj.min): 7.5
GPU 1 Name: Quadro P600
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 3
GPU 1 Total memory (GB): 2.0
GPU 1 CUDA capability (maj.min): 6.1

Additional context

$ nvidia-smi 
+-----------------------------------------------------------------------------+ 
| NVIDIA-SMI 516.94       Driver Version: 516.94       CUDA Version: 11.7     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. | 
|===============================+======================+======================| 
|   0  Quadro P600        WDDM  | 00000000:C1:00.0  On |                  N/A | 
| 34%   47C    P5    N/A /  N/A |   1723MiB /  2048MiB |     13%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   1  NVIDIA TITAN RTX   WDDM  | 00000000:E1:00.0 Off |                  N/A | 
| 41%   29C    P8     8W / 280W |   2069MiB / 24576MiB |      0%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+

tangy5 commented 1 year ago

Does running export CUDA_VISIBLE_DEVICES=1 before the monailabel start_server command, or running CUDA_VISIBLE_DEVICES=1 monailabel start_server ..., work in this case to avoid using GPU 0?

saileshsidhwani commented 1 year ago

Gave this a try:

$(Cuda11_3_Torch1_12) PS S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12> $Env:CUDA_VISIBLE_DEVICES = 1
$(Cuda11_3_Torch1_12) PS S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12> $Env:CUDA_VISIBLE_DEVICES
1
$(Cuda11_3_Torch1_12) PS S:\VirtualEnvs\Windows\Cuda11_3_Torch1_12> monailabel start_server --app .\monailabel\sample-apps\radiology\ --studies 'C:\Users\ssidhwan\Desktop\imagesTr\' --conf models segmentation_spleen --conf heuristic_planner true

But I ran into the same issue. Judging by the log line [MainThread] [INFO] (monailabel.utils.others.generic:161) - Using nvidia-smi command, I believe the planner uses the nvidia-smi command to get the available GPU memory. Setting the CUDA_VISIBLE_DEVICES environment variable does not affect the nvidia-smi output.
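As a sketch of how a memory map obtained from nvidia-smi could be made to honor CUDA_VISIBLE_DEVICES, the helper below filters the mapping by the indices listed in that variable. The function name `filter_visible_gpus` is hypothetical and nothing like it is confirmed to exist in MONAI Label. It handles only integer indices, skips UUID-style entries, and assumes the indices refer to nvidia-smi numbering. That last assumption matters here: in this report, print_debug_info lists the TITAN RTX as GPU 0 while nvidia-smi lists the Quadro P600 as GPU 0, so the two numberings do not agree on this machine unless the device order is aligned (for example with CUDA_DEVICE_ORDER=PCI_BUS_ID).

```python
import os
from typing import Dict, Optional


def filter_visible_gpus(memory_map: Dict[int, int], visible: Optional[str]) -> Dict[int, int]:
    """Keep only the entries whose index appears in a CUDA_VISIBLE_DEVICES string.

    Hypothetical helper: assumes integer indices that match nvidia-smi
    numbering; UUID-style entries are ignored in this sketch.
    """
    if visible is None:
        # Variable not set: every GPU stays visible.
        return dict(memory_map)
    wanted = []
    for token in visible.split(","):
        token = token.strip()
        if token.isdigit():
            wanted.append(int(token))
    # An empty or unparsable value yields an empty result here.
    return {idx: memory_map[idx] for idx in wanted if idx in memory_map}


if __name__ == "__main__":
    reported = {0: 177, 1: 23853}  # values from the server log, in MB
    print(filter_visible_gpus(reported, os.environ.get("CUDA_VISIBLE_DEVICES")))
```

Filtering before indexing means that taking the first remaining entry would then pick the GPU the user asked for, rather than whatever nvidia-smi lists first.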

tangy5 commented 1 year ago

Thanks for the updates.

Hi @diazandr3s, are you familiar with this planner? It looks like this index is fixed. If possible, we can test and update it so that users have flexibility in choosing GPUs.

https://github.com/Project-MONAI/MONAILabel/blob/853e4343c1387dd700e220b987b9146d8d4b1e0f/monailabel/utils/others/planner.py#L89

If this makes sense to you, I can create a PR for this. Thanks!

diazandr3s commented 1 year ago

Thanks for reporting this, @saileshsidhwani. Indeed, the index is fixed, @tangy5. The heuristic planner defines spacing and ROI. It is still in its early stages, though. I'd suggest you use the default ROI and spacing, as you have quite a good GPU. @tangy5, the heuristic planner may need some updates to also define pre-transform/data augmentation arguments. Happy to chat more about this.

Project-MONAI / MONAILabel

Issues with heuristic_planner on a multiple GPU setup #1349