MHubAI / models

Stores the MHub models' Dockerfiles and scripts.
MIT License

MHub / GC - gc_lunglobes > 1 GPU crash #69

Closed · silvandeleemput closed this issue 11 months ago

silvandeleemput commented 1 year ago

Originally posted by @LennyN95 in https://github.com/MHubAI/models/issues/42#issuecomment-1824579306

The issue

When running the MHub Docker container with the lobe segmentation code from this repository, with the --gpus all flag enabled on a host with 2 or more GPUs, we run into the following error:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.8/logging/__init__.py", line 373, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/app/models/gc_lunglobes/utils/LobeSegmentationRunner.py", line 50, in task
    handle = segment_lobe_init()
  File "/app/src/test.py", line 1830, in segment_lobe_init
    lobe_seg_instance = LobeSegmentationTSTestCOVID(settings)
  File "/app/src/test.py", line 1524, in __init__
    self.init()
  File "/app/src/test.py", line 700, in init
    self.logger.info("Let's use", torch.cuda.device_count(), "GPUs!")
Message: "Let's use"
Arguments: (2, 'GPUs!')
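
The root cause is visible in getMessage(): logging applies msg % args, so a print()-style call with extra positional arguments fails. A minimal stdlib-only reproduction, together with the %-style fix (logger name is hypothetical, n_gpus stands in for torch.cuda.device_count()):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("lobe_seg")  # hypothetical name for illustration

n_gpus = 2  # stand-in for torch.cuda.device_count()

# Broken: print()-style call. logging evaluates msg % args lazily, and a
# format string without placeholders cannot consume the extra arguments.
try:
    "Let's use" % (n_gpus, "GPUs!")  # what record.getMessage() effectively runs
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting

# Fixed: give logging a %-style format string that consumes every argument.
logger.info("Let's use %d GPUs!", n_gpus)
```

Note that logging swallows the TypeError via Handler.handleError and prints the "--- Logging error ---" block to stderr instead of crashing, which is why the run continues past this line.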

This error is caused by the logger.info call passing print()-style positional arguments instead of a %-style format string. Furthermore, if the logging line is removed, we get the following error:

  File "/app/models/gc_lunglobes/utils/LobeSegmentationRunner.py", line 51, in task
    seg_result_np = segment_lobe(handle, img_np, meta_dict)
  File "/app/src/test.py", line 1861, in segment_lobe
    pred = handle.run(transformed_data_dict)
  File "/app/src/test.py", line 1572, in run
    scan_level_inf = self.model.scan_level_inference(pad_scan).cpu().squeeze(0)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'scan_level_inference'

Upon inspecting the latter issue, it appears that getting the multi-GPU feature to work properly isn't as simple as wrapping the model in torch.nn.DataParallel: the wrapped model uses custom methods (i.e. scan_level_inference) for inference, and since DataParallel only overrides forward(), those methods are not reachable through the wrapper.
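
The behaviour can be illustrated without a GPU using a stand-in wrapper class (hypothetical names; torch.nn.DataParallel behaves analogously in that it only overrides forward() and does not delegate arbitrary attribute access to the wrapped module):

```python
class Model:
    """Stand-in for the lobe segmentation model."""
    def forward(self, x):
        return x

    def scan_level_inference(self, x):  # custom method, outside the nn.Module API
        return self.forward(x)


class DataParallelLike:
    """Forwards forward() to the wrapped module, but nothing else --
    mirroring how torch.nn.DataParallel only overrides forward()."""
    def __init__(self, module):
        self.module = module

    def forward(self, x):
        # (real DataParallel would scatter inputs / gather outputs here)
        return self.module.forward(x)


model = DataParallelLike(Model())
print(hasattr(model, "scan_level_inference"))   # False -> AttributeError in test.py
print(model.module.scan_level_inference(3))     # 3: the usual workaround via .module
```

A common fix for this pattern is to call custom methods through the .module attribute of the wrapper (or to keep an unwrapped reference for them), but that requires touching every such call site.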

Suggested fix

As properly fixing the multi-GPU feature would be a substantial amount of work, the broken feature could instead be disabled entirely by removing the following lines:

https://github.com/DIAGNijmegen/bodyct-pulmonary-lobe-segmentation/blob/5a64b70504d46c042c30851a69cec370f1202e67/test.py#L699-L702
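
Until a fixed release is available, a possible workaround on multi-GPU hosts (an assumption, not verified here) is to expose only a single GPU to the container, so that torch.cuda.device_count() returns 1 and the multi-GPU branch is never taken:

```shell
# Workaround sketch: expose only GPU 0 to the container instead of all GPUs.
docker run --rm -it --gpus device=0 \
  -v /absolute/path/to/dicom/data/:/app/data/input_data:ro \
  mhubai/gc_lunglobes:v1 --workflow default
```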

LennyN95 commented 1 year ago

Thank you @silvandeleemput for creating the issue.

We get the following error (if # GPUs on host machine > 1):

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.8/logging/__init__.py", line 373, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/run.py", line 424, in <module>
    run(config_file)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/run.py", line 365, in run
    module(
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/Module.py", line 77, in execute
    self.task()
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 186, in wrapper
    func(self, instance, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 213, in wrapper
    func(self, instance, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 300, in wrapper
    func(self, instance, *args, **kwargs)
  File "/app/models/gc_lunglobes/utils/LobeSegmentationRunner.py", line 50, in task
    handle = segment_lobe_init()
  File "/app/src/test.py", line 1830, in segment_lobe_init
    lobe_seg_instance = LobeSegmentationTSTestCOVID(settings)
  File "/app/src/test.py", line 1524, in __init__
    self.init()
  File "/app/src/test.py", line 700, in init
    self.logger.info("Let's use", torch.cuda.device_count(), "GPUs!")
Message: "Let's use"
Arguments: (2, 'GPUs!')
LennyN95 commented 1 year ago

The error can be reproduced using the stable MHub release v1:

docker run --rm -it --gpus all -v /absolute/path/to/dicom/data/:/app/data/input_data:ro mhubai/gc_lunglobes:v1 --workflow default --print

Note that, for demonstration purposes, we only need to map an input directory into the container. To review the generated output, an output directory can be specified by adding -v /absolute/path/to/output/folder:/app/data/output_data before the image specification (mhubai/gc_lunglobes:v1).
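
For reference, the full command with both the input and output directories mapped (paths are placeholders to be replaced with absolute host paths) would be:

```shell
# Reproduce the crash and keep the generated output on the host.
docker run --rm -it --gpus all \
  -v /absolute/path/to/dicom/data/:/app/data/input_data:ro \
  -v /absolute/path/to/output/folder:/app/data/output_data \
  mhubai/gc_lunglobes:v1 --workflow default --print
```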

silvandeleemput commented 1 year ago

Subsequently, if you remove the logger line you get the following error:

ERROR: LobeSegmentationRunner failed processing instance <I:/app/data/sorted_data/1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192>: 'DataParallel' object has no attribute 'scan_level_inference' in Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 186, in wrapper
    func(self, instance, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 213, in wrapper
    func(self, instance, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mhubio/core/IO.py", line 300, in wrapper
    func(self, instance, *args, **kwargs)
  File "/app/models/gc_lunglobes/utils/LobeSegmentationRunner.py", line 51, in task
    seg_result_np = segment_lobe(handle, img_np, meta_dict)
  File "/app/src/test.py", line 1861, in segment_lobe
    pred = handle.run(transformed_data_dict)
  File "/app/src/test.py", line 1572, in run
    scan_level_inf = self.model.scan_level_inference(pad_scan).cpu().squeeze(0)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'scan_level_inference'
silvandeleemput commented 1 year ago

@LennyN95 The bug has been resolved in the original repository under a new release, shall I add this update under #42 or shall I make a new PR?

LennyN95 commented 1 year ago

Is it an entirely new model or an updated version of the original one? Do the ModelCard details (training, testing, evaluation, ..) still apply to the new release, and does the model meet our general requirements (license, maintenance, ..)? If so, updating here is fine.

silvandeleemput commented 1 year ago

It is just a bug fix for the multi-GPU support. The model hasn't changed. Everything should be the same. So I'll update it under #42.