VISTA-3D: Non-reproducible Dice Scores

drbeh commented 2 weeks ago

**Describe the bug**

We are benchmarking VISTA-3D for accuracy (Dice score). We ran the test cases once locally to create baselines and once in our CI pipeline to create benchmarks. However, we cannot reproduce the metrics: the baseline and benchmark scores differ.

| Test Case | Class | Baseline Dice | Benchmark Dice |
| --- | --- | --- | --- |
| case1 | spleen | 0.9652 | 0.964 |
| case2 | aorta | 0.9682 | 0.969 |
| case3 | liver | 0.9177 | 0.92 |
| | hepatic tumor | 0.684 | 0.703 |
| case4 | lung tumor | 0.8779 | 0.863 |
| case5 | colon cancer primaries | 0.8421 | 0.801 |
| case6 | stomach | 0.936 | 0.932 |
| | inferior vena cava | 0.9087 | 0.901 |
| | pancreas | 0.781 | 0.703 |
| | vertebrae L1 | 0.9793 | 0.979 |
| | vertebrae T8 | 0.9769 | 0.979 |
| | brain | 0.8203 | 0.849 |
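
To make the gaps easier to read, here is a minimal sketch that ranks the classes by the absolute difference between the two columns. The values are copied from the table above; the helper name `dice_deltas` and the `TOLERANCE` threshold are hypothetical and chosen only for illustration.

```python
# Dice scores copied from the table above: class -> (baseline, benchmark).
scores = {
    "spleen": (0.9652, 0.964),
    "aorta": (0.9682, 0.969),
    "liver": (0.9177, 0.92),
    "hepatic tumor": (0.684, 0.703),
    "lung tumor": (0.8779, 0.863),
    "colon cancer primaries": (0.8421, 0.801),
    "stomach": (0.936, 0.932),
    "inferior vena cava": (0.9087, 0.901),
    "pancreas": (0.781, 0.703),
    "vertebrae L1": (0.9793, 0.979),
    "vertebrae T8": (0.9769, 0.979),
    "brain": (0.8203, 0.849),
}

# Hypothetical threshold for illustration only, not a project-defined value.
TOLERANCE = 0.005


def dice_deltas(scores):
    """Return (class, benchmark - baseline) pairs, largest absolute gap first."""
    deltas = [(name, bench - base) for name, (base, bench) in scores.items()]
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


for name, delta in dice_deltas(scores):
    flag = "DIFF" if abs(delta) > TOLERANCE else "ok"
    print(f"{name:<24} {delta:+.4f} {flag}")
```

With these numbers, pancreas shows the largest gap, followed by colon cancer primaries and brain.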

Here are the test cases that we used:

test_case:
  case1:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/spleen_2_image.nii.gz
    label: spleen_2_label.nii.gz
    gt:
      spleen: 1
    prompts: 
      classes: 
      - spleen   

  case2:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/s0996.nii
    label: s0996_seg.nii
    gt:
      aorta: 7
    prompts: 
      classes: 
      - aorta

  case3:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/liver_129.nii.gz
    label: liver_129_seg.nii.gz
    gt:
      liver: 1
      hepatic tumor: 2
    prompts: 
      classes: 
      - liver 
      - hepatic tumor

  case4:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/lung_034.nii.gz
    label: lung_034_seg.nii.gz
    gt:
      lung tumor: 1
    prompts: 
      classes: 
      - lung tumor

  case5:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/colon_203.nii
    label: colon_203_seg.nii
    gt:
      colon cancer primaries: 1
    prompts: 
      classes: 
      - colon cancer primaries

  case6:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/s0459.nii
    label: s0459_seg.nii
    gt:
      stomach: 6
      inferior vena cava: 8
      pancreas: 10
      vertebrae L1: 22
      vertebrae T8: 27
      brain: 44

    prompts:
      classes:
      - stomach
      - inferior vena cava
      - pancreas
      - vertebrae L1
      - vertebrae T8
      - brain

  case7:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/s0675.nii
    label: s0675_seg.nii
    gt:
      left rib 8: 59
      right rib 3: 66
      right rib 12: 75
      right iliopsoas: 96
      heart: 105
    prompts:
      classes:
      - left rib 8
      - right rib 3
      - right rib 12
      - right iliopsoas
      - heart
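
For context, the sketch below shows one common way to compute a per-class Dice score, assuming each `gt` entry maps a class name to its label index in the ground-truth volume. It is pure Python over flattened voxel sequences for clarity; the helper name `dice_for_class` is hypothetical, and this is not necessarily how the benchmark itself computes the metric.

```python
def dice_for_class(pred_mask, gt_labels, gt_index):
    """Dice = 2|P ∩ G| / (|P| + |G|) for a single class.

    pred_mask: flattened binary prediction for the class (truthy = foreground).
    gt_labels: flattened ground-truth label map (integer label per voxel).
    gt_index:  label index of the class in the ground truth, e.g. 6 for stomach.
    """
    if len(pred_mask) != len(gt_labels):
        raise ValueError("prediction and ground truth must have the same size")
    pred = [bool(p) for p in pred_mask]
    gt = [g == gt_index for g in gt_labels]
    intersection = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 ground-truth stomach voxels, 2 predicted, both inside the organ.
gt_labels = [0, 6, 6, 6, 10, 10, 0, 0]
pred_mask = [0, 1, 1, 0, 0, 0, 0, 0]
print(dice_for_class(pred_mask, gt_labels, gt_index=6))
```

Because Dice is a ratio of voxel counts, small classes such as tumors shift more per misclassified voxel than large organs do.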

And here are the test cases for speed:

test_case:
  case1:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/256cubic.nii.gz
    prompts: 
      classes: []
    size: [256, 256, 256]

  case2:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/256cubic.nii.gz
    prompts: 
      classes:
      - spleen
    size: [256, 256, 256]

  case3:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/512cubic.nii.gz
    prompts: 
      classes: []
    size: [512, 512, 512]

  case4:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/512cubic.nii.gz
    prompts: 
      classes:
      - liver
    size: [512, 512, 512]

  case5:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/512-768.nii.gz
    prompts: 
      classes: []
    size: [512, 512, 768]

  case6:
    image: https://vista3d-nim-test-images.s3.amazonaws.com/test-vista3d-nim-image-url/512-768.nii.gz
    prompts: 
      classes:
      - heart
    size: [512, 512, 768]

Environment

The baseline and benchmark are run on different machines but in the same container.

================================
Printing MONAI config...
================================
MONAI version: 1.4.0rc9
Numpy version: 1.24.4
Pytorch version: 2.5.0a0+872d972e41.nv24.08.01
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: fa1c1af79ef5387434f2a76744f75b5aaca09f0b
MONAI __file__: /usr/local/lib/python3.10/dist-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.4.0
Nibabel version: 5.2.1
scikit-image version: 0.23.2
scipy version: 1.13.1
Pillow version: 10.4.0
Tensorboard version: 2.17.0
gdown version: 5.2.0
TorchVision version: 0.20.0a0
tqdm version: 4.66.4
lmdb version: 1.5.1
psutil version: 5.9.8
pandas version: 2.2.2
einops version: 0.7.0
transformers version: 4.40.2
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 1.0.0
clearml version: 1.16.3

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 22.04.5 LTS
Platform: Linux-6.8.0-41-generic-x86_64-with-glibc2.35
Processor: x86_64
Machine: x86_64
Python version: 3.10.12
Process name: pt_main_thread
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 8
Num logical CPUs: 16
Num usable CPUs: 16
CPU usage (%): [4.2, 2.8, 2.5, 2.1, 2.5, 2.6, 2.1, 17.5, 2.6, 2.6, 2.1, 2.0, 2.1, 1.8, 78.3, 7.2]
CPU freq. (MHz): 1839
Load avg. in last 1, 5, 15 mins (%): [3.5, 1.8, 0.6]
Disk usage (%): 38.2
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 125.7
Available memory (GB): 121.3
Used memory (GB): 2.9

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 12.6
cuDNN enabled: True
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: 1
cuDNN version: 90400
Current device: 0
Library compiled for CUDA architectures: ['sm_70', 'sm_72', 'sm_75', 'sm_80', 'sm_86', 'sm_87', 'sm_90', 'compute_90']
GPU 0 Name: NVIDIA A40
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 44.4
GPU 0 CUDA capability (maj.min): 8.6

yiheng-wang-nv commented 1 week ago

Hi @drbeh, I ran bundle inference on these cases on my machine and got the following results:

[{'spleen': 0.9641091227531433},
 {'aorta': 0.9692426323890686},
 {'liver': 0.9203331470489502, 'hepatic tumor': 0.7033719420433044},
 {'lung tumor': 0.8624954223632812},
 {'colon cancer primaries': 0.8006274104118347},
 {'stomach': 0.932415246963501,
  'inferior vena cava': 0.9008662700653076,
  'pancreas': 0.7037838101387024,
  'vertebrae L1': 0.9787881374359131,
  'vertebrae T8': 0.9794131517410278,
  'brain': 0.8486177921295166},
 {'left rib 8': 0.9291888475418091,
  'right rib 3': 0.9454008936882019,
  'right rib 12': 0.9586374759674072,
  'right iliopsoas': 0.8804903030395508,
  'heart': 0.9503232836723328}]

These match your benchmark data (and differ from your baseline data). In addition, the inference results were reproducible across multiple rounds of testing.

yiheng-wang-nv commented 1 week ago

The above data was produced by the non-TensorRT model. For TensorRT inference, the results are similar (and also reproducible):

 {'aorta': 0.9691897630691528},
 {'liver': 0.9203130006790161, 'hepatic tumor': 0.7031749486923218},
 {'lung tumor': 0.8627061247825623},
 {'colon cancer primaries': 0.8004928827285767},
 {'stomach': 0.9326068758964539,
  'inferior vena cava': 0.9012161493301392,
  'pancreas': 0.7041643857955933,
  'vertebrae L1': 0.9787870049476624,
  'vertebrae T8': 0.9794606566429138,
  'brain': 0.8487906455993652},
 {'left rib 8': 0.9273821115493774,
  'right rib 3': 0.9454008936882019,
  'right rib 12': 0.9577922224998474,
  'right iliopsoas': 0.8795918226242065,
  'heart': 0.9503096342086792}]
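
As a rough check on how close the two backends are, the sketch below compares the two result sets posted above over the classes present in both (the TensorRT list as posted has no spleen entry, so spleen is skipped). The helper name `max_abs_diff` is only illustrative.

```python
# Values copied from the two result lists above, flattened to class -> Dice.
non_tensorrt = {
    "spleen": 0.9641091227531433,
    "aorta": 0.9692426323890686,
    "liver": 0.9203331470489502,
    "hepatic tumor": 0.7033719420433044,
    "lung tumor": 0.8624954223632812,
    "colon cancer primaries": 0.8006274104118347,
    "stomach": 0.932415246963501,
    "inferior vena cava": 0.9008662700653076,
    "pancreas": 0.7037838101387024,
    "vertebrae L1": 0.9787881374359131,
    "vertebrae T8": 0.9794131517410278,
    "brain": 0.8486177921295166,
    "left rib 8": 0.9291888475418091,
    "right rib 3": 0.9454008936882019,
    "right rib 12": 0.9586374759674072,
    "right iliopsoas": 0.8804903030395508,
    "heart": 0.9503232836723328,
}
tensorrt = {
    "aorta": 0.9691897630691528,
    "liver": 0.9203130006790161,
    "hepatic tumor": 0.7031749486923218,
    "lung tumor": 0.8627061247825623,
    "colon cancer primaries": 0.8004928827285767,
    "stomach": 0.9326068758964539,
    "inferior vena cava": 0.9012161493301392,
    "pancreas": 0.7041643857955933,
    "vertebrae L1": 0.9787870049476624,
    "vertebrae T8": 0.9794606566429138,
    "brain": 0.8487906455993652,
    "left rib 8": 0.9273821115493774,
    "right rib 3": 0.9454008936882019,
    "right rib 12": 0.9577922224998474,
    "right iliopsoas": 0.8795918226242065,
    "heart": 0.9503096342086792,
}


def max_abs_diff(a, b):
    """Return (class, |a - b|) for the largest gap among classes in both dicts."""
    shared = a.keys() & b.keys()
    worst = max(shared, key=lambda k: abs(a[k] - b[k]))
    return worst, abs(a[worst] - b[worst])


worst_class, gap = max_abs_diff(non_tensorrt, tensorrt)
print(f"largest gap: {worst_class} ({gap:.4f})")
```

On these numbers the largest gap is for left rib 8, at just under 0.002.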

Project-MONAI / model-zoo

VISTA-3D: Non-reproducible Dice Scores #687