drbeh closed this 1 week ago
Hi @drbeh, I ran bundle inference on these cases on my machine and got the following results:
[{'spleen': 0.9641091227531433},
{'aorta': 0.9692426323890686},
{'liver': 0.9203331470489502, 'hepatic tumor': 0.7033719420433044},
{'lung tumor': 0.8624954223632812},
{'colon cancer primaries': 0.8006274104118347},
{'stomach': 0.932415246963501,
'inferior vena cava': 0.9008662700653076,
'pancreas': 0.7037838101387024,
'vertebrae L1': 0.9787881374359131,
'vertebrae T8': 0.9794131517410278,
'brain': 0.8486177921295166},
{'left rib 8': 0.9291888475418091,
'right rib 3': 0.9454008936882019,
'right rib 12': 0.9586374759674072,
'right iliopsoas': 0.8804903030395508,
'heart': 0.9503232836723328}]
These match your benchmark data (and differ from your baseline data). In addition, the inference results were reproducible across multiple rounds of testing on my side.
The data above was produced by the non-TensorRT model. For TensorRT inference, the results are similar (and also reproducible):
{'aorta': 0.9691897630691528},
{'liver': 0.9203130006790161, 'hepatic tumor': 0.7031749486923218},
{'lung tumor': 0.8627061247825623},
{'colon cancer primaries': 0.8004928827285767},
{'stomach': 0.9326068758964539,
'inferior vena cava': 0.9012161493301392,
'pancreas': 0.7041643857955933,
'vertebrae L1': 0.9787870049476624,
'vertebrae T8': 0.9794606566429138,
'brain': 0.8487906455993652},
{'left rib 8': 0.9273821115493774,
'right rib 3': 0.9454008936882019,
'right rib 12': 0.9577922224998474,
'right iliopsoas': 0.8795918226242065,
'heart': 0.9503096342086792}]
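To make "similar" concrete, here is a small sketch that computes the per-class absolute difference between the two listings above. The scores are copied from this comment and flattened across cases; `spleen` is left out because the TensorRT listing as pasted does not include it. The helper name `dice_deltas` and the `TOLERANCE` value are my own choices for illustration, not something defined by the bundle.

```python
# Dice scores copied from the two listings above, flattened across cases.
# 'spleen' is omitted: the pasted TensorRT listing has no entry for it.
pytorch_dice = {
    "aorta": 0.9692426323890686,
    "liver": 0.9203331470489502,
    "hepatic tumor": 0.7033719420433044,
    "lung tumor": 0.8624954223632812,
    "colon cancer primaries": 0.8006274104118347,
    "stomach": 0.932415246963501,
    "inferior vena cava": 0.9008662700653076,
    "pancreas": 0.7037838101387024,
    "vertebrae L1": 0.9787881374359131,
    "vertebrae T8": 0.9794131517410278,
    "brain": 0.8486177921295166,
    "left rib 8": 0.9291888475418091,
    "right rib 3": 0.9454008936882019,
    "right rib 12": 0.9586374759674072,
    "right iliopsoas": 0.8804903030395508,
    "heart": 0.9503232836723328,
}
tensorrt_dice = {
    "aorta": 0.9691897630691528,
    "liver": 0.9203130006790161,
    "hepatic tumor": 0.7031749486923218,
    "lung tumor": 0.8627061247825623,
    "colon cancer primaries": 0.8004928827285767,
    "stomach": 0.9326068758964539,
    "inferior vena cava": 0.9012161493301392,
    "pancreas": 0.7041643857955933,
    "vertebrae L1": 0.9787870049476624,
    "vertebrae T8": 0.9794606566429138,
    "brain": 0.8487906455993652,
    "left rib 8": 0.9273821115493774,
    "right rib 3": 0.9454008936882019,
    "right rib 12": 0.9577922224998474,
    "right iliopsoas": 0.8795918226242065,
    "heart": 0.9503096342086792,
}


def dice_deltas(a, b):
    """Return {class: |a - b|} for the classes present in both mappings."""
    return {name: abs(a[name] - b[name]) for name in a if name in b}


deltas = dice_deltas(pytorch_dice, tensorrt_dice)
worst = max(deltas, key=deltas.get)

TOLERANCE = 5e-3  # assumed threshold, chosen only for this illustration

# Print classes from largest to smallest difference.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    flag = "" if delta <= TOLERANCE else "  <-- exceeds tolerance"
    print(f"{name:24s} {delta:.6f}{flag}")
```

With these numbers the largest gap is on `left rib 8` (about 0.0018), and `right rib 3` is identical in both runs.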
**Describe the bug**
We tried to benchmark VISTA-3D for accuracy (Dice score), so we ran the evaluation once locally to create baselines and once in our CI pipeline to create benchmarks. However, we cannot reproduce these metrics: the baseline and benchmark values differ:
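For readers unfamiliar with the metric, the following is a minimal sketch of a per-class Dice score on flat binary masks, Dice = 2·|P ∩ T| / (|P| + |T|). It is only an illustration of the formula, not the metric implementation the bundle uses; the function name `dice_score` and the choice to return NaN for two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |P & T| / (|P| + |T|) for flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; returning NaN here is an assumed convention.
        return float("nan")
    return 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 ground-truth voxels, 2 of them overlapping.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)
print(f"{score:.4f}")
```

Because the score is a ratio of voxel counts, a handful of voxels flipping near a boundary is enough to move it in the third or fourth decimal place for small structures.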
Here are the test cases that we used:
and here are the test cases for speed:
**Environment**
The baseline and benchmark are run on different machines but in the same container.
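Since the container is the same but the machines differ, one way to narrow this down might be to dump a small environment fingerprint on each machine and diff the two outputs. The sketch below is only a suggestion: the function name `environment_fingerprint` is hypothetical, and the PyTorch fields are reported only if `torch` can be imported.

```python
import json
import platform
import sys


def environment_fingerprint():
    """Collect basic host and (if available) PyTorch/GPU details for comparison."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; report only the host details.
        info["torch"] = None
        return info

    info["torch"] = str(torch.__version__)
    info["cuda_runtime"] = torch.version.cuda
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        # GPU model and compute capability are the parts most likely to differ
        # between two machines that share a container image.
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = list(torch.cuda.get_device_capability(0))
    return info


fingerprint = environment_fingerprint()
print(json.dumps(fingerprint, indent=2))
```

Running this on both the local machine and the CI runner and comparing the JSON would show whether the GPU model or driver-level libraries differ even though the container image is identical.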