About evaluation details

clannadcl commented 1 year ago

Hi, this is awesome work, but I am confused about the evaluation on Synapse during training.

Synapse contains eight foreground classes, so why does the log show 13 classes and evaluate on 13 classes during training? Also, is val_eval_criterion_MA the Dice metric?

Amshaker commented 1 year ago

Hi @clannadcl,

Thank you for your interest in our work.

Synapse and BTCV are both multi-organ segmentation tasks; the only differences are the data split and the number of organs considered. BTCV contains 13 classes, while Synapse contains only 8.

The validation shows the 13 classes but we only consider the eight classes of Synapse. Specifically, we report the model performance using Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (HD95) on 8 abdominal organs: spleen, right kidney, left kidney, gallbladder, liver, stomach, aorta, and pancreas.

I hope it is clear now.

Best regards, Abdelrahman.

clannadcl commented 1 year ago

Thanks! But what does val_eval_criterion_MA mean during training?

AbdelrahmanShakerYousef commented 1 year ago

1) This is the moving average of the validation criterion over the 8 classes of Synapse. It is an estimate of the mean validation criterion.

2) One more important point: during training, the DSC is estimated for each patch separately and then averaged over all patches. This scheme is not recommended for the final evaluation, where the DSC should be computed on the whole input volume (512x512xN) rather than on each patch (128x128x64) separately. For this reason, the numbers reported during training may be higher than those from the final validation.
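Both points can be illustrated with a small pure-Python sketch. This is not the repository's code: `update_moving_average` assumes an exponential moving average with a smoothing factor of 0.9 (an assumption about how `val_eval_criterion_MA` is updated), and `dice`, `patchwise_mean_dice`, and the toy flattened "volume" are hypothetical names and data chosen only to show why averaging per-patch DSC can differ from the DSC of the whole volume.

```python
def update_moving_average(previous, current, alpha=0.9):
    """Exponential moving average; alpha=0.9 is an assumed smoothing factor."""
    if previous is None:  # first epoch: nothing to smooth yet
        return current
    return alpha * previous + (1 - alpha) * current


def dice(pred, target):
    """Dice score for two equally long sequences of binary (0/1) labels."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    denom = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if denom == 0:
        # Convention assumed here: an empty prediction on an empty target
        # counts as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


def patchwise_mean_dice(pred, target, patch):
    """Split into consecutive patches, score each one, then average."""
    scores = [
        dice(pred[i:i + patch], target[i:i + patch])
        for i in range(0, len(pred), patch)
    ]
    return sum(scores) / len(scores)


# Toy flattened "volume": the organ occupies the first half only.
target = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 0]

whole = dice(pred, target)                             # one score for the full volume
patchwise = patchwise_mean_dice(pred, target, patch=4)  # mean of two patch scores

print(f"whole-volume DSC: {whole:.3f}, mean per-patch DSC: {patchwise:.3f}")
```

In this toy case the second patch contains no foreground at all, so under the assumed convention it scores a perfect 1.0 and pulls the per-patch average above the whole-volume score. The size and direction of the gap in practice depend on how empty patches are handled.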

clannadcl commented 1 year ago

Thank you very much for your kind response! I have one last question. During evaluation, the DSC is evaluated on the 13 classes and an overall value is shown. I wonder whether training with 13 classes affects the performance on the required 8 classes. Were the results in the paper obtained by training on 13 classes? If not, how can I change the settings to train with only 8 classes?

AbdelrahmanShakerYousef commented 1 year ago

As I mentioned, Synapse and BTCV are both multi-organ segmentation tasks built on the same dataset; the only differences are the data split and the number of organs considered during evaluation. To be consistent with the nnFormer codebase, we follow the same training and evaluation criteria. Both Synapse and BTCV are trained on 13 classes. The differences are:

(1) Data split: Synapse uses 18 samples for training and 12 for testing, while for BTCV you train on 24 samples, validate internally on 6, and submit your predictions to the server to get the test-set results.

(2) Evaluation: the DSC is computed on only 8 organs for Synapse, but on all 13 organs for BTCV.

The results in Table 1 of the paper are based on these criteria. All methods are trained and evaluated using the same data split and the same training and evaluation scheme.
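The evaluation difference can be sketched as follows. This is an illustration only, not the repository's evaluation script: the label-to-organ mapping assumes the commonly used BTCV label ordering (please check it against your own dataset.json), `mean_dsc` is a hypothetical helper, and the per-class scores are dummy placeholders, not real results.

```python
# Assumed BTCV label ordering (verify against your dataset.json).
BTCV_LABELS = {
    1: "spleen", 2: "right kidney", 3: "left kidney", 4: "gallbladder",
    5: "esophagus", 6: "liver", 7: "stomach", 8: "aorta",
    9: "inferior vena cava", 10: "portal and splenic vein", 11: "pancreas",
    12: "right adrenal gland", 13: "left adrenal gland",
}

# The 8 organs reported for Synapse, as listed earlier in this thread.
SYNAPSE_ORGANS = [
    "spleen", "right kidney", "left kidney", "gallbladder",
    "liver", "stomach", "aorta", "pancreas",
]


def mean_dsc(per_class_dsc, organs=None):
    """Average per-class DSC over the given organs (all classes if None)."""
    names = list(per_class_dsc) if organs is None else organs
    return sum(per_class_dsc[name] for name in names) / len(names)


# Dummy placeholder scores, NOT real results: 0.8 for the Synapse organs,
# 0.5 for the remaining five BTCV organs.
dummy_dsc = {name: 0.5 for name in BTCV_LABELS.values()}
for name in SYNAPSE_ORGANS:
    dummy_dsc[name] = 0.8

btcv_mean = mean_dsc(dummy_dsc)                     # averaged over all 13 organs
synapse_mean = mean_dsc(dummy_dsc, SYNAPSE_ORGANS)  # averaged over the 8 organs

print(f"13-organ mean: {btcv_mean:.3f}, 8-organ mean: {synapse_mean:.3f}")
```

The model and its 13-class predictions are the same in both cases; only the set of classes included in the average changes.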

I hope it is clear now.

clannadcl commented 1 year ago

Got it. Thanks again.

Amshaker / unetr_plus_plus

About evaluation details #8