reproducibility - GitHub Issues

Hi, I downloaded the full version of the cross-site prostate dataset (HK and BIDMC) provided in the README, then ran python main.py. The testing results are:

Training Done! Start testing
=> Loading checkpoint 'log/1109/prostate_8bc_lr1e-3_UNetODADA_3step121_GC_C/2018/folder3/UNet_DA_loss_MedT/best_score_2018_checkpoint.pth.tar'
=> Loaded saved the best model at (epoch 315)
100%|██████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 26.47it/s]
avg_surf_dist_3D: (0.36792447706956555, 0.2899364095854785)
hd_dist_95_3D: 2.0
surface_overlap_3D: (0.9138823276712713, 0.9249319974762614)
surface_dice_3D: 0.919245792444278
volume_dice_3D: 0.889567523506653
The mean asd_2D:  2.6425; The ads_2D std:  1.0628
The  mean dice:  0.9230; The  dice std:  0.0545
The  mean IoU:  0.7537; The  IoU std:  0.1504
The  mean ACC:  0.9913; The  ACC std:  0.0038
The  mean sensitive:  0.9056; The  sensitive std:  0.0592
The  mean specificy:  0.9947; The  specificy std:  0.0043
The  mean precision:  0.8234; The  precision std:  0.1692
The  mean f1_score:  0.8506; The  f1_score std:  0.1080
The  mean Jaccard_M:  0.7542; The  Jaccard_M std:  0.1504
The  mean Jaccard_N:  0.9910; The  Jaccard_N std:  0.0039
The  mean Jaccard:  0.7542; The  Jaccard std:  0.1504
The  mean dc:  0.8506; The  dc std:  0.1080
The inference time:  0.2975
Number of trainable parameters 60568132 in Model UNet_DA
100%|██████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 35.44it/s]
avg_surf_dist_3D: (0.8326537194762975, 0.7805250157971086)
hd_dist_95_3D: 3.1622776601683795
surface_overlap_3D: (0.7670339066074607, 0.7540540050210169)
surface_dice_3D: 0.7605979811923145
volume_dice_3D: 0.8009900649730138
The mean asd_2D:  3.8848; The ads_2D std:  1.3398
The  mean dice:  0.8838; The  dice std:  0.0518
The  mean IoU:  0.6407; The  IoU std:  0.1330
The  mean ACC:  0.9895; The  ACC std:  0.0032
The  mean sensitive:  0.8097; The  sensitive std:  0.1096
The  mean specificy:  0.9933; The  specificy std:  0.0029
The  mean precision:  0.7475; The  precision std:  0.1224
The  mean f1_score:  0.7730; The  f1_score std:  0.1025
The  mean Jaccard_M:  0.6411; The  Jaccard_M std:  0.1330
The  mean Jaccard_N:  0.9893; The  Jaccard_N std:  0.0032
The  mean Jaccard:  0.6411; The  Jaccard std:  0.1330
The  mean dc:  0.7730; The  dc std:  0.1025
The inference time:  0.1452
Number of trainable parameters 60568132 in Model UNet_DA
Testing Done!

The upper block is for test dataset a (BIDMC), and the lower block is for test dataset b (HK). The lower block doesn't fully match the results reported in Table 1 of the paper. For example:

sensitivity (0.8097, i.e. 80.97%, in output vs. 87.43 in paper Table 1)
HD (hd_dist_95_3D is 3.1622 in output vs. 7.78 in paper Table 1)
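
To make the gap concrete, here is a minimal sketch that puts the two numbers on the same scale. It assumes the paper reports sensitivity in percent while the script prints a fraction, and that the paper's HD is comparable to hd_dist_95_3D; neither is confirmed. The helper names (to_percent, gap) are hypothetical and not part of the ODADA code.

```python
def to_percent(value: float) -> float:
    """Convert a fraction in [0, 1] to percent.

    Assumption: values above 1 are already expressed in percent.
    """
    return value * 100.0 if value <= 1.0 else value


def gap(output: float, paper: float, as_percent: bool) -> float:
    """Return paper value minus output value, optionally on a percent scale."""
    if as_percent:
        return to_percent(paper) - to_percent(output)
    return paper - output


# (output value, paper Table 1 value, compare as percent?)
# Numbers are copied from the test log for dataset b (HK) and from the
# comparison lines above; nothing else is taken from the paper.
comparisons = {
    "sensitivity": (0.8097, 87.43, True),
    "HD": (3.1622, 7.78, False),
}

for name, (output, paper, as_percent) in comparisons.items():
    print(f"{name}: paper minus output = {gap(output, paper, as_percent):.4f}")
```

Under these assumptions, sensitivity is about 6.5 percentage points below the paper and the HD value is less than half of the reported one, so the two metrics deviate in opposite directions (worse sensitivity, better HD).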

YonghengSun1997 / ODADA

reproducibility #4