QtacierP / PRIOR

Official repository for the paper "Prototype Representation Joint Learning from Medical Images and Reports" (ICCV 2023).

Results difference of compared methods #12

Open Markin-Wang opened 2 months ago

Markin-Wang commented 2 months ago

Hi, thank you for your work and for sharing the code. I found that the results of the compared methods, especially on the SIIM segmentation task, are very different from the results reported in the original papers, e.g., MGCA and LoVT. Even the 100% setting (without the influence of a different data split) is very different. Could you kindly explain the possible reasons behind this? For example, a different fine-tuning protocol?

QtacierP commented 2 months ago

Thank you for your careful comparison and suggestions. In fact, there are several differences between our work (PRIOR) and the previous work, particularly in the SIIM segmentation task:

  1. For the SIIM segmentation task, we exclusively use pneumothorax-positive samples (2,669 in total) with a five-fold cross-validation approach. In contrast, previous work sampled the train/validation/test sets at a 7:3:3 ratio using ALL images.

  2. The pre-training data also differs slightly: we exclude short reports containing fewer than four sentences, resulting in 182,475 image-report pairs, whereas the previous work utilized all lateral-view images.

  3. For classification, we employ a different split ratio of 6:2:2 and conduct 5 runs with different random seeds.

  4. In the detection task, we focus solely on pneumonia-positive samples (6,012 in total) using five-fold cross-validation. In previous work, ALL samples were randomly split into 16,010/5,337/5,337 for training/validation/testing.

  5. Both the segmentation and detection tasks use different heads (architectures) from previous work: we use the UNet from the Segmentation Models PyTorch (SMP) library for segmentation and Faster R-CNN from TorchVision for detection (a minimal sketch follows below this list).
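
To make point 5 concrete, here is a minimal sketch of how such heads could be instantiated, assuming the `segmentation_models_pytorch` and `torchvision` packages; the encoder name, weight initialization, and class counts below are illustrative assumptions rather than our exact configuration:

```python
# Sketch of the downstream heads mentioned in point 5. The encoder name,
# weight initialization, and class counts are illustrative assumptions,
# not the exact configuration used in the paper.
import segmentation_models_pytorch as smp
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# SIIM segmentation: UNet from the SMP library with a ResNet-50 encoder;
# binary pneumothorax mask -> 1 output channel.
seg_model = smp.Unet(
    encoder_name="resnet50",
    encoder_weights=None,  # load pre-trained image-encoder weights here instead
    in_channels=3,
    classes=1,
)

# Pneumonia detection: Faster R-CNN from TorchVision;
# background + pneumonia -> 2 classes.
det_model = fasterrcnn_resnet50_fpn(
    weights=None,
    num_classes=2,
)
```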

For Points 1 and 4, the negative samples have no ROI, so including them would artificially inflate the Dice/mIoU scores (the network only needs to predict an empty mask for them). Therefore, we use only the positive samples for training.
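
As an illustration of this positive-only five-fold protocol, a split could be built along the following lines; `annotations` and its `has_roi` flag are hypothetical placeholders, not our actual data-loading code:

```python
# Sketch of the positive-only five-fold protocol from points 1 and 4.
# `annotations` and its `has_roi` flag are hypothetical placeholders for
# however the SIIM / pneumonia labels are actually stored.
from sklearn.model_selection import KFold

def make_positive_folds(annotations, n_splits=5, seed=42):
    # Drop negatives (no lesion mask / box), so Dice and detection metrics
    # are not inflated by trivially empty predictions.
    positives = [sample for sample in annotations if sample["has_roi"]]

    folds = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(positives):
        folds.append({
            "train": [positives[i] for i in train_idx],
            "test": [positives[i] for i in test_idx],
        })
    return folds
```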

I hope this information is helpful :)

Markin-Wang commented 2 months ago


Hi, thank you for your reply. Previous works like MGCA also only use the frontal-view samples in pre-training. I guess the main reason is the five-fold cross-validation. By the way, do you have any plans to release the fine-tuning code for the downstream tasks?

Best Regards, Jun

QtacierP commented 2 months ago

Currently, we do not plan to release the fine-tuning code, as this part of our project is quite extensive. However, the main implementation is relatively straightforward, and you can find all hyperparameters and details in the supplementary material.

We are working on PRIOR v2, which will feature more refined fine-tuning protocols. Once it is published, we will include the entire pipeline, from pre-training to evaluation.