AImageLab-zip / ToothFairy

Repository for the ToothFairy challenges (MICCAI 2023 & MICCAI 2024)
MIT License

Low Prediction Metrics #9

Closed · PengchengShi1220 closed this issue 1 month ago

PengchengShi1220 commented 1 month ago

Dear Organizers,

I have encountered an unusual issue with my prediction metrics. When I test locally and on the Grand Challenge Algorithm platform, the Oral-pharyngeal segmentation outputs closely match each other. However, after submitting my results to the Preliminary Phase, the metrics are extremely low.

To investigate further, I used the evaluation script provided at evaluation.py to calculate Dice and Hausdorff Distance (HD) locally. The results were significantly different from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction appears consistent with the ground truth in ITK-SNAP 3D visualization), the average Dice coefficient I calculated locally is 0.914. However, the platform displays a Dice coefficient of 0.091 (as seen here).
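
For reference, a minimal sketch of this kind of per-label Dice check on .mha files (HD omitted for brevity) is shown below; it is not the official evaluation.py, and the paths are placeholders.

```python
# Minimal per-label Dice check on .mha files (not the official evaluation.py).
import SimpleITK as sitk
import numpy as np

def dice_per_label(gt_path, pred_path):
    gt = sitk.GetArrayFromImage(sitk.ReadImage(gt_path))
    pred = sitk.GetArrayFromImage(sitk.ReadImage(pred_path))
    scores = {}
    for label in np.unique(gt):
        if label == 0:  # skip background
            continue
        g, p = gt == label, pred == label
        denom = g.sum() + p.sum()
        scores[int(label)] = 2.0 * np.logical_and(g, p).sum() / denom if denom else 1.0
    return scores

# Placeholder paths:
# scores = dice_per_label("labelsTr/ToothFairy2F_065.mha", "prediction/ToothFairy2F_065.mha")
# print("mean Dice:", np.mean(list(scores.values())))
```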

This discrepancy is not limited to the Dice coefficient; it also affects HD and other cases. I would like to inquire if the input images and ground truth labels you use internally are the same as those in imagesTr and labelsTr. Have you performed similar tests to validate the consistency of the evaluation script? Could this issue be related to discrepancies in the evaluation script or the Grand Challenge platform itself?

Thank you for your assistance in resolving this matter.

Best, Pengcheng

LucaLumetti commented 1 month ago

Dear PengchengShi1220,

I'm sorry for not getting back to you sooner. We have thoroughly investigated the issue you're experiencing with the evaluation Docker.

The results were significantly different from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction appears consistent with the ground truth in ITK-SNAP 3D visualization), the average Dice coefficient I calculated locally is 0.914. However, the platform displays a Dice coefficient of 0.091

Upon reviewing your algorithm output on the Grand Challenge platform, it appears significantly different from the screenshot you provided. This discrepancy aligns with the lower Dice score you observed on the Grand Challenge platform. You can view and download both the input images and your algorithm's output directly from the platform.

I would like to inquire if the input images and ground truth labels you use internally are the same as those in imagesTr and labelsTr

We have verified that the input images and ground truth labels used internally are indeed identical to those in imagesTr and labelsTr. We checked the SHA1 hashes of each image and label to ensure consistency.
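
A check of this kind can be scripted, for example as in the sketch below; the directory name is a placeholder.

```python
# Sketch: print SHA1 digests of all .mha files in a directory so that two
# copies of the dataset can be compared. Directory name is illustrative.
import hashlib
from pathlib import Path

def sha1_of(path):
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for mha in sorted(Path("imagesTr").glob("*.mha")):
    print(mha.name, sha1_of(mha))
```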

Have you performed similar tests to validate the consistency of the evaluation script? Could this issue be related to discrepancies in the evaluation script or the Grand Challenge platform itself?

We have tested the evaluation script both locally and on the Grand Challenge platform using our example algorithm. The results were consistent across both environments, confirming that the evaluation script is functioning correctly.

I therefore think the problem may lie in your algorithm code. If you have any further questions or need additional assistance, please let us know.

Best regards,
Luca Lumetti

PengchengShi1220 commented 1 month ago

Dear Luca Lumetti,

Thank you for your thorough investigation and detailed response.

Yes, I have downloaded and reviewed the file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. As shown in my screenshot, the prediction on the right side is exactly the same as the one in 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha.


Could you please help resolve the issue?

Best regards,
Pengcheng Shi

LucaLumetti commented 1 month ago

Dear @PengchengShi1220,

I can see that the names of your files do not correspond to the names I see on the platform. Your algorithm results page lists the predicted labels employed by the evaluation Docker, which are generated upon submission of the algorithm; there I see the following:

Upon downloading, the four different files are named, respectively:

I cannot find any file named 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. Could you please specify where you downloaded this file from?

Thank you, Luca

PengchengShi1220 commented 1 month ago

Dear Luca,

Thank you for your detailed message. The file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha was generated during the Try-out Algorithm phase using the same Docker setup. My input file was imagesTr/ToothFairy2F_065_0000.mha, and the output was the file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. I'm not sure if you can see this file, but it should correspond to the input mentioned above.

I downloaded:

Yes, the four files you referred to are indeed my submission results. Among them, e139ac8b-3a1d-457a-9208-04f6ce2a72bc.mha is the output corresponding to ToothFairy2F_065_0000.mha. There is a significant difference between 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha and e139ac8b-3a1d-457a-9208-04f6ce2a72bc.mha, with Dice scores of 0.914 and 0.091 respectively, which indicates a potential issue.

I hope this clarifies the situation.

Best, Pengcheng

LucaLumetti commented 1 month ago

Dear Pengcheng, My guess is that you are performing the submission in the wrong way. Once you update your algorithm, you can test it through the algorithm page, but tests of this kind have nothing to do with the challenge itself. You can tell because the creator is your account instead of Unknown (which would mean that our challenge created it during your submission).

Please make sure that the algorithm Docker image (or GitHub repository) is up to date and perform the resubmission again from our challenge page: https://toothfairy2.grand-challenge.org/evaluation/preliminary-phase-release-of-training-data/submissions/create/

It should only take a few minutes; then we will see whether the metrics have changed.

Best regards, Luca Lumetti

PengchengShi1220 commented 1 month ago

Dear Luca,

Thank you for your feedback and guidance.

I want to clarify that my algorithm was indeed the last one uploaded and activated, as shown in the images provided. The Algorithm ID and Algorithm version are both cb7450ed-e992-4b14-b1bc-61e177626003.


Best regards,
Pengcheng

PengchengShi1220 commented 1 month ago

Hi Luca,

I have noticed a significant discrepancy between my input data ToothFairy2F_065_0000.mha (989476a7-eea5-41c5-a8fa-bb348016807c.mha) and the test data ToothFairy2F_065_0000.mha (007fecaa-228a-439a-af7f-93ddfc53b0b2.mha), including differences in file size and grayscale range. I will re-download the latest version of the ToothFairy2 dataset and conduct further checks.
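
A quick way to spot this kind of mismatch is to print the size, spacing, and intensity range of both copies, as in the sketch below; the local paths are placeholders.

```python
# Sketch: compare size, spacing, and intensity range of two copies of a case.
import SimpleITK as sitk

def summarize(path):
    img = sitk.ReadImage(path)
    arr = sitk.GetArrayFromImage(img)
    print(path, "size:", img.GetSize(), "spacing:", img.GetSpacing(),
          "min/max:", arr.min(), arr.max())

# Placeholder local paths for the two versions of ToothFairy2F_065_0000.mha:
summarize("old_download/ToothFairy2F_065_0000.mha")
summarize("platform_input/ToothFairy2F_065_0000.mha")
```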


Best, Pengcheng

LucaLumetti commented 1 month ago

Hi @PengchengShi1220,

On 2024-06-13, we updated the F volumes. You can review the changes in the changelog on Ditto. The new values have been projected into the HU scale through a simple linear transformation. This means you should be able to correct your data without needing to retrain your model.
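
For illustration, a rescale of this kind could be applied as in the sketch below; the coefficients a and b are placeholders, not the official values from the changelog.

```python
# Sketch: apply a linear rescale new = a * old + b to a volume with SimpleITK.
# a and b are placeholders; the actual coefficients are in the dataset changelog.
import SimpleITK as sitk

a, b = 1.0, 0.0  # replace with the coefficients from the changelog

img = sitk.ReadImage("ToothFairy2F_065_0000.mha")
rescaled = sitk.Cast(img, sitk.sitkFloat32) * a + b
sitk.WriteImage(rescaled, "ToothFairy2F_065_0000_hu.mha")
```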

If you missed this update, there might be other changes you're not aware of. Therefore, I recommend re-downloading the latest version of the dataset to ensure you have the most current data (and, in that case, retraining your model). Please note that the dataset will not undergo any further updates until the end of the challenge.

Luca

PengchengShi1220 commented 1 month ago

Hi Luca,

Thank you for your clarification and suggestions. I will download the latest version of the dataset and retrain the model before submitting another attempt.

Best,
Pengcheng

PengchengShi1220 commented 1 month ago

Hi Luca,

Thank you for your help. After downloading the latest version of the dataset and retraining the model, the issue has been resolved.

Best, Pengcheng