AImageLab-zip / ToothFairy

Repository for the ToothFairy challenges (MICCAI 2023 & MICCAI 2024)
MIT License

Low Prediction Metrics #9

Closed · PengchengShi1220 closed this issue 1 month ago

PengchengShi1220 commented 1 month ago

Dear Organizers,

I have encountered an unusual issue with my prediction metrics. When I test locally and on the Grand Challenge Algorithm platform, the Oral-pharyngeal segmentation outputs closely match each other. However, after submitting my results to the Preliminary Phase, the metrics are extremely low.

To investigate further, I used the evaluation script provided at evaluation.py to calculate Dice and Hausdorff Distance (HD) locally. The results were significantly different from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction appears consistent with the ground truth in ITK-SNAP 3D visualization), the average Dice coefficient I calculated locally is 0.914. However, the platform displays a Dice coefficient of 0.091 (as seen here).
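
For reference, a minimal sketch of this kind of per-label Dice check on .mha files (HD omitted for brevity) is shown below; it is not the official evaluation.py, and the paths are placeholders.

```python
# Minimal per-label Dice check on .mha files (not the official evaluation.py).
import SimpleITK as sitk
import numpy as np

def dice_per_label(gt_path, pred_path):
    gt = sitk.GetArrayFromImage(sitk.ReadImage(gt_path))
    pred = sitk.GetArrayFromImage(sitk.ReadImage(pred_path))
    scores = {}
    for label in np.unique(gt):
        if label == 0:  # skip background
            continue
        g, p = gt == label, pred == label
        denom = g.sum() + p.sum()
        scores[int(label)] = 2.0 * np.logical_and(g, p).sum() / denom if denom else 1.0
    return scores

# Placeholder paths:
# scores = dice_per_label("labelsTr/ToothFairy2F_065.mha", "prediction/ToothFairy2F_065.mha")
# print("mean Dice:", np.mean(list(scores.values())))
```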

This discrepancy is not limited to the Dice coefficient; it also affects HD and other cases. I would like to inquire if the input images and ground truth labels you use internally are the same as those in imagesTr and labelsTr. Have you performed similar tests to validate the consistency of the evaluation script? Could this issue be related to discrepancies in the evaluation script or the Grand Challenge platform itself?

Thank you for your assistance in resolving this matter.

Best, Pengcheng

LucaLumetti commented 1 month ago

Dear PengchengShi1220,

I'm sorry for not getting back to you sooner. We have thoroughly investigated the issue you're experiencing with the evaluation Docker.

The results were significantly different from those shown on the Grand Challenge platform. For example, for ToothFairy2F_065_0000.mha (where the prediction appears consistent with the ground truth in ITK-SNAP 3D visualization), the average Dice coefficient I calculated locally is 0.914. However, the platform displays a Dice coefficient of 0.091

Upon reviewing your algorithm output on the Grand Challenge platform, it appears significantly different from the screenshot you provided. This discrepancy aligns with the lower Dice score you observed on the Grand Challenge platform. You can view and download both the input images and your algorithm's output directly from the platform.

I would like to inquire if the input images and ground truth labels you use internally are the same as those in imagesTr and labelsTr

We have verified that the input images and ground truth labels used internally are indeed identical to those in imagesTr and labelsTr. We checked the SHA1 hashes of each image and label to ensure consistency.
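
A check of this kind can be scripted, for example as in the sketch below; the directory name is a placeholder.

```python
# Sketch: print SHA1 digests of all .mha files in a directory so that two
# copies of the dataset can be compared. Directory name is illustrative.
import hashlib
from pathlib import Path

def sha1_of(path):
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for mha in sorted(Path("imagesTr").glob("*.mha")):
    print(mha.name, sha1_of(mha))
```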

Have you performed similar tests to validate the consistency of the evaluation script? Could this issue be related to discrepancies in the evaluation script or the Grand Challenge platform itself?

We have tested the evaluation script both locally and on the Grand Challenge platform using our example algorithm. The results were consistent across both environments, confirming that the evaluation script is functioning correctly.

I therefore think the problem may lie in your algorithm code. If you have any further questions or need additional assistance, please let us know.

Best regards,
Luca Lumetti

PengchengShi1220 commented 1 month ago

Dear Luca Lumetti,

Thank you for your thorough investigation and detailed response.

Yes, I have downloaded and reviewed the file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. As shown in my screenshot, the prediction on the right side is exactly the same as the one in 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha.


Could you please help resolve the issue?

Best regards,
Pengcheng Shi

LucaLumetti commented 1 month ago

Dear @PengchengShi1220,

I can see that the names of your files do not correspond to the names I see on the platform. Your algorithm results page lists the predicted labels employed by the evaluation Docker, which are generated upon submission of the algorithm; there I see the following:

Upon downloading, the four different files are named, respectively:

I cannot find any file named 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. Could you please specify where you downloaded this file from?

Thank you, Luca

PengchengShi1220 commented 1 month ago

Dear Luca,

Thank you for your detailed message. The file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha was generated during the Try-out Algorithm phase using the same Docker setup. My input file was imagesTr/ToothFairy2F_065_0000.mha, and the output was the file 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha. I'm not sure if you can see this file, but it should correspond to the input mentioned above.

I downloaded:

Yes, the four files you referred to are indeed my submission results. Among them, e139ac8b-3a1d-457a-9208-04f6ce2a72bc.mha is the output corresponding to ToothFairy2F_065_0000.mha. There is a significant difference between 69780988-6d7e-4bb9-8dbe-6865c9132a26.mha and e139ac8b-3a1d-457a-9208-04f6ce2a72bc.mha, with Dice scores of 0.914 and 0.091 respectively, which indicates a potential issue.

I hope this clarifies the situation.

Best, Pengcheng

LucaLumetti commented 1 month ago

Dear Pengcheng, My guess is that you are performing the submission in the wrong way. Once you update your algorithm, you can test it through the algorithm page, but tests of this kind have nothing to do with the challenge itself. You can tell because the creator is your account instead of Unknown (which would mean that our challenge created it during your submission).

Please make sure that the algorithm Docker image (or GitHub repository) is up to date and perform the resubmission again from our challenge page: https://toothfairy2.grand-challenge.org/evaluation/preliminary-phase-release-of-training-data/submissions/create/

It should only take a few minutes; then we will see whether the metrics have changed.

Best regards, Luca Lumetti

PengchengShi1220 commented 1 month ago

Dear Luca,

Thank you for your feedback and guidance.

I want to clarify that my algorithm was indeed the last one uploaded and activated, as shown in the images provided. The Algorithm ID and Algorithm version are both cb7450ed-e992-4b14-b1bc-61e177626003.


Best regards,
Pengcheng

PengchengShi1220 commented 1 month ago

Hi Luca,

I have noticed a significant discrepancy between my input data ToothFairy2F_065_0000.mha (989476a7-eea5-41c5-a8fa-bb348016807c.mha) and the test data ToothFairy2F_065_0000.mha (007fecaa-228a-439a-af7f-93ddfc53b0b2.mha), including differences in file size and grayscale range. I will re-download the latest version of the ToothFairy2 dataset and conduct further checks.
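
A quick way to spot this kind of mismatch is to print the size, spacing, and intensity range of both copies, as in the sketch below; the local paths are placeholders.

```python
# Sketch: compare size, spacing, and intensity range of two copies of a case.
import SimpleITK as sitk

def summarize(path):
    img = sitk.ReadImage(path)
    arr = sitk.GetArrayFromImage(img)
    print(path, "size:", img.GetSize(), "spacing:", img.GetSpacing(),
          "min/max:", arr.min(), arr.max())

# Placeholder local paths for the two versions of ToothFairy2F_065_0000.mha:
summarize("old_download/ToothFairy2F_065_0000.mha")
summarize("platform_input/ToothFairy2F_065_0000.mha")
```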


Best, Pengcheng

LucaLumetti commented 1 month ago

Hi @PengchengShi1220,

On 2024-06-13, we updated the F volumes. You can review the changes in the changelog on Ditto. The new values have been projected into the HU scale through a simple linear transformation. This means you should be able to correct your data without needing to retrain your model.
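
For illustration, a rescale of this kind could be applied as in the sketch below; the coefficients a and b are placeholders, not the official values from the changelog.

```python
# Sketch: apply a linear rescale new = a * old + b to a volume with SimpleITK.
# a and b are placeholders; the actual coefficients are in the dataset changelog.
import SimpleITK as sitk

a, b = 1.0, 0.0  # replace with the coefficients from the changelog

img = sitk.ReadImage("ToothFairy2F_065_0000.mha")
rescaled = sitk.Cast(img, sitk.sitkFloat32) * a + b
sitk.WriteImage(rescaled, "ToothFairy2F_065_0000_hu.mha")
```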

If you missed this update, there might be other changes you're not aware of. Therefore, I recommend re-downloading the latest version of the dataset to ensure you have the most current data (and, in that case, retraining your model). Please note that the dataset will not undergo any further updates until the end of the challenge.

Luca

PengchengShi1220 commented 1 month ago

Hi Luca,

Thank you for your clarification and suggestions. I will download the latest version of the dataset and retrain the model before submitting another attempt.

Best,
Pengcheng

PengchengShi1220 commented 1 month ago

Hi Luca,

Thank you for your help. After downloading the latest version of the dataset and retraining the model, the issue has been resolved.

Best, Pengcheng