BAAI-DCAI / M3D

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
MIT License

Divergence in Reported Recall Score for IR and TR Tests #22

Closed htong3031 closed 1 month ago

htong3031 commented 1 month ago

Dear Author,

I have encountered a significant discrepancy between the Recall scores reported for the IR and TR tests in the paper and the scores from my own run. For instance, the paper reports an IR R@10 score of 62.25 with 2000 test samples, but my run yields a much higher score, exceeding 95.

Specifically, I used the recommended medical 3D ViT weights (pretrained_ViT.bin from M3D-CLIP) directly and observed the following results, which differ significantly from those reported in Table 2 of the paper:

| Test samples | 100   | 500   | 1000  | 2000  |
|--------------|-------|-------|-------|-------|
| IR R@1       | 50.00 | 86.20 | 82.20 | 78.55 |
| IR R@5       | 84.00 | 96.80 | 95.10 | 93.20 |
| IR R@10      | 95.00 | 97.80 | 97.20 | 95.75 |
| TR R@1       | 48.00 | 86.20 | 81.70 | 77.95 |
| TR R@5       | 89.00 | 96.40 | 94.60 | 93.40 |
| TR R@10      | 94.00 | 97.40 | 96.90 | 96.25 |
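For context, I compute Recall@K in the standard retrieval way (my own sketch, not the repository's evaluation script): rank all candidates by similarity for each query and count the fraction of queries whose paired item lands in the top K. Note that this metric naturally drops as the candidate pool grows, which is consistent with the trend from 100 to 2000 samples in my table, though not with the gap to the paper's numbers.

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K (in percent) for retrieval.

    sim: (N, N) similarity matrix; rows are queries, columns are
    candidates, and the matching pair is assumed to lie on the diagonal.
    """
    n = sim.shape[0]
    # Indices of the top-k most similar candidates for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A query is a "hit" if its own index appears among its top-k.
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean() * 100.0

# Toy demonstration: with noisy paired embeddings, Recall@10 shrinks
# as the candidate pool grows from 100 to 2000.
rng = np.random.default_rng(0)
for n in (100, 2000):
    img = rng.normal(size=(n, 64))
    txt = img + 0.5 * rng.normal(size=(n, 64))  # noisy "paired" texts
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    print(n, round(recall_at_k(img @ txt.T, 10), 2))
```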

Upon inspecting the evaluation script for ITR, I noticed that it references a JSON file at ./Data/data/M3D_Cap_npy/M3D_Cap_eh.json. Since I could not locate M3D_Cap_eh.json, I assumed it was the same as M3D_Cap.json; I suspect this may be contributing to the divergence in the recall scores. Could you please clarify whether M3D_Cap_eh.json is a different evaluation file? If so, where can I find it?

Additionally, I would appreciate some clarification regarding the weights in pretrained_ViT.bin: are they the result of pretraining the 3D image encoder alone, or were they updated during a fine-tuning process? I have also trained the model from scratch, and the results still show significant discrepancies:

| Test samples | 100   | 500   | 1000  | 2000  |
|--------------|-------|-------|-------|-------|
| IR R@1       | 36.00 | 77.40 | 66.60 | 57.85 |
| IR R@5       | 81.00 | 93.80 | 89.40 | 85.75 |
| IR R@10      | 91.00 | 96.40 | 94.10 | 91.60 |
| TR R@1       | 38.00 | 74.00 | 64.30 | 56.30 |
| TR R@5       | 79.00 | 94.40 | 88.40 | 83.90 |
| TR R@10      | 93.00 | 96.60 | 94.00 | 91.05 |

Thank you for your time and assistance.

Best regards,

baifanxxx commented 1 month ago

Hi,

In fact, the 3D ViT weights on our Hugging Face are from a better model, obtained with a larger batch size, which we provide for researchers to use. The experimental results and models in the paper have not been updated; we will update the paper later, and in the meantime you can use our stronger model.

M3D_Cap_eh.json is the same as M3D_Cap.json; you can use M3D_Cap.json directly.

pretrained_ViT.bin was trained only on M3D-Cap during pretraining; the weights we provide have not been updated by MLLM training.

As for the model's performance: with our latest large-batch-size setting, we get better results. Your results do not look bad either, but matching ours would require a larger batch size and longer training. I still recommend using the latest weights we provide.
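Batch size matters here because CLIP-style vision-language pretraining treats every other sample in the batch as a negative, so a larger batch gives a harder, more informative contrastive objective. A minimal numpy sketch of a symmetric InfoNCE loss over in-batch negatives (an illustration of the general technique, not this repository's actual implementation):

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives.

    Each (image, text) pair at index i is a positive; the other B-1
    samples in the batch serve as negatives, so the effective number
    of negatives grows directly with the batch size B.
    """
    # L2-normalize embeddings so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarities
    labels = np.arange(logits.shape[0])

    def xent(l):
        # Cross-entropy with the matching pair on the diagonal.
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With well-aligned pairs the diagonal dominates and the loss is near zero; with uninformative embeddings it approaches log(B).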

If you have any other questions, please feel free to contact me. I have been a little busy recently, so please understand if my replies are delayed.