baeseongsu / mimic-cxr-vqa

A new medical VQA dataset based on MIMIC-CXR. Part of the work 'EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images' (NeurIPS 2023 D&B).
MIT License

About the reference score #4

Open Eldo-rado opened 2 weeks ago

Eldo-rado commented 2 weeks ago

Hi @baeseongsu, apologies for bothering you again, but I have a few detailed questions that I hope you can help me with.

  1. What does "VQA grounding" refer to in section E.1.3?
  2. Could you explain in more detail how DINO is used, as mentioned in E.1.3? Are the inputs anatomical regions? For example, is the input the bbox of the left lung, and the output the answer to questions related to the left lung? Does the training process involve anatomical region bboxes, or is it based only on MIMIC-CXR-JPG?
  3. Regarding the design of the reference score in Section 6.1, I want to clarify the motivation behind it. For example, with two questions like "Is there {attributeA} in {objectA}" and "Is there {attributeB} in {objectB}," is the reference score used because the answer to each sentence might not be entirely accurate? However, in Table E18, the performance of the ref model is often not as good as M3AE, so does it still serve as a useful guide in such cases?

Thank you in advance!

baeseongsu commented 2 weeks ago

Hi @Eldo-rado,

These are great questions :), and I would like to share my thoughts on them. Before getting into the details, I will first give an overview of the underlying motivation behind section E.1.3 and then respond to each of your questions.

When creating a benchmark, it's essential to have reference scores (e.g., naive random guessing, human/expert performance score) that give meaning to the performance. Comparing these reference scores with your model's score can indicate how well your model is performing. For our MIMIC-CXR-VQA dataset, we built baselines such as Prior (Most) and Prior (Question), which are powerful in closed-set answer scenarios. Comparing these baseline scores with trained VQA models suggests that the trained models are not just random guessing and have learned to answer the questions to some degree.
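(To make the priors concrete: they can be computed directly from the training answer distribution. The toy sketch below is illustrative only; it is not our evaluation code, and the sample fields are made up.)

```python
from collections import Counter, defaultdict

# Toy splits; each sample pairs a question (template) with its answer.
train = [
    {"question": "Is there pleural effusion in the left lung?", "answer": "yes"},
    {"question": "Is there pleural effusion in the left lung?", "answer": "no"},
    {"question": "Is there pleural effusion in the left lung?", "answer": "yes"},
    {"question": "Is there cardiomegaly?", "answer": "no"},
]
test = [{"question": "Is there cardiomegaly?", "answer": "no"}]

# Prior (Most): always predict the single most frequent answer in the training set.
prior_most = Counter(s["answer"] for s in train).most_common(1)[0][0]

# Prior (Question): predict the most frequent answer per question, falling back to Prior (Most).
per_question = defaultdict(Counter)
for s in train:
    per_question[s["question"]][s["answer"]] += 1

def prior_question(question):
    counts = per_question.get(question)
    return counts.most_common(1)[0][0] if counts else prior_most

acc_most = sum(s["answer"] == prior_most for s in test) / len(test)
acc_question = sum(s["answer"] == prior_question(s["question"]) for s in test) / len(test)
print(f"Prior (Most): {acc_most:.2f}, Prior (Question): {acc_question:.2f}")
```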

We also wanted to estimate the achievable performance (i.e., an upper bound) on our MIMIC-CXR-VQA dataset. 100% accuracy is not a realistic upper bound, since even radiologists are not perfect at CXR interpretation. Obtaining a reliable upper-bound score (i.e., a medical expert score) for the entire test set from dozens of radiologists was not feasible for us, yet simply reporting performance without such context lacks meaning.

Therefore, we aimed to estimate a reference model/score that shows the potential performance ceiling for VQA models on the MIMIC-CXR-VQA dataset. To design this reference model, we leveraged the near-ground-truth grounding information (i.e., bounding boxes of objects, binary labels of attribute existence) from the Chest ImaGenome dataset. We built the reference model by designing an architecture with an explicit inductive bias and providing the grounding information directly during training. Note that VQA models are trained only on (image, question, answer) triples, so they learn this grounding indirectly and without any such inductive bias in the model architecture. Under fair experimental settings, this reference model might have stronger perception ability than the VQA model due to its additional inductive bias.

To illustrate the significance of this reference model/score, consider this example: Given the same dataset, a VQA model trained on the VQA dataset achieves a performance of 0.7 when evaluating verification questions like "Is there lung cancer in the left lung?" across multiple samples. Our reference model, using the left lung as image input and employing a lung cancer prediction head for binary classification, achieves a performance of 0.8 on the same number of VQA samples. This 0.8 score serves as the reference score achievable with correct perception, suggesting that the current VQA model has room for improvement.
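To make this concrete in code, here is a minimal sketch of such a cropped-region reference model. It is illustrative only: the backbone is the public DINO v1 ViT-S/16 from torch.hub, and the bbox, preprocessing, and single-attribute head are simplified stand-ins rather than our exact implementation.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms

# Cropped-object reference model: encode the object's bbox region with a
# DINO ViT-S/16 backbone and predict attribute presence with a binary head.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")  # CLS feature, dim 384
head = nn.Linear(384, 1)  # e.g., "lung cancer" presence for the cropped object

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def attribute_probability(image_path, bbox):
    """bbox = (x1, y1, x2, y2) of the object (e.g., the left lung) from Chest ImaGenome."""
    region = Image.open(image_path).convert("RGB").crop(bbox)
    feature = backbone(preprocess(region).unsqueeze(0))  # [1, 384]
    return torch.sigmoid(head(feature)).item()           # P(attribute is present)
```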

baeseongsu commented 2 weeks ago
  1. What does "VQA grounding" refer to in section E.1.3?

  2. Could you explain in more detail how DINO is used, as mentioned in E.1.3? Are the inputs anatomical regions? For example, is the input the bbox of the left lung, and the output the answer to questions related to the left lung? Does the training process involve anatomical region bboxes, or is it based only on MIMIC-CXR-JPG?

(answer provided as an attached image)

  3. Regarding the design of the reference score in Section 6.1, I want to clarify the motivation behind it. For example, with two questions like "Is there {attributeA} in {objectA}" and "Is there {attributeB} in {objectB}," is the reference score used because the answer to each sentence might not be entirely accurate? However, in Table E18, the performance of the ref model is often not as good as M3AE, so does it still serve as a useful guide in such cases?
baeseongsu commented 2 weeks ago

@Eldo-rado,

If any of my claims are incorrect or vague, please feel free to discuss them :) Thank you for asking.

Eldo-rado commented 2 weeks ago

Fully understood, thank you for your explanation! This is very helpful to me. ❤️

By the way, perhaps we can try using both the full image and the region as inputs for DINO. After all, DINO only saw the full image during training, and providing only the region at test time would likely lead to a drop in performance. Additionally, when using DINO for inference, CTR and MTR might also need to be considered, as mentioned in B.2.2 Question template construction. This is my rough idea; I wonder if it is correct. 😂

baeseongsu commented 2 weeks ago

Hi @Eldo-rado,

Thank you for sharing your ideas. I have not tested using both the full image and the targeted region as inputs when designing the reference model; that is a totally reasonable choice for boosting the reference score. Regarding DINO: during pre-training, the DINO model (i.e., DINO v1) uses not only full images but also cropped views, because of its multi-crop augmentation strategy. That is why we chose the DINO pre-training strategy and model as the backbone; it was already well suited to taking cropped images as inputs.

Regarding CTR and MTR, you are right. I did not consider ratio-based features such as CTR and MTR, but these should be covered by the reference score for completeness. For example, as you suggested, we could design a reference experiment with two inputs: one being the entire image, and the other being an image of the same size that shows only the targeted region, with everything else blacked out. In fact, I've just realized this might be a cleaner way to inject the region information: by contrasting the blacked-out area with the targeted area, DINO can directly infer which region is the target, rather than relying on our cropping method.
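A rough sketch of how that second, region-only input could be constructed (purely illustrative; the bbox and image sizes below are placeholders):

```python
import torch

def mask_outside_bbox(image, bbox):
    """Return an image of the same size where only the target region is visible.

    image: [3, H, W] tensor; bbox: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = bbox
    masked = torch.zeros_like(image)
    masked[:, y1:y2, x1:x2] = image[:, y1:y2, x1:x2]
    return masked

# Two-view input: the full CXR and a same-size view showing only the target region.
full_image = torch.rand(3, 224, 224)              # placeholder for a preprocessed CXR
region_only = mask_outside_bbox(full_image, (40, 60, 160, 200))
two_view_input = torch.stack([full_image, region_only])  # [2, 3, 224, 224]
```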

Eldo-rado commented 2 weeks ago

Thank you for sharing. If possible, could you provide me with the DINO-related code and the trained weights? I would like to give it a try. We will cite EHRXQA in our work!

baeseongsu commented 2 weeks ago

@Eldo-rado,

I will check the DINO-related code and trained weights, and then let you know about them together. I am happy to share our work with you to help you develop better ideas.

Eldo-rado commented 2 weeks ago

@baeseongsu Thank you for your generosity; I have learned a lot.

baeseongsu commented 2 weeks ago

@Eldo-rado

Now you can review the entire code we used for the reference experiment at the GitHub repo link. This includes: (1) how to build MIMIC-DINO (pre-training a ViT backbone, initialized with the original DINO v1 weights, on MIMIC-CXR-JPG images); and (2) how to fine-tune DINO to build a reference model.

We've also shared the pre-trained and fine-tuned weights for these models via dropbox. While the code might be a bit messy in places, you could use it to understand how to load and use the models.
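Very roughly, the fine-tuning step has the shape sketched below. This is not the exact training script in the repo; the dummy loader, hyperparameters, and single-attribute head are placeholders, but it shows the idea: a DINO-initialized ViT encodes a cropped object region, and a linear head is trained with a binary cross-entropy loss on attribute presence.

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")  # DINO v1 ViT-S/16
head = nn.Linear(384, 1)
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5
)
criterion = nn.BCEWithLogitsLoss()

# Dummy loader standing in for batches of (cropped region, attribute label) pairs.
loader = [(torch.randn(4, 3, 224, 224), torch.randint(0, 2, (4, 1)).float())]

for regions, labels in loader:
    logits = head(backbone(regions))  # [4, 1] attribute-presence logits
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```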

Best, Seongsu

Eldo-rado commented 2 weeks ago

Thank you, I will try it. If I discover anything new, I will share it with you. ☺️

Eldo-rado commented 1 week ago

Hi @baeseongsu,

Regarding Appendix 2.2, I have a few questions to confirm 😂:

  1. In Image selection (b), the instruction "select one representative CXR image based on the earliest study datetime" — is this step intended to avoid having different images correspond to the same report?

  2. In Label refinement (b), does the instruction to "propagate the presence of attributes from a child object to its parent object" require additional action? I noticed that in the Chest ImaGenome dataset this seems to have already been done. (See the toy propagation sketch below for what I mean.)

  3. Similarly, in Label refinement (c), does "propagate the presence of a child attribute to its parent attribute" require additional steps?

  4. In Label refinement (d), is it necessary to "exclude any relationships between an object and its associated attributes that are not allowed by the ontology"? The Chest ImaGenome paper seems to mention: "Using a CXR ontology constructed by radiologists, a scene graph assembly pipeline corrected obvious attribute-to-anatomy assignment errors."
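For concreteness, here is a toy sketch of what I understand by "propagation" in (b)/(c). The object/attribute hierarchies below are made up for illustration and are not the actual Chest ImaGenome ontology:

```python
# Child -> parent maps (illustrative names only).
OBJECT_PARENT = {"left lower lung zone": "left lung"}
ATTRIBUTE_PARENT = {"linear/patchy atelectasis": "atelectasis"}

def propagate(present):
    """present: set of (object, attribute) pairs labeled as present."""
    present = set(present)
    changed = True
    while changed:
        changed = False
        for obj, attr in list(present):
            candidates = {
                (OBJECT_PARENT.get(obj, obj), attr),      # child object  -> parent object
                (obj, ATTRIBUTE_PARENT.get(attr, attr)),  # child attribute -> parent attribute
            }
            new = candidates - present
            if new:
                present |= new
                changed = True
    return present

print(propagate({("left lower lung zone", "linear/patchy atelectasis")}))
# The result also contains ("left lung", ...) and (..., "atelectasis") pairs.
```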

Thx!

baeseongsu commented 1 week ago

Hi @Eldo-rado,

  1. In Image selection (b), the instruction "select one representative CXR image based on the earliest study datetime" — is this step intended to avoid having different images correspond to the same report?

  2. In Label refinement (b), does the instruction to "propagate the presence of attributes from a child object to its parent object" require additional action? I noticed that in the Chest ImaGenome dataset this seems to have already been done.

  3. Similarly, in Label refinement (c), does "propagate the presence of a child attribute to its parent attribute" require additional steps?

  4. In Label refinement (d), is it necessary to "exclude any relationships between an object and its associated attributes that are not allowed by the ontology"? The Chest ImaGenome paper seems to mention: "Using a CXR ontology constructed by radiologists, a scene graph assembly pipeline corrected obvious attribute-to-anatomy assignment errors."

Best, Seongsu