About the reference score

Eldo-rado commented 3 months ago

Hi @baeseongsu, apologies for bothering you again, but I have a few detailed questions that I hope you can help me with.

What does "VQA grounding" refer to in section E.1.3?
Could you explain in more detail how DINO is used as mentioned in E.1.3? Is the input anatomical regions? For example, is the input the bbox of the left lung, and the output the answer to questions related to the left lung? Does the training process involve anatomical region bboxes, or is it just based on MIMIC CXR JPG?
Regarding the design of the reference score in Section 6.1, I want to clarify the motivation behind it. For example, with two questions like "Is there {attributeA} in {objectA}" and "Is there {attributeB} in {objectB}," is the reference score used because the answer to each sentence might not be entirely accurate? However, in Table E18, the performance of the ref model is often not as good as M3AE, so does it still serve as a useful guide in such cases?

Thank you in advance!

baeseongsu commented 3 months ago

Hi @Eldo-rado,

You had great questions :), and I would like to share my thoughts about them. Before getting into each detail, I would first like to overview the underlying motivation of section E.1.3, and then respond to your questions.

When creating a benchmark, it's essential to have reference scores (e.g., naive random guessing, human/expert performance score) that give meaning to the performance. Comparing these reference scores with your model's score can indicate how well your model is performing. For our MIMIC-CXR-VQA dataset, we built baselines such as Prior (Most) and Prior (Question), which are powerful in closed-set answer scenarios. Comparing these baseline scores with trained VQA models suggests that the trained models are not just random guessing and have learned to answer the questions to some degree.

We also wanted to estimate the achievable performance (i.e., upper bound) of our MIMIC-CXR-VQA dataset. We know that 100% accuracy cannot be the upper bound because even radiologists aren't perfect at CXR interpretation, so QA performance can't reach 100%. Obtaining a reliable upper bound score (i.e., medical expert score) for all test data through dozens of radiologists was not feasible for us. However, simply reporting performance without context lacks meaning.

Therefore, we aimed to estimate a reference model/score to show the potential performance ceiling for VQA models in the MIMIC-CXR-VQA dataset. To design this reference model, we leveraged the almost ground-truth grounding information (i.e., bounding box of object, binary label of attribute existence) from the Chest ImaGenome dataset. We built a reference model by designing an architecture with an inductive bias and providing grounding information directly during training. Note that VQA models are trained with (image, question, answer) triples, so they learn indirectly and without any inductive bias in the model architecture. Given fair experimental settings, this reference model might have more powerful perception ability compared to the VQA model due to its additional inductive bias.

To illustrate the significance of this reference model/score, consider this example: Given the same dataset, a VQA model trained on the VQA dataset achieves a performance of 0.7 when evaluating verification questions like "Is there lung cancer in the left lung?" across multiple samples. Our reference model, using the left lung as image input and employing a lung cancer prediction head for binary classification, achieves a performance of 0.8 on the same number of VQA samples. This 0.8 score serves as the reference score achievable with correct perception, suggesting that the current VQA model has room for improvement.

baeseongsu commented 3 months ago

What does "VQA grounding" refer to in section E.1.3?

As mentioned in the middle of Section 6.1, we realized that the core perception of VQA models hinges on the question template "Is there {attribute} in {object}?". When a VQA model has good core perception skills, it can solve more complex questions by employing a combination of logical operations. We broadly term this core perception skill of VQA models as "VQA grounding". I acknowledge this may be vague terminology without concrete context; apologies for any confusion.

Could you explain in more detail how DINO is used as mentioned in E.1.3? Is the input anatomical regions? For example, is the input the bbox of the left lung, and the output the answer to questions related to the left lung? Does the training process involve anatomical region bboxes, or is it just based on MIMIC CXR JPG?

Let me illustrate with an example: For a verify sample (image=X, question="Is there lung opacity in right lower lung zone?", answer="no"), this is transformed into the format (input="right lower lung zone (RLLZ) part of X", classification head="lung opacity", answer="absent(0)"). The reference model then takes the bounding RLLZ box as input and classifies binary labels on the lung opacity classification head. For the training procedure, we incorporated anatomical region bounding boxes. I added a diagram comparing the reference model and VQA model with the same sample for clarity.

Regarding the design of the reference score in Section 6.1, I want to clarify the motivation behind it. For example, with two questions like "Is there {attributeA} in {objectA}" and "Is there {attributeB} in {objectB}," is the reference score used because the answer to each sentence might not be entirely accurate? However, in Table E18, the performance of the ref model is often not as good as M3AE, so does it still serve as a useful guide in such cases?

Regarding your findings about Table E18, where the performance of the reference model is often not as good as M3AE, this is indeed an important observation. We also noticed that our reference model doesn't always achieve the best score compared to other VQA models across all (object, attribute) pairs. This is why we decided to call our reference model an "achievable score" rather than an "upper bound score".
There are several reasons for this: (a) Vision-language pretrained (VLP) models like MedViLL and M3AE, while not receiving direct grounding information, can learn implicit correlations between images and questions during training. For example, anatomical finding A1 in region O1 might relate to anatomical finding A2 in region O2, as most physical findings in our body are not entirely independent; (b) VQA models can see the entire image, while the reference model only sees the targeted area (though it's a clearly informative region for classifying the existence of attributes).
Therefore, we conclude that our reference score can act as an achievable performance metric, indicating where VQA models can potentially improve, not explicit upper bound score. However, in cases where VQA models outperform the reference score, as you mentioned, we can't draw definitive conclusions about the VQA model's performance based on the reference score alone. This limitation is part of why we consider it an "achievable" rather than an "upper bound" score.

baeseongsu commented 3 months ago

@Eldo-rado,

If my claims are not incorrect or vague, please feel free to discuss them :) Thank you for asking.

Eldo-rado commented 3 months ago

Fully understood, thank you for your explanation! This is very helpful to me. ❤️

By the way, perhaps we can try using both the image and region as inputs for DINO. After all, DINO only used the full image during training, and providing only the region during testing may likely lead to a drop in performance. Additionally, when using DINO for inference, CTR and MTR might also need to be considered as mentioned in B 2.2 Question template construction. This is my rough idea; I wonder if it is correct. 😂

baeseongsu commented 3 months ago

Hi @Eldo-rado,

Thank you for sharing your ideas. I might not have tested using both the full image and targeted region as inputs for designing the reference model; that is a totally reasonable choice for boosting the reference score. Regarding DINO, during pre-training, the DINO model (i.e., DINO v1) not only used full images but also cropped images because they used a multi-crop augmentation technique. That's why we decided to use the DINO pre-training strategy and model as the backbone, as it was ready to adopt cropped images as inputs.

Regarding CTR and MTR, you are right. I did not consider the ratio features such as CTR and MTR, but these should be included in the reference score for completeness. For example, as you similarly mentioned, we could design a reference experiment with two inputs: one being the entire image, and the other being the same size but showing only the targeted region while otherwise being black. In fact, I've just realized this might be clearer in how we can inject the region information. By comparing the black area and the targeted area, DINO can directly understand which region is the target, rather than relying on our cropping method.

Eldo-rado commented 3 months ago

Thank you for your sharing. If possible, could you provide me with the DINO-related code and the trained weights? I would like to give it a try. We will cite EHRXQA in our work!

baeseongsu commented 3 months ago

@Eldo-rado,

I will check the DINO-related code and trained weights, and then let you know about them together. I am happy to share our work with you to help you develop better ideas.

Eldo-rado commented 3 months ago

@baeseongsu Thank you for your generosity; I have learned a lot.

baeseongsu commented 3 months ago

@Eldo-rado

Now, you can review the entire code we used for the reference experiment at the github repo link. This includes: (1) How to build a MIMIC-DINO (pre-training MIMIC-CXR-JPG images for the ViT backbone initialized with original DINO v1 weights); (2) How to fine-tune DINO to build a reference model.

We've also shared the pre-trained and fine-tuned weights for these models via dropbox. While the code might be a bit messy in places, you could use it to understand how to load and use the models.

Best, Seongsu

Eldo-rado commented 3 months ago

Thank you, I will try it. If I discover anything new, I will share it with you. ☺️

Eldo-rado commented 2 months ago

hi @baeseongsu

Regarding Appendix 2.2, I have a few questions to confirm😂: 1.In Image selection (b), the instruction "select one representative CXR image based on the earliest study datetime" — is this step intended to avoid having different images corresponding to the same report? 2.In Label refinement (b), the instruction to "propagate the presence of attributes from a child object to its parent object" — does this require additional action? I noticed that in the Chest ImaGenome Dataset, it seems this has already been done. 3.Similarly, in Label refinement (c), does "propagate the presence of a child attribute to its parent attribute" require additional steps? 4.In Label refinement (d), is it necessary to "exclude any relationships between an object and its associated attributes that are not allowed by the ontology"? I found in the Chest ImaGenome Dataset paper that they seem to mention, "Using a CXR ontology constructed by radiologists, a scene graph assembly pipeline corrected obvious attribute-to-anatomy assignment errors."

Thx！

baeseongsu commented 2 months ago

Hi @Eldo-rado,

1.In Image selection (b), the instruction "select one representative CXR image based on the earliest study datetime" — is this step intended to avoid having different images corresponding to the same report?

For simplicity, we use one representative CXR image, but you can consider all different CXR images corresponding to the same report if you prefer.

2.In Label refinement (b), the instruction to "propagate the presence of attributes from a child object to its parent object" — does this require additional action? I noticed that in the Chest ImaGenome Dataset, it seems this has already been done. 3.Similarly, in Label refinement (c), does "propagate the presence of a child attribute to its parent attribute" require additional steps? 4.In Label refinement (d), is it necessary to "exclude any relationships between an object and its associated attributes that are not allowed by the ontology"? I found in the Chest ImaGenome Dataset paper that they seem to mention, "Using a CXR ontology constructed by radiologists, a scene graph assembly pipeline corrected obvious attribute-to-anatomy assignment errors."

Regarding Q2, Q3, and Q4: Yes, we did post-processing for those parts. Although Chest ImaGenome has well-defined ontologies including parent-child relationships between objects and attributes, we refined these relationships further under the guidance of medical experts for better ontologies. However, I guess our further label refinement is not crucial, so you can ignore it if you want. But if you want to look deeper into that, I recommend referring to my code about postprocessing labels in the current GitHub repository. Here is the link!

Best, Seongsu

baeseongsu / mimic-cxr-vqa

About the reference score #4