zihui-debug opened this issue 1 month ago
Hi @zihui-debug,
Sorry for the late reply. Thank you for asking about the sampling strategy used when constructing the MIMIC-CXR-VQA dataset. Initially, we struggled to set up the sampling strategy, as you mentioned. We faced several questions such as "How many (image, question, answer) triples do we need?" and "For each template, do we need to sample all objects and attributes equally, or should we let them be sampled randomly?". After much consideration, I can briefly summarize our VQA sampling strategy (similar to the "VQA dataset generation - dataset balancing" part in Appendix B.2.2 of our paper) as follows:
To avoid language biases within the VQA dataset, we maximized the answer entropy during sampling. For verifying questions (e.g., "Is there {attribute=lung cancer} in the {object=left lung}?"), we balanced the answers towards a yes:no ratio of 1:1. When sampling candidate images, we drew from image pools corresponding to either positive (question, yes) or negative (question, no) answers. Similarly, for choosing questions (e.g., "Which anatomical location is related to {attribute=lung cancer}, the {object_1=left lung} or the {object_2=right lung}?"), there are four possible options for the answer (i.e., both, option 1, option 2, none). Thus, we maximized the answer entropy towards a ratio of 1:1:1:1.
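For concreteness, here is a minimal Python sketch of this kind of answer-entropy balancing (the function and pool names are illustrative, not our actual generation code): we cycle through the possible answers and draw each image from the pool that yields that answer, which keeps the yes:no (or both / option 1 / option 2 / none) ratio close to uniform.

```python
import random

rng = random.Random(0)

def sample_balanced(question, image_pools, n_samples):
    """Toy sketch of answer-entropy balancing. `image_pools` maps each possible
    answer to the images that yield that answer, e.g. {"yes": [...], "no": [...]}
    for a verify question, or {"both": [...], "option_1": [...], "option_2": [...],
    "none": [...]} for a choose question."""
    answers = [a for a in image_pools if image_pools[a]]  # skip empty pools
    if not answers:
        return []
    samples = []
    for i in range(n_samples):
        answer = answers[i % len(answers)]        # round-robin over answers -> ~uniform ratio
        image = rng.choice(image_pools[answer])   # any image consistent with that answer
        samples.append({"image": image, "question": question, "answer": answer})
    return samples
```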
We also considered the number of questions per image to ensure a variety of images. Initially, we limited each question template to one use per CXR image. This means an image could have a minimum of 0 questions and a maximum of 48 (the total number of our templates). We implemented a global image counter (i.e., tracking how many times each image was used across different types of templates) to increase the probability of selecting less frequently sampled images, thereby promoting greater diversity in our image set.
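A rough sketch of what such a counter could look like (hypothetical names, not the code we actually used): each candidate image is weighted inversely to how many times it has already been sampled, so rarely used images become more likely to be picked.

```python
import random
from collections import Counter

rng = random.Random(0)
image_usage = Counter()  # global counter: how many times each image has been sampled

def pick_image(candidate_images):
    """Weight candidates inversely to their usage count so that
    less frequently sampled images are favored."""
    weights = [1.0 / (1 + image_usage[img]) for img in candidate_images]
    chosen = rng.choices(candidate_images, weights=weights, k=1)[0]
    image_usage[chosen] += 1
    return chosen
```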
Lastly, we ensured a balanced number of sampled questions per template to maintain uniformity. This approach prevents any particular template from being over- or under-sampled, leading to a fair and diverse question dataset. Since our main concern was the evaluation of medical VLMs, we avoided purely random sampling and instead carefully enforced this balance.
Specifically, to address your questions: we aimed to use as many question combinations (i.e., combinations of placeholder values for each template) as we could. In other words, while not every combination ended up in the dataset, most were considered during sampling. For the negative option, we sampled images so as to maximize the answer entropy.
If you have any detailed questions or would like to see the sampling code, I can share the raw-level code with you, but please note that it's a little bit messy.
Best, Seongsu
@zihui-debug Here is an addendum:
To ensure a balanced frequency of different objects and attributes, I almost certainly sampled a similar number of diverse questions within each template. In other words, I sampled different combinations of placeholders with roughly the same frequency (though I'm not 100% certain). However, due to the long-tailed distribution of X-ray findings itself, my algorithm cannot guarantee sampling all combinations equally. As a result, some combinations fail (e.g., there are no cases with both object1 and attribute1 to sample for a specific question template, especially in the gold dataset), and some are not frequent (e.g., there are frequent cases (i.e., related attributes) about the left lung but few cases about the trachea).
Thank you very much for your reply! Can I understand the whole sampling process like this:
Some supplementary questions:
Maybe I misunderstood something and some of the questions may seem elementary... Thanks for your meticulous answers about this great work! If convenient, could you please send the raw-level code to my email address (2192325557@qq.com)? That would be very helpful to me.
@baeseongsu Moreover, I see that the dataset doesn't use attributes of the 'nlp' and 'texture' types. What was the consideration behind that?
@zihui-debug Yes, that's exactly what we've done. Thank you for the clear summarization.
Regarding the supplementary questions, my response will be as follows:
Q1: Do placeholder value combinations refer to the 563 object-attribute relationships in Chest Imagenome?
A1: Not exactly. The combinations for placeholder values depend on the template used for sampling. For example: (1) If the template contains only one {object} placeholder, the possible combinations would be all objects (i.e., less than 40 possible combinations); (2) If the template contains multiple placeholders (e.g., {attribute1}, {attribute2}, and {object}), there could be over 1,000 possible combinations to sample. Note that we consider all possible combinations during sampling. However, some combinations may not be sampled if corresponding studies don't exist in the MIMIC-CXR cases.
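As a toy illustration of how the number of combinations scales with a template's placeholders (the object and attribute lists below are made up, not the real Chest ImaGenome vocabularies):

```python
from itertools import product

objects = ["left lung", "right lung", "trachea"]
attributes = ["lung cancer", "pneumonia", "atelectasis"]

# Template with a single {object} placeholder -> |objects| combinations
single = [{"object": o} for o in objects]

# Template with {attribute_1}, {attribute_2}, and {object} -> a much larger product
# (here keeping only unordered attribute pairs)
multi = [
    {"attribute_1": a1, "attribute_2": a2, "object": o}
    for a1, a2 in product(attributes, attributes) if a1 < a2
    for o in objects
]

print(len(single), len(multi))  # 3 and 9 with these toy lists
```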
Q2: Does a negative relationship (i.e., a negative option in choose and a 'no' answer in verify) mean that the Chest Imagenome attribute relationship is explicitly marked as no, or does anything that is not marked as yes count?
A2: This is an important question. We assume that if there is no information in the report, especially regarding the five categories we cover, we regard the relationship as "no." This is because radiology reports should be comprehensive; incompleteness could lead to patient care issues. This assumption is also why we do further preprocessing on the original Chest ImaGenome dataset.
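A minimal sketch of this "absence means no" assumption (illustrative only, not the actual preprocessing code):

```python
# (object, attribute) pairs explicitly annotated as present in a report
annotated_positive = {("left lung", "lung cancer")}

def relation_label(obj, attr):
    """Anything not explicitly marked as present is treated as a 'no'."""
    return "yes" if (obj, attr) in annotated_positive else "no"

print(relation_label("left lung", "lung cancer"))  # yes
print(relation_label("trachea", "lung cancer"))    # no (never mentioned in the report)
```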
Q3: Is the answer distribution balanced only for choose and verify types?
A3: No, not only for those types. For query-type questions it is more complicated to construct such balancing logic, but the main scheme is the same as for the others: sampling answers that maximize the answer entropy.
Also, I will share the code asap.
@zihui-debug For the 'nlp' category, we've replaced its concept with our own "abnormality" concept by pre-defining it as a superset of the other four categories (i.e., 'anatomicalfinding', 'device', 'disease', and 'tubesandlines', excluding 'technicalassessment'). As for the 'texture' category, it occurs less frequently compared to the other categories and is tricky; it's more of a modifier that can attach to other attributes rather than an independent attribute.
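Roughly, the "abnormality" concept could be expressed like the following sketch (toy values, not the actual code): an object is considered abnormal if it has at least one positive attribute in any of the four categories.

```python
ABNORMALITY_CATEGORIES = {"anatomicalfinding", "device", "disease", "tubesandlines"}

def is_abnormal(positive_attributes):
    """`positive_attributes` is a list of (category, attribute) pairs present
    for a given object; 'technicalassessment' attributes are excluded."""
    return any(category in ABNORMALITY_CATEGORIES for category, _ in positive_attributes)

print(is_abnormal([("disease", "pneumonia")]))            # True
print(is_abnormal([("technicalassessment", "rotated")]))  # False
```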
Very helpful answer, thanks! About the nlp attribute (with value normal or abnormal): how do you deal with the case where both normal and abnormal annotations exist for the same object in the Chest ImaGenome gold dataset? For example, around lines 5061/5076 of the gold_object_attribute_with_coordinates.txt file (below is a screenshot of the file converted to CSV), upper mediastinum is labeled as both normal and abnormal.
@zihui-debug Sorry for the late reply! Yes, I guess these are unexpected scenarios that happened in their dataset (even in the gold dataset). For a given study id, raw-level annotations (i.e., sentence-level) can hold different attributes according to different sentences. So sentence A might be labeled as (upper mediastinum, normal, yes) and sentence B labeled as (upper mediastinum, abnormal, yes). It makes sense because each sentence has different semantics. In most cases (from my perspective), they aggregate them as follows: if there is even one abnormal label, then the final label for whether the upper mediastinum is normal or abnormal is set to abnormal. This is a natural way to conclude the final label for the report-level using sentence-level annotations.
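In other words, the aggregation can be thought of as something like the following sketch (illustrative, not the Chest ImaGenome code):

```python
# sentence-level annotations for one study
sentence_labels = [
    ("upper mediastinum", "normal"),    # from sentence A
    ("upper mediastinum", "abnormal"),  # from sentence B
]

def report_level_label(obj, sentence_labels):
    """If any sentence marks the object as abnormal, the report-level label is abnormal."""
    labels = [label for o, label in sentence_labels if o == obj]
    return "abnormal" if "abnormal" in labels else "normal"

print(report_level_label("upper mediastinum", sentence_labels))  # abnormal
```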
Thank you for your outstanding work! I'm sorry to bother you, but I still have questions about some details of the data construction process. I'm wondering how you choose the value of each placeholder when filling in the question templates. The Chest ImaGenome dataset contains a large number of annotations for anatomical regions and attributes that, if used in full, would yield a very large number of samples. So what is the specific sampling strategy? For example, when constructing an abnormality choose-type question related to an X-ray, is the anatomical region selected randomly? In addition, how do you set the negative sample among the candidate options?