baeseongsu / ehrxqa

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

About MIMIC-CXR-VQA #5

Closed: Eldo-rado closed this issue 3 months ago

Eldo-rado commented 3 months ago

👋 Hi! Great work, and thank you for your efforts.

MIMIC-CXR-VQA is generated based on Chest ImaGenome. I have recently been looking into content related to Chest ImaGenome and have some questions I'd like to share and discuss with you.

I think there is redundancy in the region-sentence pairs provided in Chest ImaGenome, e.g., a region bbox may correspond to many descriptive sentences, and often their relevance is not very high. Additionally, they do not strictly correspond one-to-one, e.g., for the left lung, the descriptions provided typically refer to the entire lung. Furthermore, the bboxes in the silver dataset also seem to be not very accurate.

Have you encountered these issues in your processing? Will these issues affect our research on bbox-sentence pairs in the future?

Thank you in advance! 😊

baeseongsu commented 3 months ago

Hi, @Eldo-rado

Sorry for the late reply, and thank you for reaching out to me. I hope my thoughts will be helpful in addressing some of your questions.

Regarding your statements:

I think there is redundancy in the region-sentence pairs provided in Chest ImaGenome, e.g., a region bbox may correspond to many descriptive sentences, and often their relevance is not very high.

  • A region bbox may correspond to multiple descriptive sentences because the same region bbox can be referenced by several sentences. For example, "left lung shows opacity. Also, pneumonia presents." results in two sentences referring to the same "left lung" anatomy or bbox. Semantically, however, each sentence may focus on a different subregion within the left lung (e.g., one sentence may refer to the full region of the left lung, while the other may pertain only to the part affected by the diagnosis). These nuances are not resolved in the Chest ImaGenome dataset, which is one of its limitations.
  • I also agree that some mappings may not be highly relevant because Chest ImaGenome authors used a pre-defined text mining algorithm to extract (object, attribute) pairs from each sentence independently without (1) analyzing the CXR images or (2) considering semantics from all sentences.

Additionally, they do not strictly correspond one-to-one, e.g., for the left lung, the descriptions provided typically refer to the entire lung.

  • In the example you mentioned, a one-to-one mapping may not always be possible due to the compositional nature of anatomical locations. For instance, when a description refers to the entire lung (e.g., "lung ..."), it is connected with the lungs (at the lung level), the right lung, and the left lung, so it makes sense that the left lung is associated with such descriptions.
  • This is why they followed the object-object, object-attribute, attribute-attribute ontology when mapping (object, attribute) pairs from the sentences. Please refer to the Chest ImaGenome papers for more details.

Furthermore, the bboxes in the silver dataset also seem to be not very accurate.

  • Yes, as you mentioned, the silver dataset is machine-generated, so the accuracy of the bboxes depends on the performance of the detection model.
  • The bbox detection model used in the Chest ImaGenome dataset is not state-of-the-art and may not have been trained on enough data to accurately capture even the left or right lung. If needed, you could consider building a better bbox detection model using state-of-the-art algorithms from the general domain.
  • Most open-source bbox detection models for CXR images are limited to lung-level regions (left lung, right lung) and the cardiac region, making it challenging to capture other small areas (e.g., the aortic arch).
  • Finally, in the MIMIC-CXR-JPG dataset, some images are ill-positioned (tilted, or partially cut off, e.g., missing the clavicle), while others are incorrectly labeled (e.g., labeled as a frontal view but actually taken as a lateral view). These issues make accurate detection difficult.
  • To mitigate this, we added a simple but effective preprocessing step to obtain well-conditioned images (ones that contain all bounding boxes) and to remove outliers.
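
For intuition, here is a minimal sketch of that kind of per-location width filter. The DataFrame layout and column names below are illustrative assumptions, not the actual Chest ImaGenome fields or our exact preprocessing code:

```python
import pandas as pd

def drop_width_outliers(bboxes: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Keep only images whose bbox widths stay within k standard deviations of the
    per-anatomical-location mean. Assumes (hypothetical) columns: 'image_id',
    'bbox_name', and 'bbox_width'."""
    # Per-location mean and standard deviation of the bbox width.
    stats = bboxes.groupby("bbox_name")["bbox_width"].agg(["mean", "std"])
    merged = bboxes.join(stats, on="bbox_name")

    # Flag rows that deviate from their location's mean width by more than k std.
    merged["is_outlier"] = (merged["bbox_width"] - merged["mean"]).abs() > k * merged["std"]

    # Keep an image only if none of its per-location widths is an outlier.
    bad_images = merged.loc[merged["is_outlier"], "image_id"].unique()
    return bboxes[~bboxes["image_id"].isin(bad_images)]
```

The same idea extends to bbox heights or to boxes that fall partially outside the image frame.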
baeseongsu commented 3 months ago

By the way, I am confused about the concept of (region, sentence) pairs. At which level of region do you mean "region" is the corresponding part of the sentence? For example, in the sentence "pneumonia presents in the left lung", what should be the best "region" for your research work or idea?

Eldo-rado commented 3 months ago

By the way, I am confused about the concept of (region, sentence) pairs. At which level of region do you mean "region" is the corresponding part of the sentence? For example, in the sentence "pneumonia presents in the left lung", what should be the best "region" for your research work or idea?

If the sentence is "pneumonia presents in the left lung" and the original report does not specify pneumonia in a more specific region (e.g., upper left lung), I believe the best region can be considered as the left lung.

My understanding is that if there is no more specific localization of a disease within the entire report, we can still use the broader region. For example, in the sentences "left lung shows opacity. Also, pneumonia presents." both refer to the same left lung (bbox). However, if the report specifies "pneumonia presents in the left lower lung", perhaps pneumonia can directly correspond to the left lower lung (bbox) without needing to correspond again with the left lung. In Chest ImaGenome, in addition to the pneumonia-left lower lung pair, there is still a pneumonia-left lung pair. This may not be incorrect, but it seems somewhat redundant?

I am more concerned about issues caused by these containment relationships between regions. Perhaps the example above does not show it clearly. In another example, when the sentence is "pneumonia presents in the lungs", the pairs in Chest ImaGenome would include "pneumonia presents in the lungs"-left lung (bbox) and "pneumonia presents in the lungs"-right lung (bbox).

I think this correspondence may not be very rigorous. It may not matter much for the VQA task, since a question can still be constructed as:

Q: Is there pneumonia in the left lung? A: Yes.

However, for the report generation task, it seems unreasonable to require the model to infer the sentence "pneumonia presents in the lungs" from only the single-sided left (or right) lung bbox?

I'm not sure if my understanding is correct, and I would appreciate hearing your thoughts!

Eldo-rado commented 3 months ago

Hey, Seongsu! Thank you for your detailed and valuable reply. I will review it carefully.

baeseongsu commented 3 months ago

Hi, @Eldo-rado I left my comments! If you have any further questions or comments, please reach out to me at any time :)

My understanding is that if there is no more specific localization of a disease within the entire report, we can still use a broader range. For example, in the sentences "left lung shows opacity. Also, pneumonia presents." both refer to the same left lung (bbox). However, if the report specifies "pneumonia presents in the lower left lung" perhaps pneumonia can directly correspond to the lower left lung (bbox) without the need to correspond again with the left lung. In Chest ImaGenome, in addition to pneumonia-left lower lung, there is still pneumonia-left lung. This may not be incorrect, but it seems to be somewhat redundant?

And I am more concerned about some issues caused by inclusive relationships. Perhaps the above example is not very obvious. In another example, when the sentence is "pneumonia presents in the lungs", the pairs in Chest ImaGenome would include "pneumonia presents in the lungs"-left lung (bbox) and "pneumonia presents in the lungs"-right lung (bbox). I think this correspondence may not be so rigorous. Maybe it doesn't have much impact on the VQA task, for example, it can be constructed as: Q: is there pneumonia presents in the left lungs? A: yes. However, in the report generation task, it seems unreasonable to require the model to infer the sentence "pneumonia presents in the lungs" only through the single-sided left/right lung bbox?

Eldo-rado commented 3 months ago

Hey! Thanks for your reply.

For the report generation task, there are some methods that directly use the region-sentence pairs provided by Chest ImaGenome for generation. For example, given 'left lung', the model needs to generate 'pneumonia presents in the lungs' exactly as in the ground truth. Perhaps some additional alignment is necessary.

Thanks again for your answer, I've gained a lot.

Eldo-rado commented 3 months ago

Hey, Seongsu!

Sorry to bother you again, but after carefully reading the paper and its appendices, I have the following questions that I would like to confirm.

  1. For the Timeframe Adjustment mentioned in section 4.1.1, could you please explain the following paragraph to me? I didn't quite understand its meaning.

To enable relative time expressions like ‘last year’, we set ‘2105-12-31 23:59:00’ as the current time and excluded any records beyond this point. We consider patients without hospital discharge times, due to this exclusion, as currently admitted.

  2. For the Outlier removal mentioned in section B.2.2, could you provide an example for me? I didn't quite understand what is meant here.

We discard images with widths exceeding three standard deviations from the mean for each anatomical location.

  3. I'd like to know how the gender information was obtained. I checked Chest ImaGenome, and the paper states:

‘gender’ and ‘age_decile’ demographics (from MIMIC-CXR’s metadata)

But I didn't find any gender-related information in mimic-cxr-2.0.0-metadata.csv. The column headers are as follows:

dicom_id, subject_id, study_id, PerformedProcedureStepDescription, ViewPosition, Rows, Columns, StudyDate, StudyTime, ProcedureCodeSequence_CodeMeaning, ViewCodeSequence_CodeMeaning, PatientOrientationCodeSequence_CodeMeaning

  4. In MIMIC-CXR-VQA, there are 36 objects, but in the original Chest ImaGenome paper, it seems that 29 objects were used. What is the difference between them?

  5. For timestamps, we can find StudyDate and StudyTime in mimic-cxr-2.0.0-metadata.csv. However, the authors of MIMIC-CXR explain on their official website that StudyDate represents an anonymized date for the radiographic study, and that all images from the same study share the same date and time. Can we infer the chronological order of studies based on StudyDate?

Thank you in advance!

baeseongsu commented 3 months ago

Hi, @Eldo-rado!

I left my comments as follows:

  1. For the Timeframe Adjustment mentioned in section 4.1.1, could you please explain the following paragraph to me? I didn't quite understand its meaning.
    Context: To enable relative time expressions like ‘last year’, we set ‘2105-12-31 23:59:00’ as the current time and excluded any records beyond this point. We consider patients without hospital discharge times, due to this exclusion, as currently admitted.

    • When dealing with relative time expressions like "last year" in user queries (e.g., "How many patients had Dx code XYZ in the last year?"), it's essential to establish a reference point for the current time. Without specifying the current time, it's impossible to determine the exact time span for "last year" or any other relative time expression.
    • After shifting patients' medical records (original MIMIC-IV and MIMIC-CXR timeframe is too broad), there could be some medical events beyond the current time. However, since we set up the current time, it is rational to conclude that there should be no medical records beyond this point, as they would represent future medical events. Therefore, we remove these medical records.
    • Aligning with this perspective, we consider patients who do not have a discharge time (because it was removed by this exclusion) as currently admitted patients. A small sketch of this cutoff logic is included at the end of this comment.
  2. For the Outlier removal mentioned in section B.2.2, could you provide an example for me? I didn't quite understand what is meant here. Context: We discard images with widths exceeding three standard deviations from the mean for each anatomical location.

  3. I'd like to know how the gender information was obtained. I checked Chest ImaGenome, and the paper states: Context: ‘gender’ and ‘age_decile’ demographics (from MIMIC-CXR’s metadata)
    But I didn't find any gender-related information in mimic-cxr-2.0.0-metadata.csv. The column headers are as follows: Context: dicom_id, subject_id, study_id, PerformedProcedureStepDescription, ViewPosition, Rows, Columns, StudyDate, StudyTime, ProcedureCodeSequence_CodeMeaning, ViewCodeSequence_CodeMeaning, PatientOrientationCodeSequence_CodeMeaning

    • Since our dataset is built on MIMIC-CXR, Chest ImaGenome, and MIMIC-IV, you can obtain the demographic information by linking the patients.csv file in MIMIC-IV to the patient ID (i.e., subject_id) in the MIMIC-CXR or Chest ImaGenome dataset. A small sketch of this join is included at the end of this comment.
  4. In MIMIC-CXR-VQA, there are 36 objects, but in the original Chest ImaGenome paper, it seems that 29 objects were used. What is the difference between them?

    • The object pool parsed from the report text (36 anatomical objects) and the object pool used for bounding boxes (29 anatomical locations) are different. For more details, please check the Chest ImaGenome documentation: https://physionet.org/content/chest-imagenome/1.0.0/
  5. For timestamps, we can find StudyDate and StudyTime in mimic-cxr-2.0.0-metadata.csv. However, the authors of MIMIC-CXR explain on their official website that StudyDate represents an anonymized date for the radiographic study, and all images from the same study will share the same date and time. Can we infer based on StudyDate?

    • Yes, you're right. All images from the same study will share the same date and time. I am not sure if we can infer the exact time for each image within the same study from the original DICOM file's meta information (MIMIC-CXR dataset, not MIMIC-CXR-JPG). However, I guess we probably cannot infer or differentiate the exact time of each image.
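
To make the first and third points more concrete, here are two minimal sketches. The table layouts below are simplified stand-ins in the spirit of the MIMIC-IV and MIMIC-CXR schemas, not our exact preprocessing code.

A sketch of the timeframe cutoff from point 1: fix the current time, drop records that fall after it, and treat patients whose discharge time was removed as currently admitted.

```python
import pandas as pd

# Fixed "current time" used to anchor relative expressions like "last year".
CURRENT_TIME = pd.Timestamp("2105-12-31 23:59:00")

# Hypothetical admissions table in the spirit of MIMIC-IV's admissions data.
admissions = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "admittime": pd.to_datetime(["2105-01-10", "2105-12-20", "2106-02-01"]),
    "dischtime": pd.to_datetime(["2105-01-20", "2106-01-05", "2106-02-10"]),
})

# Drop records that start after the current time (they would be "future" events).
admissions = admissions[admissions["admittime"] <= CURRENT_TIME].copy()

# Discharge times beyond the current time are excluded as well; patients whose
# discharge time was removed this way are treated as currently admitted.
admissions.loc[admissions["dischtime"] > CURRENT_TIME, "dischtime"] = pd.NaT
admissions["currently_admitted"] = admissions["dischtime"].isna()
```

And a sketch of the demographic linking from point 3: join MIMIC-IV's patients.csv to the CXR metadata via the shared subject_id (the patients.csv columns, e.g., gender, are assumed to follow the MIMIC-IV schema).

```python
import pandas as pd

# Illustrative paths; adjust to your local copies of the datasets.
cxr_meta = pd.read_csv("mimic-cxr-2.0.0-metadata.csv")  # has subject_id, study_id, ...
patients = pd.read_csv("patients.csv")                  # MIMIC-IV: subject_id, gender, ...

# Attach gender (and other demographics) to each CXR record via subject_id.
cxr_with_demo = cxr_meta.merge(
    patients[["subject_id", "gender"]], on="subject_id", how="left"
)
```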
Eldo-rado commented 3 months ago

Thank you for your response.

Regarding the last point (about StudyDate and StudyTime), if we don't know the exact time, how should we handle "compare to the previous x-ray" mentioned in the report?

Once again, I express my gratitude and admiration for the excellent dataset integration, cleaning, and restructuring work.

baeseongsu commented 3 months ago

@Eldo-rado,

Yes, that's a good point. In our research work, all images within a single CXR study are treated as reflecting the same patient status, because the time gap between the images is negligible. Therefore, we regard "the previous X-ray image" as the previous X-ray study, considering the user's information-seeking needs. The keyword "previous" is more applicable to comparing studies than to comparing images within a single study.

Eldo-rado commented 3 months ago

I'm a bit confused about the meaning of "we regard the previous X-ray image as the previous X-ray study."

Using the MIMIC-CXR directory structure as an example, my understanding is that p11000416 (subject_id) contains all the examinations for a patient, and s55590752 (study_id) represents a single examination. However, since we cannot determine the chronological order between study_ids, when s55590752 mentions "previous x-ray", which study_id should we choose as the "previous x-ray"?

baeseongsu commented 3 months ago

@Eldo-rado,

First, in the MIMIC-CXR-JPG dataset, we can determine the chronological order between study IDs using the study date and time. As mentioned in the previous comment, I want to clarify that we cannot determine the chronological order between image IDs within a single study.

In your case, if the user mentions the previous X-ray study of the study "s55590752," you can find the previous CXR study, if it exists, by using the order of study date and time. In other words, if the user mentions the previous study of the (i+1)-th study of a patient, then the previous CXR study will be the i-th study.
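
As a rough illustration, here is a sketch of that ordering using the metadata columns mentioned earlier (the grouping and sorting details are a simplification, not the exact EHRXQA code):

```python
import pandas as pd

# MIMIC-CXR-JPG metadata; the path is illustrative.
meta = pd.read_csv("mimic-cxr-2.0.0-metadata.csv")

# One row per study: StudyDate (YYYYMMDD) and StudyTime (HHMMSS.ffff) are shared by
# all images of a study, so taking the first value per study is sufficient.
studies = (
    meta.groupby(["subject_id", "study_id"], as_index=False)
        .agg(StudyDate=("StudyDate", "first"), StudyTime=("StudyTime", "first"))
)

# Sorting numerically by (StudyDate, StudyTime) yields each patient's chronological
# study order; the previous study of the (i+1)-th study is then simply the i-th row.
studies = studies.sort_values(["subject_id", "StudyDate", "StudyTime"])
studies["prev_study_id"] = studies.groupby("subject_id")["study_id"].shift(1)
```

For a patient's first study, prev_study_id is NaN, i.e., there is no previous study to compare against.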

You can find related information about temporality with CXR in the metadata from the MIMIC-CXR-JPG dataset and other research works, such as MS-CXR-T or CheXRelNet. Additionally, the Chest ImaGenome dataset contains descriptions on this topic. You can search for the keyword "chronologically" in the Chest ImaGenome paper to find relevant information.

Eldo-rado commented 3 months ago

I understand your point. However, I'm concerned about whether this approach might be inaccurate.

For example, MIMIC-CXR-JPG states: "StudyDate - An anonymized date for the radiographic study." I want to confirm whether, even though it's an anonymized date, the chronological order is still correct. Additionally, for StudyDate, is it assumed that a larger value represents a more recent time? And do we assume that the "compare to ___" mentioned in the report is compared to the most recent previous x-ray in terms of date?

I'm not sure if my concern is redundant or if there's information I haven't paid attention to. Thank you for your patient answers.

baeseongsu commented 3 months ago

@Eldo-rado

Thank you for specifying your concerns. I hope the MIMIC-CXR paper (https://www.nature.com/articles/s41597-019-0322-0) helps you understand the above points. I'll just refer to the relevant content:

For example, MIMIC-CXR-JPG states: "StudyDate - An anonymized date for the radiographic study." I want to confirm whether, even though it's an anonymized date, the chronological order is still correct.

Additionally, for StudyDate, is it assumed that a larger value represents a more recent time?

And do we assume that the "compare to ___" mentioned in the report is compared to the most recent previous x-ray in terms of date?

  • "During routine care, radiologists have access to brief text summarizing the underlying medical condition, the reason for examination, and prior imaging studies performed. The PACS workstation used by clinicans to view images allows for dynamic adjustment of the mapping between pixel value and grey-level display (“windowing”), side-by-side comparison with previous imaging, overlaying of patient demographics, and overlaying of imaging technique. Reports are transcribed during reading of an image series using a real-time computer voice recognition service."
  • Based on my personal discussions with clinicians, in routine care they do not always compare against only the most recent previous study when writing the prior/comparison information.
Eldo-rado commented 3 months ago

Thank you for your answer. I had not initially read the original MIMIC-CXR paper; that was my oversight. Thanks again, I learned a lot.