Closed Qianxiaoxie917 closed 8 months ago
Yeah, that sounds like a data quality issue. I'm not sure why it happens, but if you think about how many tens of thousands of individuals have pregnancy marked as 0, it's not unreasonable that ~50 or so documentation errors are made. I usually solve this by identifying two or more documented data elements. This is what you are doing (with datetimeevents and chartevents). If I look at the group of individuals with both of the itemids you reference, though, I only see errors for itemid 225082 (chartevents). The due dates from datetimeevents seem to accurately identify pregnancy. I would guess this is a workflow thing: they probably have to document pregnant yes/no for everyone, but they only actively document a due date when they need to. As a result, the due date itemid is better quality.
It depends on your study, but if you want accurate identification of pregnant individuals at the cost of sample size, I would require both a documented due date and the pregnancy flag to be 1. Otherwise, you could consider parsing through the discharge summaries, since they will always mention pregnancy and there are only ~120 or so individuals with an age > 30.
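As a rough illustration of the "require both" rule, here is a minimal Python sketch. It assumes the rows for itemid 225082 (chartevents) and itemid 225083 (datetimeevents) have already been pulled out as simple tuples keyed by stay_id; the tuple layout, the numeric flag value, and the function name `strict_pregnancy_cohort` are hypothetical choices for this example, and the rows are made up.

```python
# Hypothetical, hand-made rows standing in for query results (not real MIMIC-IV data).
# chartevents rows for itemid 225082: (stay_id, flag value)
pregnancy_flags = [(1, 1), (2, 1), (3, 0), (4, 1)]
# datetimeevents rows for itemid 225083: (stay_id, due date as text)
due_dates = [(1, "2150-03-01"), (3, "2151-07-15"), (4, "2149-11-20")]


def strict_pregnancy_cohort(flags, dates):
    """Return stay_ids with the pregnancy flag equal to 1 AND a documented due date."""
    flagged = {stay_id for stay_id, value in flags if value == 1}
    with_due_date = {stay_id for stay_id, _ in dates}
    # Set intersection keeps only stays present in both sources.
    return sorted(flagged & with_due_date)


cohort = strict_pregnancy_cohort(pregnancy_flags, due_dates)
print(cohort)
```

Stay 2 is dropped because it has no due date, and stay 3 is dropped because its flag is 0, which is the trade of sample size for accuracy described above.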
Thank you very much for your insightful response. Indeed, as you pointed out, due dates from datetimeevents provide a reliable indication of pregnancy, although relying solely on this data significantly limits the sample size.
In my research, obtaining a sufficiently large sample of pregnant women is crucial. However, accurately identifying pregnancy from the discharge summary texts has proven challenging. I've attempted to use keywords like "pregnancy" and "gestational" for extraction. Additionally, I've explored radiology texts and found some relevant information there as well.
Given your experience, how do you efficiently extract accurate pregnancy information from such texts? Also, considering the potential inaccuracies with itemid 225082 in chartevents, would you advise still using it in conjunction with text extraction from discharge summaries and radiology reports? Or is it preferable to primarily focus on text-based information to ensure accuracy?
Hmm, you could try looking for a regex like G[0-9]P[0-9] or [0-9]+w[0-9]d, which are commonly documented for pregnancy. Sometimes the deidentification incorrectly removes the weeks/days though, which makes it harder (sorry!).
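For what it's worth, the two patterns can be tried with Python's standard `re` module along these lines. This is only a sketch: the function name `find_pregnancy_mentions` and the example sentence are invented for illustration, and the patterns are used exactly as written above (case-sensitive, single-digit gravida/para and days).

```python
import re

# The two patterns suggested above, compiled as-is.
GRAVIDA_PARA = re.compile(r"G[0-9]P[0-9]")
GESTATIONAL_AGE = re.compile(r"[0-9]+w[0-9]d")


def find_pregnancy_mentions(text):
    """Return every substring of the note that matches either pattern."""
    return GRAVIDA_PARA.findall(text) + GESTATIONAL_AGE.findall(text)


# Invented example sentence, not taken from MIMIC-IV.
note = "32 y/o G2P1 at 28w3d presenting with headache."
print(find_pregnancy_mentions(note))
```

A note where deidentification has removed the weeks/days would only match the first pattern, so a hit on either pattern is probably a more useful screen than requiring both.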
I think text-based from the discharge summary is the way to go if you want to increase the sample size. As tedious as it is, if you set up your interface well and subselect to the history of present illness section, you can go through a number of discharge summaries quite quickly.
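Subselecting to the history of present illness section could look roughly like the sketch below. It assumes the section starts with a "History of Present Illness:" header and runs until the next line that looks like another "Some Header:" line (or the end of the note); that layout, the function name `extract_hpi`, and the example note are assumptions for illustration, so check them against the actual discharge summaries first.

```python
import re

# Assumed layout: a "History of Present Illness:" header (any capitalization),
# with the section ending at the next capitalized "Header:" line or at end of note.
HPI_PATTERN = re.compile(
    r"(?i:history of present illness):\s*(.*?)(?=\n[A-Z][A-Za-z /]*:|\Z)",
    re.DOTALL,
)


def extract_hpi(note_text):
    """Return the HPI section text, or None if no such header is found."""
    match = HPI_PATTERN.search(note_text)
    return match.group(1).strip() if match else None


# Invented example note, not taken from MIMIC-IV.
example_note = (
    "Chief Complaint:\nHeadache\n\n"
    "History of Present Illness:\n"
    "29F G1P0 at 30w2d with headache.\nNo fever.\n\n"
    "Past Medical History:\nNone\n"
)
print(extract_hpi(example_note))
```

Showing only this section in whatever review interface you use keeps each note short enough to read quickly.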
Does value of itemid 225082 (chartevents) = 0 correspond to people without pregnancy?
Description
I've been working with the MIMIC-IV database to extract information related to pregnant women. For this purpose, I used itemid = 225082 in the chartevents table to identify pregnancies, and itemid = 225083 in the datetimeevents table for pregnancy due dates.
Upon merging this data with the age table, I observed an anomaly where the age of women identified as pregnant exceeded 50 years, with this age group constituting over 50% of the dataset. This result is unexpected and has led me to question the accuracy of the data extraction or the underlying data itself.
I'm reaching out to inquire if you have encountered a similar issue or if there might be a known explanation for this observation. Any guidance or suggestions you could offer to help clarify this situation would be greatly appreciated.
Thank you for your time and support.