Concept Mapping / Aliases

jeblundell commented 8 years ago

I was wondering whether people had opinions on the contribution of supplementary data. I've started to try to analyse chartevents and it's a little frustrating to find out that PaO2 and PaCO2 have about 5 or 6 different item names each. It would be particularly helpful if we could come up with a way of aliasing them.

I would obviously very much prefer to not contaminate the mimiciii schema and keep that purely for the original MIMIC data, but perhaps we could add a contrib schema for data such as these? (Maybe contribiii to make it clear that it goes with mimiciii?)

contrib.aliases alias_id aliasname

contrib.d_items_aliases row_id alias_id (references contrib.aliases.alias_id) itemid (references d_items.itemid)

This does introduce a whole host of potential issues, again relating to database versioning: I can hopefully do a sensible first-pass of various measures (PaO2, PaCO2, pH etc.), but there may come a point where another intensivist may turn around and say that actually "SpO2 XYZ" is actually not really the same as "SpO2 ABC" and "SpO2".

Would a possible solution in this case be to add some sort of version column to aliases? So that if someone decides to do their analysis with my incredibly flawed versions of aliases, then they can still replicate their data or (hopefully) choose the later version.

Another issue to consider with this sort of schema is whether there should be a unique mapping of items to aliases, i.e. could someone theoretically create a view that does SELECT * FROM chartevents and tacks on a column representing aliasname? I'm tempted to argue against this as it would potentially preclude versioning/other alias interpretations etc.

Thoughts?

tompollard commented 8 years ago

Thanks for starting this conversation Jim.

Consolidating concepts is something that we've discussed a fair amount, but haven't yet addressed. We actually left an empty 'conceptid' column in the d_items table to act as a placeholder for the merged itemids.

@parisni will be looking at consolidating concepts as part of his PhD and I think @mghassem knows of others working on it too.

If you would like to make a start on the merging the concepts immediately, one option would be to create a new folder in the MIMIC Code Repository called something like "conceptmaps", and then add an SQL script to generate a table with itemids mapped to a set of consolidated ids.

Versioning is an issue that we need to address across the repo, but I think adding a "last updated" column to the concept map would be a good start.

turambar commented 8 years ago

@jeblundell @tompollard as per our email exchange, I've also begun this effort as well with the help of my collaborator Zack Lipton from UCSD and some folks at both Children's Hospital LA and Stanford. A TL;DR summary of my comments:

Zack and I are creating and publishing some benchmark clinical multivariate time series tasks and data sets -- clinical "MNIST" as it were. The first will be related to classification (i.e., phenotyping). I know some other folks who want to contribute to this effort and use the data for other purposes.
Our intention is to NOT to share an actual transformed data set -- or to modify the MIMIC3 data itself -- but rather to publish code (python scripts + YAML configs) that anyone with a MIMIC3 download can use to create the benchmark data set. This way we can properly track and version revisions of the benchmark.
I have already mapped ~250 lab and chart ITEMs to conceptual variables. I have shared the mapping Google Sheet with you guys, but if others are interested, let me know. The mapping was done manually and without much clinician input, but I've mostly handled obvious cases, so it's likely mostly correct.
I have a (hastily written) python pipeline that performs the mapping, handles units transformations, handles edge cases (e.g., temperatures that are clearly F marked as C), etc.

Dave

turambar commented 8 years ago

@jeblundell @tompollard regarding contaminating the schema, versioning, etc.: I was leaning toward keeping my effort SEPARATE from the core MIMIC3 store. I'd like a community of researchers to be able to agree on and use a benchmark without imposing it on everyone else. That said, perhaps there is a way to have our cake and eat it, too. Maybe "outside" efforts creating task-specific mappings can just forge on ahead as they see fit but then a sanctioned "internal" effort (which might include some of the same people) can synthesize feedback from those efforts and create a canonical mapping that goes into the MIMIC3 store itself.

jeblundell commented 8 years ago

Myself and @nickopotamus can certainly help from a physician point of view and some of the grunt work of getting some mappings done.

I was planning on submitting a pull request with provisional approach and some input from @nickopotamus that could be built on/rejected by the team, so can do that and see where we go?

turambar commented 8 years ago

@jeblundell happy to take a look at whatever you have. If you have a branch or pull request # or something on github, send it my way (I can only look -- no actual power over MIMIC repos).

turambar commented 8 years ago

Some additional thoughts:

Our collaborative mapping framework should have a low overhead and should NOT impose a particular toolchain (python, SQL, etc.). In particular, the "mapping" should be stored (or exportable) in some simple file format readable by ubiquitous tools, rather than, say hard-coded in Python or SQL. In the short term, maybe we can start with something like the Google (spread)Sheet I have created.
We need to ensure that collaboration does not create blockers for individual projects. An example: mapping things to known nomenclatures and ontologies. This is incredibly valuable long-term but not be strictly necessary for, e.g., building benchmark datasets for the ML community. We'd be satisfied with the best effort of a clinical expert (or motivated grad student armed with Wikipedia) in identifying variables and mapping ITEMs.

jeblundell commented 8 years ago

Thanks for the spreadsheet - looks like an epic amount of work you've done, so good job!

Very much agree with your thoughts re: simple format and toolchain neutrality. It looks like we're gravitating towards CSV as straightforward as 2 columns of concept name and itemid. What've you got as your general schema at the moment for generating the data behind that spreadsheet?

Our first pass is with an alias_id, but I can always adapt our SQL scripts to in such a way that the CSV is still useful, i.e. we'll try to get our alias name's to match up to whatever names you have in your spreadsheet so you don't have to put unnecessary work in (and will let you know if we need to make changes).

Few of things from medicine POV:

pH from ABG and VBG we can probably safely bundle together as although there is a systematic difference between them, it's small enough to not be too worried about for our purposes. Though that's just my view :)
Although you've not done pO2 yet: very definitely cannot bundle arterial and venous O2 together.
Bundling invasive and non-invasive blood pressure together's going to be fairly problematic as although they're correlated, non-invasive in particular is somewhat fraught with issues (blood pressure cuff size, position, automated vs manual, presence of atrial fibrilation). Does raise another thing to add to the pile of issues for sensible mappings: in the absence of an art-line, non-invasive is better than nothing. However, in the presence of a (fully working) art-line, probably best to ignore the non-invasive. So really there needs to be, at some point, a way of deciding whether the art-line's giving sensible readings, in which case ditch the NIBP, and just use ABP, otherwise use NIBP as it's a reading of some sort. For now, best separate into ABP and NIBP.
Similarly, if a patient only has a weight for drug calculations, then that's a better estimate than nothing, but absolutely want to scrap it if there's an actual weight: reason being that some drug doses will be on ideal body weight, which will usually be a lot lower for people with a high BMI.
Basically at some point (almost certainly not in our first pass as it's a whole host of problems) we need a way for things to have a priority flag: something to indicate "if nothing like this today/this minute/this hour/whatever, then you can use this if you must and is basically the same"
Conveniently some of the ones that you seem to be missing were some of the ones that I started on (FIO2, etCO2, pO2, pCO2 etc.), so can hopefully fill in the blanks. I've similarly run into an issue like with the Fahrenheit vs Celcius issue, where FIO2 is sometimes 0-100 and sometimes 0-1.

I'll get working with @nickopotamus on finishing up our set of mappings and we can then try and merge them with yours to get a definitive first version list (then sort out the various discrepancies between us). Rather nice that we've effectively done the standard >1 person systematic review approach to make sure we cover all the bases properly!

jeblundell commented 8 years ago

Btw, what does the mean/stdev correspond to? Definitely very sensible to check for crazy outliers (misbehaved units!), though am a little concerned that there's even a single patient with a pH <1!

nickopotamus commented 8 years ago

Thanks @jeblundell for including me in this thread; such a mapping was looking a mammoth task for my own work using MIMIC, but as a relative SQL novice I can probably contribute more to the clinical side here.

Personally I'd handle venous and arterial values independently; there are subtle and not so subtle differences between central and mixed venous, capillary, and arterial gases. Once they're identified as such it would be straightforward for any clinical application which treats them as equivalent to then strip a value of pH from the available gases.

The same applies to blood pressure measurements: flagging NIBP and IBP as independent variables allows comparisons between the two, which can be interesting in its own right, and it would be trivial to extract "a blood pressure" if all potential IBP and NIBP aliases are known.

arupakaa commented 8 years ago

@turambar, @jeblundell

Hi, all! This sounds very similar to something I have been looking to get started on. And it seems like you have already put in a lot of good work. :) What kind of assistance could you use?

"Grunt work" sounds like what I would be useful for at this stage. I do have access to Wikipedia, and perhaps also Google...

Cheers,

turambar commented 8 years ago

@aruberutou @nickopotamus if you have preferred Google Drive accounts, send them to me and I can give you access to my spreadsheet.

arupakaa commented 8 years ago

@turambar Thanks! I have just sent an email to your listed account. Cheers,

mghassem commented 8 years ago

@turambar @nickopotamus @aruberutou I'd be strongly in favor of the first mapping being simple variable lookups, and I'm happy to be an additional reviewer on variables/item_id mappings. Side note - there's a rich literature on mapping signals to existing ontologies, and I can think of several people who would be very interested in contributing once the project moves that direction.

turambar commented 8 years ago

I'm gonna give replying via email a shot. Let's proceed with editing my spreadsheet. I'm added @nickopotamus and @aruberutou as editors. Anyone else who wants edit access, please let me know.

If you guys want to "unbundle" variables (invasive and noninvasive BP, pH, O2 sat, etc.), go ahead and do it in the column "LEVEL1." We will preserve my bundling (and maybe bundle other variables) in the "LEVEL2" column. If you need to correct things I've done, go right ahead.

I will try to add medications, inputs, etc., to this in the next day or two.

Please do start identifying new variables! I'm hoping you guys can help me identify and prioritize variables of interest -- anything you add for your project will be of interest to me. If you need a place to start, I'd suggest end-tidal CO2 and urine output, which I haven't been able to unravel in MIMIC. Perhaps also anything related to ventilation.

My workflow tends to be to use text search (either browser or the more robust Sheet "Find" that allows regular expressions) to find likely candidates. For a particular variable ("HR"), I look for all names ("heart rate"), variants ("pulse rate"), abbreviations ("HR"), misspellings ("heat rate"), etc. I mark all candidates (put "HR" in LEVEL1 column). I tend to stop once counts drop below a threshold (tens to hundreds, depending on total count). For those candidates I feel confident about, I put an "X" in the "USE" column. At that point, I usually pull all of the values for all candidate ITEMS and look at distributions (percentiles) to try to identify possible issues: outliers, different units, etc. As a non-MD, I often arm myself with some prior information using, e.g., Wikipedia. :-P

Two other steps:

1) Defining ranges and "normal" for each mapped variable. I have another spreadsheet for that, which I'll share with you. What I do is define:

dropBelow, dropAbove: a value below (above) which a measurement is clearly invalid and should be tossed
minValue, maxValue: minimum (maximum) allowed values. Things below minValue (but not dropBelow) are replaced by minValue. Same for maxValue and dropAbove.
normal: used for some imputation schemes

Note that for my work, I also use minValue and maxValue to rescale all values to [0, 1].

FYI, credit where credit is due: the terminology and structure of that spreadsheet is due to David Ledbetter and folks in the VPICU at Children's Hospital LA.

2) If I find a variable that requires special handling, such as unit conversion (especially if the units are ambiguous or erroneously marked), I just make a note. Right now I handle those as one offs, but maybe we can systematize. I'll add a Google Doc for this and add notes about the special cases I've already found, then share it.

Dave

nickopotamus commented 8 years ago

Cheers Dave. I've just started looking through the work you've already done - a mammoth undertaking!

I've come up with a couple of issues looking at bilirubin that it might be worth clarifying for continuity and ease of reference:

There are measurements both of serum total bilirubin, and the various fractions (protein-bound and unbound, conjugated/direct and and unconjugated/indirect). These can't be treated as equivalent for data analysis.
Have we a standard nomenclature for differentiating these? If not, shall we agree on one before doing too much processing? I've started calling them e.g. bilirubin-total however that's not very elegant. We could do with defining a system for breaking down subtypes of measurements.
Similarly there are also both continuous and categorical (e.g. urine dip) measurements of bilirubin in other fluids (urine, CSF, pleural fluid) which will need a naming standard to handle
Finally there are lots of chartevent itemids which are purely recording whether a bilirubin has been sent, mostly from the neonatal unit. Shall we have a flag for these to indicate that one of us has looked at these and decided that they're not relevant?

turambar commented 8 years ago

@nickopotamus thanks! By the way, you've taught me something new for Google Sheets (i.e., use of scripts and filters).

re: bilirubin, good to know. I'm pretty sure I made similar mistakes on other labs, e.g., WBC. Would LOVE to be better understand the different types and our options for handling them (i.e., can we integrate constituent parts into total?). I think that given enough data, my model of choice (neural nets) could learn functions for combining them or, say, estimating the total Bilirubin even with some constituents missing. Likewise for, say, estimating a latent "true pH" from just venous or arterial pH.

On a related note, I'd like to understand the nuances of measuring labs from the different body fluids. For example, what's the difference between blood, serum, pleural, ascites, urine, joint fluid, other body fluid, etc.

re: nomenclature, do whatever you think is best for now. We can iterate from there. I'd err on the side of being descriptive. I know I used some abbreviations (a convention I adopted by the Physionet Challenge 2012 data set, which I used as a target list of variables), but I'm going to change these back to full names, I think.

re: marking things as "not to be used" I suggest putting a "n" in the USE column. I've likewise changed "x" to "y." Blank entries are not yet decided.

I also added a NOTE column at the end for adding explanations.

nickopotamus commented 8 years ago

@turambar I'm happy to trade medical information for code any time ;) I'll send you an email with a bit more about bilirubin over the weekend...

I'll start going through the things you've already mapped in ernest next week before starting on new variables. Is it worth us also filling in the FLUID flag for ITEMIDs where it isn't already populated? This would prevent having names like bilirubin-serum and bilirubin-CSF, or all the various fluids sodium can be measured in? This would mean you'd have to select "biological fluid of interest" when wanting to extract variables; personally I think that makes it more straightforward to manipulate, but I'd be interested in your thoughts on how this would work with your code?

rgmark commented 8 years ago

You MUST differentiate bili-serum and bili-CSF. They cannot be mixed up! R Mark

nickopotamus commented 8 years ago

Absolutely @rgmark; I suppose the question I'm asking is what's the best way to do this. Do we give each alias a descriptive name including the measurement and fluid (e.g. 'bilirubin-total-serum' and 'bilirubin-csf') or do we provide a fluid flag for each alias (e.g. name=bilirubin, fluid=blood and name=bilirubin, fluid=csf)?

rgmark commented 8 years ago

use LOINC codes for all labs

nickopotamus commented 8 years ago

@rgmark that is exactly the type of schema I was looking for - thanks! @turambar perhaps we can code the aliases using the LOINC codes, then use your script (+/- aspects of the downloadable LOINC packages) to find the corresponding variables from human readable versions?

ahmed-alhindawi commented 8 years ago

Hi all, sorry for coming late to this game but I've been pondering a the problem for a while now. The best I could come up with is a tree structure that would help with aliasing.

The issues outlined above with regards to PaO2, NIBP/BP, Bilirubin...etc should be solvable if we nest the aliases.

E.g. top level for PaO2 could split into venous PaO2, arterial PaO2, central PaO2, Pulmonary PaO2 or if the researcher wishes to completely ignore the differences, they can just use the top level code with the aliases "flattened".

Thoughts on this approach?

nickopotamus commented 8 years ago

Hi @ngageorange sound sensible, but using a tree raises some confusion as to what point you do the branching (using the bilirubin example, do you branch at total/direct/indirect, or which fluid?). I think the best approach is using LOINC codes as @rgmark suggested, which would then allow such trees to be generated as users wish or using @turambar's scripts. See https://search.loinc.org/ for examples.

turambar commented 8 years ago

@rgmark @nickopotamus @ngageorange no objections to using LOINC codes for labs. No reason to re-invent an ontology where there already exists one -- assuming it is easy to use. Of course, learning such structure from the data would probably work even better (see https://arxiv.org/abs/1602.05568), but that's for future work!

In seriousness, from my POV, I'm trying not to let the perfect be the enemy of the good here. My collaborators want to release a task benchmark data set ASAP, so I'm going to take what we have so far and codify it in an initial version. But I'll accept whatever updates you guys can make in the near future!

turambar commented 8 years ago

If anyone on this thread would like an invitation to my private repository where I'm developing the python code to build benchmark data sets, let me know.

tompollard commented 8 years ago

Yes please @turambar! I've been meaning to catch up on this conversation...

rustyBilges commented 7 years ago

Yes please @turambar - I'd be happy to help also if you are still working on this!

Saqibm128 commented 7 years ago

Is this still an active area? Or has the itemid mapping effort died done somewhat?

turambar commented 7 years ago

@rustyBilges @Saqibm128 we've taken the initiative on this, resulting in the benchmark data set described in this manuscript and whose code is hosted here and maintained by the YerevaNN lab.

We are very interested in having other folks contribute by helping us expand and enrich the benchmark. If you'd like to get involved head over to the benchmark repository and start a thread or send us a PR. In particular, here are areas of need:

expand mapping to cover additional variables
add treatments, medications, inputs to the data set
add new prediction tasks
add clinical notes

Yovoss commented 6 years ago

@turambar Hi! I have browsed almost of the codes and paper that you mentioned before. And i have been focusing on the solution of the confusion of ID_ITEMS table which is definately a tedious and important work. So It looks like a great amount of work you've done and good job! I have a question about the file itemid_to_variable_map.csv in https://github.com/YerevaNN/mimic3benchmarks/tree/master/mimic3benchmark/resources, there are just about 370 ITEMIDs mapped as much as i know, so if this work is still going on, i'd like to contribute my power. Best wishes!

MIT-LCP / mimic-code

Concept Mapping / Aliases #107