Investigate/Augment Mappings Defined By The Global Codebook (Tier-1 Data Elements)

jkyu commented 7 months ago

While working on harmonization metrics, it has become clear that some mappings from non-harmonized data elements to Tier-1 data elements are not defined. For example:

Some RADx-rad data sets have a map study_id -> nih_record_id. This does not hold across the board -- other data sets do not map study_id to nih_record_id, so the mapping may be a mistake.

There are also cases where question text (as opposed to a header ID) is mapped directly to a Tier-1 data element.

Anyway, this issue is for logging exploration of Tier-1 data elements in harmonized transformcopy files and for finding out which data elements were mapped to them when there are discrepancies with the mappings defined by the global codebook.

jkyu commented 7 months ago

[x] Put together a "canonical" list of data element variable names with version tags and one-hot encoding markers collapsed.

jkyu commented 7 months ago

Added the data element variable name list here: https://github.com/bmir-radx/harmonization-metrics-scratch/blob/main/variable_names.json

I noticed that the variable tracker is missing DHT, so DHT variables are not included. Subsets of the variables can be obtained under the keys RADx-UP, RADx-rad and RADx-Tech. There's also a key all that gives a set of data variable names de-duplicated over all programs.

One issue to point out is that some variable names don't follow the expected pattern for cleaning. We do variable_name_2___5 -> variable_name, but some variables are named something like variable_name32. One could imagine removing the trailing numbers, but the following example makes me not want to implement this as a rule: covid_vax_date_dose1, covid_vax_date_dose2, covid_vax_date_dose3
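A minimal sketch of the cleaning rule described above, assuming the suffix layout is an optional version tag (`_2`) followed by an optional one-hot marker (`___5`). The function name `clean_variable_name` and the regular expression are illustrative, not the code actually used to build variable_names.json. Digits are stripped only when an underscore precedes them, so variable_name32 and the covid_vax_date_dose* names are left untouched.

```python
import re

# Assumed layout: <base>[_<version>][___<one-hot index>]
# The base is matched lazily so the optional suffixes are peeled off the end.
_SUFFIX_PATTERN = re.compile(r"(?P<base>.+?)(?:_\d+)?(?:___\d+)?")


def clean_variable_name(name: str) -> str:
    """Collapse version tags and one-hot markers; leave other names as-is."""
    match = _SUFFIX_PATTERN.fullmatch(name)
    return match.group("base") if match else name


if __name__ == "__main__":
    for raw in ["variable_name_2___5", "variable_name32", "covid_vax_date_dose1"]:
        print(raw, "->", clean_variable_name(raw))
```

One caveat of this sketch: a name that legitimately ends in `_<digits>` (an underscore followed only by digits) would also be collapsed, so such names would need an explicit exception list.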

jkyu commented 7 months ago

Included a subset that covers only variable names in the transformcopy files.

jkyu commented 7 months ago

Collected a list of matched files (studies for which both orig and transform copies exist) that contain tier-1 data elements that violate rules prescribed by the global codebook. So far, this seems to be caused by minor variations in element names (e.g., health status vs health_status) or by harmonized data elements that appear without explanation (e.g., nih_zip shows up in the transformcopy when no variation of zipcode shows up in the origcopy).

https://github.com/bmir-radx/harmonization-metrics-scratch/blob/main/missing_mappings.csv
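One way to flag the "minor variation" cases (health status vs health_status) is to compare names after a light normalization. This is only a sketch: `normalize_name` and `find_name_variants` are hypothetical helpers, and the normalization rules (lowercase, whitespace and hyphens turned into underscores) are assumptions rather than anything the global codebook prescribes.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase and replace runs of whitespace or hyphens with one underscore."""
    return re.sub(r"[\s\-]+", "_", name.strip().lower())


def find_name_variants(observed, codebook_names):
    """Map each observed name to the codebook name it differs from only cosmetically."""
    canonical = {normalize_name(n): n for n in codebook_names}
    variants = {}
    for name in observed:
        match = canonical.get(normalize_name(name))
        # Exact matches are not variants; only record cosmetic mismatches.
        if match is not None and match != name:
            variants[name] = match
    return variants


if __name__ == "__main__":
    print(find_name_variants(["health status", "health_status"], ["health_status"]))
```

This does not help with the second category (e.g., nih_zip appearing with no zipcode-like source element); those still need manual inspection.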

jkyu commented 7 months ago

What do I do here?

Project 96 (project96_DATA_origcopy_v1.csv and project96_DATA_transformcopy_v1.csv) has five tier-1 data elements in the transformcopy that don't have documented mappings in the global codebook. These are:

nih_smoking_yn
nih_history_smoking
nih_mental_health_disorder
nih_alcohol_yn
nih_alcohol_frequency

The global codebook does not list RADx-UP mappings for these elements.

I took a set difference of the data element names in the origcopy and the transformcopy. This gives us a list of data elements in the origcopy that are not in the transformcopy. This is presumably the set of data elements that were harmonized, i.e., the input side of the mapping function. I then pruned from this set any data element names that already have defined mappings. The result is as follows: 'project96': {'alcohol_daysperweek', 'self_reported_height_coded'}
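The set-difference step can be sketched as follows. The helper name `unmapped_source_elements` is hypothetical, and the column sets are abbreviated, reconstructed from element names mentioned in this comment rather than read from the actual project96 file headers.

```python
def unmapped_source_elements(orig_columns, transform_columns, mapped_sources):
    """Elements only in the origcopy, minus those with a documented mapping."""
    only_in_orig = set(orig_columns) - set(transform_columns)
    return only_in_orig - set(mapped_sources)


# Abbreviated, illustrative column sets (not the full file headers).
orig = {
    "alcohol_daysperweek", "self_reported_height_coded",
    "lifetime_use_alcohol", "smoker_cur_stat",
    "alcohol_date_mdy", "smoker_number",
}
transform = {
    "alcohol_date_mdy", "smoker_number",
    "nih_lifetime_use_alcohol", "nih_cig_smoke_freq",
    "nih_alcohol_yn", "nih_alcohol_frequency",
}
# Source elements assumed to have documented mappings in the global codebook.
mapped = {"lifetime_use_alcohol", "smoker_cur_stat"}

result = {"project96": unmapped_source_elements(orig, transform, mapped)}

if __name__ == "__main__":
    print(result)
```

With real data, the column sets would come from the header rows of the origcopy and transformcopy CSVs, and `mapped` from the source-side names in the global codebook.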

This partially explains what maps to nih_alcohol_yn and nih_alcohol_frequency. Upon further inspection, lifetime_use_alcohol is present only in the origcopy (it maps to nih_lifetime_use_alcohol), and alcohol_date_mdy is present in both the origcopy and the transformcopy. It seems that the presence of these other alcohol-related elements resulted in the inclusion of a set of alcohol-related tier-1 data elements.

For nih_mental_health_disorder, there doesn't seem to be anything that mapped to it. My best guess is that cc_depression or cc_otherchroniccond resulted in its inclusion, but these two have their own mappings (to nih_depression and nih_otherchronic_cond, respectively).

Something similar happens for the smoking-related data elements. smoker_cur_stat in the origcopy maps to nih_cig_smoke_freq and the origcopy and transformcopy both contain an element smoker_number. It seems that the inclusion of these smoking terms resulted in the addition of the tier-1 elements nih_smoking_yn and nih_history_smoking.

Here, we have a case of multiple many-to-many mappings, where the inclusion of some smoking or alcohol data elements in the origcopy mapped to a whole set of tier-1 data elements during the harmonization process.

bmir-radx / radx-project

Investigate/Augment Mappings Defined By The Global Codebook (Tier-1 Data Elements) #44