Open jkyu opened 7 months ago
Added the data element variable name list here: https://github.com/bmir-radx/harmonization-metrics-scratch/blob/main/variable_names.json
I noticed that the variable tracker is missing DHT, so DHT variables are not included. Subsets of the variables can be obtained under the keys `RADx-UP`, `RADx-rad` and `RADx-Tech`. There's also a key `all` that gives a set of data variable names de-duplicated over all programs.
One issue to point out is that some variable names don't follow the expected pattern for cleaning. We do `variable_name_2___5` -> `variable_name`, but some variables are named something like `variable_name32`. One could imagine removing the trailing numbers, but the following example makes me not want to implement this as a rule: `covid_vax_date_dose1`, `covid_vax_date_dose2`, `covid_vax_date_dose3`.
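For reference, here is a minimal sketch of the cleaning rule as I read it from the example above. The function name `clean_variable_name` and the exact regular expressions are assumptions for illustration, not the code in the repository. It strips a trailing `___<digits>` suffix and then a trailing `_<digits>` suffix, and deliberately leaves digits that are attached directly to a word alone.

```python
import re

# Assumed suffix patterns, inferred from variable_name_2___5 -> variable_name.
_TRIPLE_UNDERSCORE_SUFFIX = re.compile(r"___\d+$")  # e.g. "___5"
_SINGLE_UNDERSCORE_SUFFIX = re.compile(r"_\d+$")    # e.g. "_2"


def clean_variable_name(name: str) -> str:
    """Strip underscore-delimited numeric suffixes from a variable name."""
    name = _TRIPLE_UNDERSCORE_SUFFIX.sub("", name)
    name = _SINGLE_UNDERSCORE_SUFFIX.sub("", name)
    return name


cleaned = [
    clean_variable_name(n)
    for n in ("variable_name_2___5", "variable_name32", "covid_vax_date_dose1")
]
```

Because only underscore-delimited numbers are removed, `variable_name32` and the `covid_vax_date_dose*` names pass through unchanged, which avoids collapsing the three dose variables into one.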
Included a subset that covers only variable names in the transformcopy files.
Collected a list of matched files (studies for which both orig and transform copies exist) that contain tier-1 data elements that violate rules prescribed by the global codebook. So far, this seems to be caused by minor variations in element names (e.g., `health status` vs `health_status`) or by harmonized data elements that appear without explanation (e.g., `nih_zip` shows up in the transformcopy when no variation of `zipcode` shows up in the origcopy).
https://github.com/bmir-radx/harmonization-metrics-scratch/blob/main/missing_mappings.csv
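One way to surface the first kind of discrepancy (names that differ only in spacing, case, or punctuation) is to compare normalized forms. The sketch below is illustrative only; `normalize` and `group_name_variants` are hypothetical helpers, and the normalization rule is an assumption about what should count as "the same" name.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def group_name_variants(names):
    """Return groups of distinct names that share a normalized form."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    # Keep only normalized forms that have more than one spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


variants = group_name_variants(["health status", "health_status", "nih_zip"])
```

Here `variants` contains a single group keyed by `health_status`, holding both spellings, while `nih_zip` is left out because it has no variant.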
What do I do here?
Project 96 (`project96_DATA_origcopy_v1.csv` and `project96_DATA_transformcopy_v1.csv`) has five tier-1 data elements in the transformcopy that don't have documented mappings in the global codebook. These are:
The global codebook does not list a mapping for RADx-UP.
I took a set difference of the data element names in the origcopy and the transformcopy. This gives us a list of data elements in the origcopy that are not in the transformcopy. This is presumably the set of data elements on the input side of the mapping function, i.e., the ones that were harmonized. I then removed from this set any data element names that already have defined mappings. The result is as follows:
'project96': {'alcohol_daysperweek', 'self_reported_height_coded'}
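The set-difference step can be sketched roughly as follows. This is not the actual script: `read_header` and `unmapped_orig_only` are hypothetical names, the CSV contents are toy data built from element names mentioned in this issue, and the set of already-mapped sources is assumed to come from the global codebook.

```python
import csv
import io


def read_header(csv_text: str) -> set:
    """Return the column names from the first row of a CSV document."""
    reader = csv.reader(io.StringIO(csv_text))
    return set(next(reader, []))


def unmapped_orig_only(orig_header: set, transform_header: set, mapped_sources: set) -> set:
    """Elements only in the origcopy, minus those with a documented mapping."""
    return (orig_header - transform_header) - mapped_sources


# Toy headers and rows (assumed, for illustration only).
orig = read_header(
    "study_id,alcohol_daysperweek,lifetime_use_alcohol,alcohol_date_mdy\n1,3,1,2021-01-01\n"
)
transform = read_header(
    "study_id,alcohol_date_mdy,nih_lifetime_use_alcohol\n1,2021-01-01,1\n"
)
mapped = {"lifetime_use_alcohol"}  # sources with a mapping in the codebook

candidates = unmapped_orig_only(orig, transform, mapped)
```

In this toy case `candidates` is `{'alcohol_daysperweek'}`: it was dropped from the transformcopy but has no documented mapping, so it is a candidate input for an undocumented tier-1 element.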
This partially explains what maps to `nih_alcohol_yn` and `nih_alcohol_frequency`. Upon further inspection, `lifetime_use_alcohol` is present only in the origcopy (it maps to `nih_lifetime_use_alcohol`), and there is also an element `alcohol_date_mdy` that is present in both the origcopy and the transformcopy. It seems that the presence of these other alcohol-related elements resulted in the inclusion of a set of alcohol-related tier-1 data elements.
For `nih_mental_health_disorder`, there doesn't seem to be anything that mapped to it. My best guess is that `cc_depression` or `cc_otherchroniccond` resulted in its inclusion, but these two have their own mappings (to `nih_depression` and `nih_otherchronic_cond`, respectively).
Something similar happens for the smoking-related data elements. `smoker_cur_stat` in the origcopy maps to `nih_cig_smoke_freq`, and the origcopy and transformcopy both contain an element `smoker_number`. It seems that the inclusion of these smoking terms resulted in the addition of the tier-1 elements `nih_smoking_yn` and `nih_history_smoking`.
Here, we have a case of multiple many-to-many mappings, where the inclusion of some smoking or alcohol data elements in the origcopy mapped to a whole set of tier-1 data elements during the harmonization process.
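A rough sketch of how such unexplained tier-1 elements could be flagged for one study is below. It rests on two assumptions that are mine, not the codebook's: tier-1 elements are recognized by an `nih_` prefix, and the codebook is represented as a dict from a source element to the set of tier-1 elements it maps to. `unexplained_tier1` is a hypothetical helper.

```python
def unexplained_tier1(orig_header, transform_header, codebook, prefix="nih_"):
    """Tier-1 elements in the transformcopy with no documented source in the origcopy."""
    # Assumption: tier-1 elements are identified by their name prefix.
    tier1 = {name for name in transform_header if name.startswith(prefix)}
    # Targets reachable from any origcopy element that has a documented mapping.
    explained = set()
    for source, targets in codebook.items():
        if source in orig_header:
            explained |= set(targets)
    # Tier-1 elements already present in the origcopy need no mapping.
    return tier1 - explained - set(orig_header)


# Toy example built from the smoking elements discussed above.
orig = {"smoker_cur_stat", "smoker_number"}
transform = {"smoker_number", "nih_cig_smoke_freq", "nih_smoking_yn", "nih_history_smoking"}
codebook = {"smoker_cur_stat": {"nih_cig_smoke_freq"}}

flagged = unexplained_tier1(orig, transform, codebook)
```

With these inputs, `flagged` holds `nih_smoking_yn` and `nih_history_smoking`, the two elements that show up without a documented source.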
While working on harmonization metrics, it has become clear that some mappings from non-harmonized data elements to Tier-1 data elements are not defined, e.g.,
Some RADx-rad data sets have a map `study_id` -> `nih_record_id`. This is not always true: some data sets do not map `study_id` to `nih_record_id`, so that may be a mistake.
There are also cases where question text (as opposed to a header ID) is mapped directly to a Tier-1 data element.
Anyway, this is an issue for logging exploration into Tier-1 data elements in harmonized transformcopy files and finding out what data elements mapped to them when there are discrepancies with the mappings defined by the global codebook.