ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

January Hackathon Milestones and Deliverables - HuSH+ Data #103

Closed KCB13 closed 6 years ago

KCB13 commented 6 years ago

Milestones:

Deliverables:

xu-hao commented 6 years ago

Just a clarification M5 (rnn) is not planned for the hackathon. I will likely be working on pitch 15 tools for subsetting (both row and column) the "wide-format" table.

karafecho commented 6 years ago

[ ] M6. Create table Tv1a, a simplified version of the full Tv1. From Tv1, select encounter dates between 01.01.2015 and 12.31.16 and the following column headers:

Notes: 'study' start date = 01.01.2014; 'study' period = 01.01.14 through 12.31.16; three rows per patient ID, one for each year in the study period; column headers defined below

- Patient ID - FOR REFERENCE AND DOWNSTREAM ANALYSIS ONLY FOR NOW

- Demographics (HuSH+ birth date, sex, race) Race = White, Black, Asian, Other/Hispanic, Other (all other categories) - INCLUDE ALL CATEGORIES FOR NOW 1 WHITE OR CAUCASIAN 2 BLACK OR AFRICAN AMERICAN 3 AMERICAN INDIAN OR ALASKA NATIVE 4 ASIAN 5 NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER 6 OTHER/HISPANIC 7 PATIENT REFUSED 8 UNKNOWN 9 OTHER RACE Sex = male, female, unknown (unknown = missing) - INCLUDE ALL CATEGORIES FOR NOW Age (calculated from birth date to January 1 of each year of study period) = 0-2, 3-17, 18-39, 40-64, 65+ yo [categories modified from Monk et al. 2005 and Fecho et al. 2008] - USE AGE IN YEARS ON FIRST DAY OF EACH YEAR OF STUDY PERIOD, DO NOT CATEGORIZE FOR NOW

- Encounter date and location (inpatient, outpatient, ED) ED/inpatient visits for respiratory conditions identified by encounter location and diagnosis at encounter (see ICD codes below for respiratory conditions); coded as #ED/inpatient visits for respiratory conditions = 0, 1, >=2 OR 0, 1, 2, >2 between 01.01.2016 and 12.31.2016 (post study start date) AND between 01.01.2015 and 12.31.2015 (pre study start date) - USE TOTAL NUMBER OF ED VISITS FOR RESPIRATORY CONDITIONS OVER EACH YEAR OF STUDY PERIOD, DO NOT CATEGORIZE FOR NOW

- Vital signs at encounter (RESP, TEMP initially; BMI, PULSE perhaps at a later date) SpO2 (hypoxemia = SpO2<90%) - USE TOTAL COUNTS OF ICD9/10 CODES FOR HYPOXEMIA OVER EACH YEAR OF STUDY PERIOD FOR NOW ICD9 | 799.02 | hypoxemia ICD10 | R09.02 | hypoxemia Body temperature (fever if >=101°F) (note that 'fever' does not have a standard definition) - USE TOTAL COUNTS OF ICD9/10 CODES FOR FEVER OVER EACH YEAR OF STUDY PERIOD FOR NOW ICD9 | 780.60 | fever, unspecified ICD10 | R50.9 | fever, unspecified

- Medications at encounter - USE TOTAL COUNTS OF PRESCRIBED OR ADMINISTERED MEDICATIONS (E.G., 'PREDNISONE') OVER EACH YEAR OF STUDY PERIOD FOR NOW **Note that medications after year 2014 are standardized to RxNorm prednisone fluticasone mometasone budesonide beclomethasone ciclesonide flunisolide albuterol metaproterenol diphenhydramine fexofenadine cetirizine ipratropium salmeterol arformoterol formoterol indacaterol theophylline omalizumab mepolizumab

- ICD codes at encounter, filtered for diagnoses of respiratory conditions, truncated for major term only - USE TOTAL COUNTS OF EACH MAJOR ICD9/10 CODE FOR RELEVANT CONDITIONS (i.e., ASTHMA, CROUP, REACTIVE AIRWAY, COUGH, PNEUMONIA) OVER EACH YEAR OF STUDY PERIOD FOR NOW ICD9 | 493.% | asthma ICD10 | J45.% | asthma ICD9 | 464.% | croup ICD10 | J05.% | croup ICD9 | 496.% | reactive airway ICD10 | J44.% | reactive airway ICD10 | J66.% | reactive airway ICD9 | 786.% | cough ICD10 | R05.% | cough ICD9 | 481.% | pneumonia ICD9 | 482.% | pneumonia ICD9 | 483.% | pneumonia ICD9 | 484.% | pneumonia ICD9 | 485.% | pneumonia ICD9 | 486.% | pneumonia ICD10 | J12.% | pneumonia ICD10 | J13.% | pneumonia ICD10 | J14.% | pneumonia ICD10 | J15.% | pneumonia ICD10 | J16.% | pneumonia ICD10 | J17.% | pneumonia ICD10 | J18.% | pneumonia

- ICD codes for obesity - USE TOTAL COUNTS OF RELEVANT ICD9/10 CODES FOR OBESITY OVER EACH YEAR OF STUDY PERIOD FOR NOW ICD9 | 278.00 | obesity ICD10 | E66.% (except E66.3) | obesity

- LOINC codes at encounter, filtered for measurements of plasma/serum IgE, circulating eosinophils, circulating neutrophils, and plasma/serum CRP; flags used to identify "high" values - USE TOTAL COUNTS OF EACH LOINC CODE OVER EACH YEAR OF STUDY PERIOD FOR NOW 33536-4 | Miscellaneous allergen IgE Ab RAST class [Presence] in Serum 13834-7 | Total IgE [Units/volume] in Serum by Radioallergosorbent test (RAST) 26449-9 | Eosinophils [#/volume] in Blood 711-2 | Eosinophils [#/volume] in Blood by Automated count 712-0 | Eosinophils [#/volume] in Blood by Manual count 26450-7 | Eosinophils/100 leukocytes in Blood 713-8 | Eosinophils/100 leukocytes in Blood by Automated count 714-6 | Eosinophils/100 leukocytes in Blood by Manual count 26499-4 | Neutrophils [#/volume] in Blood 751-8 | Neutrophils [#/volume] in Blood by Automated count 753-4 | Neutrophils [#/volume] in Blood by Manual count 26511-6 | Neutrophils/100 leukocytes in Blood 770-8 | Neutrophils/100 leukocytes in Blood by Automated count 23761-0 | Neutrophils/100 leukocytes in Blood by Manual count 1988-5 | C reactive protein [Mass/volume] in Serum or Plasma 30522-7 | C reactive protein [Mass/volume] in Serum or Plasma by High sensitivity method 11039-5 | C reactive protein [Presence] in Serum or Plasma 35648-5 | C reactive protein [Quintile] in Serum or Plasma by High sensitivity method 76485-2 | C reactive protein [Moles/volume] in Serum or Plasma 76486-0 | C reactive protein [Moles/volume] in Serum or Plasma by High sensitivity method 14634-0 | C reactive protein [Titer] in Serum or Plasma 71426-1 | C reactive protein [Mass/volume] in Blood by High sensitivity method

[ ] D3. A simplified table TV1a for use by Green Team during hackathon to begin to explore (state)trait stratification of patients with asthma-like conditions. (See Dropbox/DataTranslatorProject/Slides/TranslatorOverview_v4b 12.22.17.)

Plan is to: (1) use above 'study' design and variables/variable treatment to perform unsupervised clustering during hackathon (Kimberly/Hao) (2) develop 'rules' to perform expert-based trait stratification after hackathon (Kara); implement rules (Kimberly/Hao if implementation is in Python, Kara if implementation is in a standard statistical software package such as SPSS or SAS JMP); share rules with NIEHS (Kara) (3) generate clinically meaningful queries that better address time, such as: "what patient features (or clinical variables) 'predict' (or are associated with) the number of ED visits for respiratory conditions over the year after primary diagnosis for a respiratory condition" (Kara); implementation approach and owner TBD

krobasky commented 6 years ago

Very helpful, thanks Kara! The details in M6 were intended to fall under M3; and I had mistakenly marked this as "done" based on a previous email; I'm very glad to have the additional detail added here.

Some remaining items:

  1. LOINC CODES:

a. What to do about severely imbalanced in record-counts? Here are the tallies I found for the above LOINC codes in the Hush+v1 data. We may find more in the v2, but I would suggest a strategy for dealing with imbalances. For example, many LOINCs are missing (indicating a cel-size of <11 records), I would suggest omitting all LOINCs in the bottom quartile, but we can try as-is and see what happens, if that's preferred.

Tally Code Description 832 13834-7 Total IgE

13461 1988-5 C reactive protein [Mass/volume] in Serum or Plasma C reactive protein 729 30522-7 C reactive protein [Mass/volume] in Serum or Plasma by High sensitivity method

52228 26449-9 Eosinophils[#/volume] in blood 33729 26450-7 Eosinophils/100 leukocytes in blood 102622 711-2 Eosinophils [#/volume] in Blood by Automated count

73301 26499-4 Neutrophils [#/volume] 33840 26511-6 Neutrophils/100 leukocytes in Blood

20 714-6 Eosinophils/100 leukocytes in Blood by Manual count 121398 751-8 Eosinophils/100 leukocytes in Blood by Automated count

b. How do we harmonize units?

We need a strategy for harmonizing units. Among the LOINCS that appear in the HUSH+v1 records, the units follow:

tally code min,max,mean,3*std flags units 832 LOINC:13834-7 2.00,5000.00,219.34,1611.18 [H],[L],[A], kU/L,IU/mL,

13461 LOINC:1988-5 0.00,450.00,10.05,91.49 [H],H, MG/DL,mg/L,mg/dL,MG/L, 729 LOINC:30522-7 0.10,243.30,5.77,30.38 MG/L,

52228 LOINC:26449-9 0.00,15.90,0.20,1.04 [H],H, x10 9th/L,109/L,/µL,X10E3/uL, 33729 LOINC:26450-7 0.00,9999999.00,299.01,163347.67 [H],[L],[A], %,L, 102622 LOINC:711-2 0.00,343.00,0.18,4.70 [H],[A], 109/L,10^3u/L,x10E3/uL,K/uL,/µL,x10(3)/mcL,

73301 LOINC:26499-4 0.00,122.30,5.29,12.91 H,CL,L, x10 9th/L, 33840 LOINC:26511-6 0.00,9999999.00,359.98,163078.55 [H],[L],[A], %,H,L,

20 LOINC:714-6 0.00,18.00,3.40,12.99 [H], %, 121398 LOINC:751-8 0.00,6370.00,5.27,77.14 [L],[H],[A],L,H,CL, k/uL,10*9/L,10^3u/L,x10E3/uL,K/uL,/µL,x10 9th/L,x10(3)/mcL,H,L,X10E3/uL,

This may be helpful (from HUSH+v1): ftp://ftp.renci.org/outgoing/LOINCS-2017-12-04.tsv

  1. Vital signs How to get these from the HUSH+ data?
karafecho commented 6 years ago

Response to 1b above:

For the LOINC codes listed here, I'd consider any entry labeled as 'H' or '[H]' as above range.

Re the lack of flags for certain LOINC codes, I suspect that there isn't an accepted 'normal range' or 'normal reading' for these assays. A quick pass through ftp://ftp.renci.org/outgoing/LOINCS-2017-12-04.tsv suggests that this might be the case, but we should definitely investigate the issue further. As an FYI, a number of LOINC codes are for experimental/research assays, sometimes indicated as such, sometimes not.

Update:

Per Emily, In addition to the two points noted in my response to 1b above, another reason for missing flags is the fact that external labs do not always include flags when returning lab results to UNC, even when normal ranges and readings exist.

Also, the missing LOINC codes (for labs, tests and procedures) relate, in part, to the fact that these are being phased in (more or less) as part of the CDWH initiative. (As an FYI, LOINC codes, unlike CPT codes, cannot be used for billing and thus are not captured by EPIC.)

krobasky commented 6 years ago

Following are the medication mappings (also useful for one of the pitches):

Medication names (also listed in this thread above): /projects/datatrans/krobasky/hackathonJan2018/mednames RxNorm ids for medication names: /projects/datatrans/krobasky/hackathonJan2018/map.medname-rxnorm.tsv Gene names for the medication names: /projects/datatrans/krobasky/hackathonJan2018/map.medname-genename.tsv MDCTN codes (from hushplus) for the medication names: /projects/datatrans/krobasky/hackathonJan2018/map.medname-MDCTN.tsv

Note re provenance of original data: per Stephen Cappuzzi, http://www.dgidb.org/, no inspection or curation

xu-hao commented 6 years ago

created aggregation tool based on a spec email from @krobasky

stevencox commented 6 years ago

Hakathon over.