ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License

Documentation of Data Anomalies Related to Development of Binned Integrated Clinical Feature Tables #127

Closed karafecho closed 6 years ago

karafecho commented 6 years ago

This issue is to track progress on the use of clinical data to support our demonstration use case on asthma.

The priorities are:

(1) Evidence-based clinical regrouping (EBCR) as we've designed it (PHI->CMAQ->DOT->Census->aggregates) in the EBCR slides we prepared for Green Team.
(2) A machine learning-based approach to EBCR using HuSH+ patient data.
(3) Same as (2), but starting with a wide-table format of fully identified data on roughly 160,000 patients and completed as part of an IRB-approved research study.

Associated tasks and responsible parties are:

1a. Clinical feature tables: James/Emily to load the identified clinical feature tables onto Rockfish. This is two years' worth of data (2010, 2011) on roughly 50,000 patients for select fields/column headers only, plus PHI (geocodes). Hao/James to then integrate the data with socioenvironmental data (CMAQ output for now) for subsequent de-identification and binning of variables. REQUIRED FOR EBCR SERVICE.
1b. Approval from the CDWH Oversight Committee: Related to #1a, Emily/Ashok to present our plan for the Translator EBCR Service to the CDWH Oversight Committee for approval. REQUIRED FOR TRANSLATOR EBCR SERVICE.
TASKS #1a AND #1b SHOULD TAKE PLACE CONCURRENTLY.

2. Fully identified data on roughly 160,000 patients with an asthma-like phenotype: These data have been loaded onto Rockfish, but Hao does not have observation fact tables for about 2/3 of the patients. James/Emily/Hao to investigate/resolve. Hao then to create a large wide-format table for statistical analysis, machine learning, etc. as part of an IRB-approved research project.
3. Pulmonary function test data: "Nice-to-have" data for the hackathon. James/Emily to determine if those data are available in point 2 above. If so, alert Hao to the proper data fields; if not, please release plans for making the data available.

karafecho commented 6 years ago

Received approval to move forward with 1a from CDWH Oversight Committee on April 5, 2018.

karafecho commented 6 years ago

See #123 for resolved issue relevant to task (1a).

karafecho commented 6 years ago

New data anomaly, resolved 4/12/18:

In your case, the file name is C117R040.csv. The numbers are zero-padded to three digits. Sorry, I didn't make that clear.

Hao generated those files for me, and I have been using them thus far. Like I said, 2011 seems to give me the results I need just fine; it's 2010 that I am struggling with. I haven't tested all coordinates yet, mainly because it's very time consuming.

I don't know what those csv files are. Where did you get them from?

That is exactly the result I get. However, if you take it a step further and look in the folder /opt/RENCI/output/cmaq2010, based on those results you would expect to find a file named C117R40.csv, which does not exist. That's what I am trying to explain. Does that make sense?

When I run your first params (latlon2rowcol(35.711653, -78.81965, "2010")) I get row=40 col=117. Is that what you get? Remember you will get different results for the same points in the 2010 and 2011 data because they are different resolutions.

Basically, to isolate the issue I ran the python script independently, hardcoding the parameters required for the latlon2rowcol function. I passed in valid NC coordinates for years 2010 and 2011. When running for 2011, from what I can see, it returned results (row_nu, col_no) that correspond to an existing file in /opt/RENCI/output/cmaq2011. When I do the same for 2010, it does return what look like two valid positive integer values for row_nu, col_no, but they do not correspond to an existing file in /opt/RENCI/output/cmaq2010. If reading this explanation made you dizzy in any way, I'm happy to discuss over the phone.

James, what kind of values are returned for the example you gave? Can you also give me examples of what worked? Did any of the 2010 values work for you? If not how about an example of 2011 that worked for you?

I have been using the python script you provided me with for mapping lat long coordinates to the corresponding cmaq data file (/opt/RENCI/output/cmaq20*/CR*.csv). Seems to work fine for mapping coordinates for 2011 but am not having any luck with 2010. Here are some examples of parameters used to generate column and row values that did not map to a file:

latlon2rowcol(35.711653, -78.81965, "2010")
latlon2rowcol(35.7324013686701, -78.536003683545, "2010")
latlon2rowcol(36.023611, -79.366835, "2010")
latlon2rowcol(36.0917376315789, -79.1110542368421, "2010")
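
The file-name mismatch resolved in this thread can be illustrated with a minimal Python sketch. It assumes only what the thread states: the files live under /opt/RENCI/output/cmaq<year> and are named C<col>R<row>.csv, with both numbers zero-padded to three digits. The helper names `cmaq_csv_name` and `cmaq_csv_path` are hypothetical and are not part of the script Hao provided.

```python
# Base directory taken from the thread; adjust if the layout differs.
CMAQ_OUTPUT_DIR = "/opt/RENCI/output"


def cmaq_csv_name(col, row):
    """Build the CSV file name with col and row zero-padded to three digits."""
    return f"C{col:03d}R{row:03d}.csv"


def cmaq_csv_path(col, row, year):
    """Prefix the padded file name with the year-specific folder (e.g. cmaq2010)."""
    return f"{CMAQ_OUTPUT_DIR}/cmaq{year}/{cmaq_csv_name(col, row)}"


# row=40, col=117 is the result reported for the first 2010 example.
print(cmaq_csv_name(117, 40))          # C117R040.csv
print(cmaq_csv_path(117, 40, "2010"))  # /opt/RENCI/output/cmaq2010/C117R040.csv
```

Padding only changes the name when a row or column number is below 100, so a lookup that skips the padding can still succeed for some coordinates and fail for others.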

karafecho commented 6 years ago

@jameschump @empfff : Please save a copy of the fully identified, integrated clinical feature table on Rockfish before you de-identify the data, just so we can add new clinical features that require geocodes/dates for integration as they become available. Emily and I discussed this earlier today, so please reach out to her for explanation (and I suspect she already contacted you). Thank you!

karafecho commented 6 years ago

@xu-hao @jameschump @stevencox : Hao, when you get a chance, could you please send me the PM2.5 and ozone variables that you retrieved from the CMAQ output in order to generate the files that James is using for integration with the binned clinical feature vectors? Thanks!

karafecho commented 6 years ago

New data anomaly:

Geocodes (lat/lon values) for certain patients cannot be mapped to CMAQ output, presumably because those patients have their primary residence listed as Alaska or another remote location that either was not incorporated into the CMAQ model (continental US only?) or did not have any sensors from which to derive exposure data.
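
One way to flag such patients before integration is a coarse bounding-box check on the geocodes. The sketch below is only illustrative: the box limits are rough assumptions for the contiguous US, not the actual CMAQ domain (which is defined on a projected grid), and `in_rough_conus_box` is a hypothetical helper.

```python
# Rough lat/lon limits for the contiguous US. These are illustrative
# assumptions, NOT the real CMAQ domain boundaries.
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0


def in_rough_conus_box(lat, lon):
    """Return True if the point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


geocodes = {
    "example from the thread (NC)": (35.711653, -78.81965),
    "Anchorage, AK (approximate)": (61.2, -149.9),
}
for label, (lat, lon) in geocodes.items():
    status = "inside" if in_rough_conus_box(lat, lon) else "outside"
    print(f"{label}: {status} the rough box")
```

A check like this only separates obviously out-of-domain geocodes from the rest; points near the edge of the box would still need to be tested against the real grid.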

karafecho commented 6 years ago

New data anomaly:

No patients were prescribed/administered mepolizumab over the study period (2010-2011) because the drug did not receive US FDA approval until 2015.

arunacs commented 6 years ago

So, is this a new patient dataset? Earlier, we were working with only those in the NC Triangle 3-county region.

Also, in the original complete patient dataset, we did see that there were several patients in far-flung regions of the country, and some even in Hispaniola and Hawaii, if I remember right. Ray created a heatmap early on to help us visualize and understand the spatial coverage.

And yes, the CMAQ domain covers the continental U.S. and portions of Canada and Mexico.

karafecho commented 6 years ago

@jameschump: See Sarav's comment re CMAQ coverage.

@arunacs, @stevencox: I think you are referring to the exploratory work that Kimberly and Hao did with patients in the NC Triangle region. The binned integrated clinical feature tables will be developed using all patients in the CDWH with an asthma-like phenotype. Initially, we are restricting the sampling time frame to 2010 and 2011 because these are the years for which we have CMAQ data, but the goal is to expand the dataset to include multiple years. For instance, we now have socioenvironmental data (US Census ACS data) for the five-year survey sample of 2012-2016, so it makes sense to expand the service to include more recent years.