ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License

Develop ICEES Binned Integrated Clinical Feature Tables #122

Open · stevencox opened this issue 6 years ago

stevencox commented 6 years ago

Develop the tools to join clinical data, chemical exposures data, and socioenvironmental exposures data, and the aggregation pipeline to create the dataset underlying the EBCR service, in support of the overall clinical feature vector hackathon goal.

@karafecho @lstillwe @cbizon

karafecho commented 6 years ago

Proposed architecture/approach to be shared with CDWH Oversight Committee. Will seek approval to move forward with action plan.

karafecho commented 6 years ago

Clinical data needs are as follows (in descending priority):

  1a. Clinical feature tables: James/Emily to load the identified clinical feature tables onto Rockfish. This is two years' worth of data (2010, 2011) on roughly 50,000 patients, for select fields/column headers only, plus PHI (geocodes). Hao/James to then integrate the data with socioenvironmental data (CMAQ output for now) for subsequent de-identification and binning of variables. REQUIRED FOR EBCR SERVICE.
  1b. Approval from the CDWH Oversight Committee: Related to 1a, Emily/Ashok to present our plan for the Translator EBCR Service to the CDWH Oversight Committee for approval. REQUIRED FOR TRANSLATOR EBCR SERVICE. TASKS 1a AND 1b SHOULD TAKE PLACE CONCURRENTLY.
  2. Fully identified data on roughly 160,000 patients with an asthma-like phenotype: These data have been loaded onto Rockfish, but Hao does not have observation fact tables for about 2/3 of the patients. James/Emily/Hao to investigate/resolve. Hao then to create a large wide-format table for statistical analysis, machine learning, etc. as part of an IRB-approved research project. This task is not relevant to this issue.
  3. Pulmonary function test data: "Nice-to-have" data for the hackathon. James/Emily to determine if those data are available in item 2 above. If so, please alert Hao to the proper data fields; if not, please release plans for making the data available. As with task 2, this task is not relevant to this issue.

@jameschump @xu-hao

karafecho commented 6 years ago

Current CMAQ data needs:

  1. Sarav to provide expanded ontology/chemical names for the union of the two years of data. Variables common to both years can be found here: CMAQ_Species_Defn_CheBI_Links.xlsx.
  2. Sarav to explore why data points are missing for at least several days in the two-year dataset.

@arunacs @lstillwe @xu-hao

karafecho commented 6 years ago

@stevencox

stevencox commented 6 years ago

@xu-hao @karafecho @lstillwe - issue #121 depends on this issue, so this one needs an earlier due date. Toward that end, I'd like us to review development incrementally. Please help me map that out:

We're integrating:

I'd like to see a demonstration of

Are those dates workable?

lstillwe commented 6 years ago

Not sure about the Census data - I have not heard where that data is coming from. Maybe @karafecho can comment. We are going to talk about which road data to use on 4/4/18.

stevencox commented 6 years ago

@lstillwe it would be good to have a look at Hao's code with a view to describing the data format you need from both ROW and Census data to our KFBS collaborators, though you're probably already doing that.

xu-hao commented 6 years ago

As long as the data is available, integration of each data source takes at most 1 week. The format I would like for the data is CSV with a header.

karafecho commented 6 years ago

@stevencox @lstillwe @xu-hao @arunacs: In preparation for our meeting on 4/4 at 2:30 pm (Dogwood), I propose the following agenda:

  1. Sarav to provide an overview of the HPMS data;
  2. Sarav to provide an update on the status of the environmental ontology/chemical names;
  3. Sarav to provide an update on the missing CMAQ output from 2010, 2011 (see issues in #123);
  4. Lisa and Kara to provide an update on the US DOT, US Census, and Cedar Grove ROW data;
  5. General discussion on IE's proposal to expand the Exposures API/Service to include nationwide CMAQ outputs over a longer time series, with additional years and/or near-road exposures for specific additional states/metro areas;
  6. General discussion on access to additional CMAQ output (beyond PM2.5 and ozone) via Rockfish; and
  7. General discussion on the plan of action (including timeline) for moving forward with both IE and Cedar Grove (related to items 4 and 5), especially as the plans relate to our goals for the May hackathon, which we should define and agree to.

karafecho commented 6 years ago

@stevencox @lstillwe @xu-hao @arunacs: Correction: meeting date is 4/11 at 2:30 pm (Dogwood). Regardless, if @stevencox, @lstillwe , and @xu-hao are free to meet tomorrow, 4/4 at 2:30 (Dogwood), I think that would be a good idea.

xu-hao commented 6 years ago

I met with @lstillwe this afternoon. I will be available tomorrow.

karafecho commented 6 years ago

Received approval from CDWH Oversight Committee on April 5, 2018.

karafecho commented 6 years ago

@jameschump @empfff : Please save a copy of the fully identified, integrated clinical feature table on Rockfish before you de-identify the data, just so we can add new clinical features that require geocodes/dates for integration as they become available. Emily and I discussed this earlier today, so please reach out to her for explanation (and I suspect she already contacted you). Thank you!

karafecho commented 6 years ago

DDCR Service was approved by the CDWH Oversight Committee on 4/5/18.

Comments forwarded by Emily on 4/11/18:

"The Committee unanimously approves this request with the following contingencies:

Additional comments from Emily:

Note that the # of queries/IP in a time period refers to putting rules in place on the server that don't allow a requesting system to hit the dataset, say, 1,000,000 times per second (which would suggest a bot, of course). That's what they meant there.
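To make that kind of rule concrete: a minimal per-IP token-bucket sketch in Python. The rate, burst size, and function name here are hypothetical illustrations, not the deployed DDCR configuration.

```python
import time
from collections import defaultdict

RATE = 10.0   # hypothetical steady-state requests/second allowed per IP
BURST = 20.0  # hypothetical burst allowance (bucket capacity)

# One token bucket per requesting IP address.
_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(ip: str) -> bool:
    """Return True if this IP may issue a request right now."""
    bucket = _buckets[ip]
    now = time.monotonic()
    # Refill tokens in proportion to elapsed time, capped at the burst size.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # the caller would respond with HTTP 429 Too Many Requests
```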

As regards the “click-through text”, they understand that that feature is not implemented yet, as there is no UI for the system. They are asking to be able to review the text once a UI is in place.

karafecho commented 6 years ago

@stevencox: Please see above contingencies placed on the DDCR Service by the CDWH Oversight Committee.

karafecho commented 6 years ago

4/21/18: Steve Appold delivered statewide ACS data.

4/26/2018: Kara developed a binning strategy, renamed column headers, and updated the template spreadsheets for the binned clinical feature tables. Also distributed the revised templates to Hao, Lisa, James, Steve, Emily, and Ashok.

Notes:

  1. NC statewide American Community Survey (ACS) data (nationwide data may be available before the hackathon);
  2. An updated patient-level template for the binned clinical feature tables (includes the variables in [1] and a feature variable for roadway exposure); and
  3. An updated visit-level template for the binned clinical feature tables (includes the variables in [1] and a feature variable for roadway exposure).

karafecho commented 6 years ago

Next steps:

  1. Lisa will develop a function that will return a US Census GEOID given a latitude/longitude value (modified per @lstillwe; see the sketch after this list);
  2. Hao will develop code to integrate the ACS and US DOT (roadway) data with the patient- and visit-level binned clinical feature tables;
  3. James will perform the actual integration and troubleshoot (after this step, the fully integrated, binned clinical feature tables will be complete); and
  4. Development work on the DDCR API can begin. (Or, should this start now @stevencox?)
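For illustration, a minimal sketch of the lat/lon-to-GEOID lookup in Python, assuming geopandas and a TIGER/Line census block group shapefile. The file name is hypothetical, and Lisa's actual implementation is the Spark code in the datatrans repo linked later in this thread.

```python
from typing import Optional

import geopandas as gpd
from shapely.geometry import Point

# Hypothetical shapefile: TIGER/Line census block groups for North Carolina.
block_groups = gpd.read_file("tl_2016_37_bg.shp").to_crs(epsg=4326)

def latlon_to_geoid(lat: float, lon: float) -> Optional[str]:
    """Return the GEOID of the census block group containing the point."""
    hits = block_groups[block_groups.contains(Point(lon, lat))]
    return hits.iloc[0]["GEOID"] if not hits.empty else None
```

For bulk lookups over roughly 50,000 patients, a single spatial join (gpd.sjoin) over all points at once would be far faster than per-point containment tests.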

karafecho commented 6 years ago

Next steps (round 2):

  1. Lisa is developing code that will return a US Census GEOID given a latitude/longitude value (ETA 5/2);
  2. Hao is developing code to integrate the ACS and US DOT (roadway) data with the patient- and visit-level binned clinical feature tables that James/Emily have created (meeting to review data on 5/2);
  3. James will perform the actual integration of the integrated ACS/USDOT/CMAQ/CDWH data and troubleshoot (CMAQ/CDWH data integration is complete; after the final integration step, the fully integrated, binned clinical feature tables will be complete);
  4. Hao has begun development work on the DDCR Service API; and
  5. Chris Rutledge is working with Steve and Hao to address the Oversight Committee requirement of encryption at rest, as well as the suggestion to block service requests above a rate of, say, 1,000,000 requests per second per IP address.

karafecho commented 6 years ago

WRT (1), (2), (3) above:

The integration steps should be relatively straightforward: (a) given a lat/lon, find the GEOID; (b) given a GEOID, find the corresponding ACS variable values; (c) add the ACS variable values to each row of the binned clinical feature tables. (A sketch follows below.)

Unlike the CMAQ data, time is irrelevant for the current ACS data pull (we have data from the 2012-2016 survey sampling period, with one value per variable per lat/lon). Same with the US DOT data pull (we have data from the most recent year available, which I believe is 2017, true @lstillwe?).
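A minimal pandas sketch of steps (a)-(c); the file and column names are hypothetical, latlon_to_geoid is the lookup sketched earlier in this thread, and the real integration is Hao's code on Rockfish.

```python
import pandas as pd

clinical = pd.read_csv("patient_level_binned.csv")  # hypothetical; has Latitude/Longitude
acs = pd.read_csv("acs.csv")                        # hypothetical; one row of ACS values per GEOID

# (a) lat/lon -> GEOID, reusing latlon_to_geoid from the earlier sketch.
clinical["GEOID"] = [
    latlon_to_geoid(lat, lon)
    for lat, lon in zip(clinical["Latitude"], clinical["Longitude"])
]

# (b) and (c): GEOID -> ACS variable values, appended to every clinical row.
# Because the ACS pull has one value per variable per GEOID (no time axis),
# this is a plain left join, unlike the date-matched CMAQ integration.
integrated = clinical.merge(acs, on="GEOID", how="left")
integrated.to_csv("patient_level_binned_acs.csv", index=False)
```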

karafecho commented 6 years ago

Update on next steps:

  1. Lisa is developing code that will return a US Census GEOID given a latitude/longitude value (ETA 5/2) - STATUS UPDATE IS REQUESTED
  2. Hao is developing code to integrate the ACS and ROADWAY data with the patient- and visit-level binned clinical feature tables that James/Emily have created - CODE FOR ROADWAY/CLINICAL INTEGRATION IS COMPLETE, STATUS UPDATE ON CODE FOR ACS/CLINICAL INTEGRATION IS REQUESTED
  3. James will confirm whether the patient_num variable is the correct one to use for integration of the ACS and Roadway data with the patient- and visit-level binned clinical features table - STATUS UPDATE IS REQUESTED
  4. James will perform the actual integration of the integrated ACS/ROADWAY/CMAQ/CDWH data and troubleshoot (CMAQ/CDWH data integration is complete; after the final integration step, the fully integrated, binned clinical feature tables will be complete) - STATUS UPDATE IS REQUESTED
  5. Hao has begun development work on the DDCR Service API - DEVELOPMENT HAS BEEN INITIATED, STATUS UPDATE IS REQUESTED
  6. Chris Rutledge is working with Steve and Hao to address the Oversight Committee requirement of encryption at rest/in transit (SSL), as well as the suggestion to block service requests of greater than 10 per second per IP address - DONE

@lstillwe @xu-hao @jameschump @stevencox @empfff @cbizon : Please confirm that the update above is accurate. Please also reply to any requests for status updates. Thanks!

lstillwe commented 6 years ago

Task (1) has been completed as of 5/2/18.

karafecho commented 6 years ago

Task (2) is complete. Task (3) is awaiting final confirmation from James that 'patient_num' is correct. Task (4) will be initiated the week of 5/7/2018; ETA is 5/11/2018. Task (5): initial development of the API is complete and awaiting the final integrated clinical feature tables. Data/server security issues were addressed by Chris Rutledge; Hao will work with Chris R. and other RCIADMIN staff to install a relational database (PostgreSQL, for example) securely on the encrypted partition of the server and create a service account on it for use by the DDCR API.

karafecho commented 6 years ago

Update 5/09/2018:

  1. The 'patient_num' variable is not the correct one. The correct 'patient_sk' variable and associated lat/long values can be found in the two patient-level tables that Emily generated for James. James will send Hao the correct path to the files on Rockfish where the 'patient_sk' variable and associated lat/long values can be found. Hao will generate new ACS and ROADWAY files for James, but should first see point (2) below.
     - UPDATE FROM JAMES: /home/champioj/PatientLvl_2010.txt and /home/champioj/PatientLvl_2011.txt
     - SECOND UPDATE FROM JAMES: The following files were moved to a directory that Hao has permission to access: /opt/RENCI/patient_sk. The new files, including those from (4) below, are: PatientLvl_2010.txt, PatientLvl_2011.txt, VisitLvl_2010.txt, VisitLvl_2010_wPtSK.txt, VisitLvl_2011.txt, VisitLvl_2011_wPtSK.txt
     - UPDATE FROM HAO: New ACS and ROADWAY files with the correct 'patient_sk' identifier can be found on Rockfish here: /opt/RENCI/output/acs_psk.csv and /opt/RENCI/output/nearestroad_psk.csv (a sketch of joining these files on 'patient_sk' appears after this list).
  2. The ACS data that were returned to James are corrupt. I believe something went wrong when the original Excel file was converted to CSV in Linux. I've attached the original Excel file that I received from Steve Appold and a new CSV file that I exported directly from Excel. Hao will use the new CSV file to accomplish (1) above, but will first confirm that the file is not corrupt.
     - UPDATE FROM JAMES: The corrupt file I spoke to Kara about was the ACS file she emailed me as an attachment, not the one you sent me the path to on Rockfish, located at /opt/RENCI/output/acs.csv
  3. To ensure that everyone is on the same page, Lisa will point us to the code that she generated to convert lat/long values to GEOID values and send the path to the ROADWAY data on Rockfish.
     - UPDATE FROM LISA: Here is the location of the code for converting lat/lon to GEOID using census block groups: https://github.com/lstillwe/datatrans/tree/master/spark/src/main/java/datatrans. I do not know the location of the roads data on Rockfish, since I was never able to get access to Rockfish.
     - UPDATE FROM HAO: ROADWAY data can be found here: /opt/RENCI/tl_2015_allstates_prisecroads_lcc
  4. James will work with Emily to add 'patient_sk' identifiers and associated lat/long values to the visit identifiers in the visit-level tables that Emily generated. These will be the same 'patient_sk' identifiers and associated lat/long values as in the patient-level tables, so Hao will not need to generate new ACS and ROADWAY files in order to integrate the ACS and ROADWAY data with the visit-level tables.
     - UPDATE FROM JAMES: Emily generated new visit-level tables that include the correct 'patient_sk' identifiers and associated lat/long values
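To make the 'patient_sk' join concrete, a minimal pandas sketch using the file paths above; the tab delimiter and the assumption that the regenerated ACS/ROADWAY files carry a 'patient_sk' column are guesses, not confirmed details.

```python
import pandas as pd

# Patient-level clinical table (tab-delimited is an assumption based on the .txt extension).
patients = pd.read_csv("/opt/RENCI/patient_sk/PatientLvl_2010.txt", sep="\t")

# ACS and ROADWAY files regenerated by Hao, assumed keyed on patient_sk.
acs = pd.read_csv("/opt/RENCI/output/acs_psk.csv")
roads = pd.read_csv("/opt/RENCI/output/nearestroad_psk.csv")

# Left joins keep every patient row even when an exposure lookup failed,
# which makes missing exposures easy to spot afterwards.
integrated = (
    patients
    .merge(acs, on="patient_sk", how="left")
    .merge(roads, on="patient_sk", how="left")
)
print(integrated.isna().sum())  # quick check for rows that failed to match
```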

karafecho commented 6 years ago

Complete 05.16.18.

Integrated clinical feature tables can be found on Rockfish here:

/opt/RENCI/output/ClinicalFeatureVectors/1_0_0/PatientLevel/
/opt/RENCI/output/ClinicalFeatureVectors/1_0_0/VisitLevel/

karafecho commented 6 years ago

Moved to ebcr0.edc.renci.org on 05.17.18.

karafecho commented 6 years ago

Public access: ddcr.renci.org

karafecho commented 6 years ago

Investigating an issue related to TotalEDInpatientVisits. James will double-check the code used to create the integrated files and verify that all column headers are accurate.
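One quick way to catch header problems like this is to assert the expected column set on load; a minimal sketch with a hypothetical file name and an illustrative subset of expected columns (the real list comes from Kara's templates):

```python
import pandas as pd

# Illustrative subset of expected binned clinical feature columns.
EXPECTED = {"patient_sk", "TotalEDInpatientVisits", "pm25_daily_average", "GEOID"}

table = pd.read_csv("patient_level_binned.csv")  # hypothetical file name
missing = EXPECTED - set(table.columns)
if missing:
    raise ValueError(f"Integrated table is missing columns: {sorted(missing)}")
```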

karafecho commented 6 years ago

6/11/18: James fixed patient-level tables and sent them to Hao. Investigating missing data for visit-level tables.

karafecho commented 6 years ago

6/15/2018: James generated new tables, double-checked column headers, conducted unit tests, and sent the tables to Hao. Hao will be loading the new tables to allow access through the API.

@jameschump @xu-hao @empfff @stevencox Given that we now have data on roadway exposures (2017 data) and a variety of socioenvironmental exposures (2012-2016, which is the most recent ACS sampling period), we probably should discuss plans for temporally expanding the integrated clinical feature tables beyond 2010 and 2011. This will require a new data pull for patients/visits in 2012-present.

empfff commented 6 years ago

We can do that, but I unfortunately will have to recode the whole thing b/c we'll be moving from legacy data to Epic data. I don't mind doing it--but what's the timeline for needing the new files?

xu-hao commented 6 years ago

Dependencies:

karafecho commented 6 years ago

@empfff : Let's plan to move forward with this. Does Friday, 7/13 seem like a reasonable ETA?

@jameschump : Did you have a chance to troubleshoot the average daily PM2.5 and ozone exposure estimates?

@stevencox @xu-hao @empfff @jameschump : I've started exploring the data and will let you all know if I identify any additional anomalies. I am hoping to solicit some help with this and have reached out to Ashok to see if his summer student might be available to work on this project. If any of you feel the need to meet, please let me know.

karafecho commented 6 years ago

@empfff : WRT a new data pull, we should consider adding select measures of pulmonary function (i.e., the ones that I highlighted in the doc you sent me a while back), as well as select feature variables of relevance to Fanconi anemia (i.e., metformin, WBC, HCT, PLT). The latter were identified during a discussion I had with Maureen Hoatlin, Miriam Udler, and Tyler Peryea during the May hackathon. Happy to discuss, if that would be helpful.

empfff commented 6 years ago

@karafecho when you say move forward with "this", can you clarify what the "this" is? Is it a new data pull? If it's a new data pull, I can't do it by 7/13 - I need more time than that. I have to re-code the whole DDCR logic for post-Epic data. Good news, though: PFTs are now in i2b2, so I should be able to add those to the data pull. I can also pull the labs you mentioned - is that list everything, or are there additional ones?

karafecho commented 6 years ago

@empfff : No worries! Perhaps you can provide a more realistic ETA? Good news re PFTs. WRT the other labs, I think the list above, including the PFTs, is sufficient. For labs, is it possible to pull flags (e.g., abnormal)?

empfff commented 6 years ago

@karafecho let's say 7/27, but I will try to beat that date. Can you let me know exactly which years you want to see?

Also, for lab variables--how do you want to handle that in DDCR format? I'm assuming they'll go in the visit-level data only (doesn't make sense at the patient level)--but, there may be more than one instance of a given lab for a single visit. Do you want the "worst" value?

karafecho commented 5 years ago

@empfff : For years, let's go with 2012, 2013, 2015, and 2016 for now, as we discussed during Monday's scrum call. If you have time to submit a modification so that we can pull data from 2017, great, but if not, no worries. For lab values, I could frame an argument for either the "worst" value or the most recent one. Perhaps we could capture both? For the visit-level tables, this should be pretty straightforward. For the patient-level tables, this may be more difficult but is of interest to Maureen Hoatlin and Miriam Udler. Perhaps we can capture the "worst" lab and the most recent one per patient per year? For all labs, is it possible to capture "flags"?

Please push back if any of my suggestions complicate things too much on your end.
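A sketch of how the "worst" and most recent lab values per patient per year described above could be derived, assuming a long-format lab extract. The column names are hypothetical, and ranking severity by the abnormal flag is only one possible definition of "worst", which is ultimately a clinical decision.

```python
import pandas as pd

# Hypothetical long-format extract: one row per lab result, with columns
# patient_sk, lab_name, result_date, value, abnormal_flag.
labs = pd.read_csv("labs.csv", parse_dates=["result_date"])
labs["year"] = labs["result_date"].dt.year

# Most recent result per patient, lab, and year.
most_recent = (
    labs.sort_values("result_date")
        .groupby(["patient_sk", "lab_name", "year"])
        .tail(1)
)

# "Worst" result per patient, lab, and year, ranked by flag severity.
severity = {"normal": 0, "abnormal": 1, "critical": 2}  # illustrative ordering
worst = (
    labs.assign(rank=labs["abnormal_flag"].map(severity))
        .sort_values("rank")
        .groupby(["patient_sk", "lab_name", "year"])
        .tail(1)
)
```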

karafecho commented 5 years ago

7/6/2018: James corrected the anomaly with the PM2.5 values in the patient-level tables. Hao updated the API.

karafecho commented 5 years ago

@empfff @jameschump : I thought I'd send you a quick note re additional years of patient data. As we discussed during Monday's scrum call, the CMAQ data are available from 2010-2014. 2015 CMAQ data likely will be available by the end of the year. Beyond that, IE is uncertain whether they will be able to obtain additional CMAQ estimates. So, the integration will be with the ACS data and the roadway data. I think that's fine for now.

empfff commented 5 years ago

so, @jameschump is working on linking the 2015 files to ACS and roadway... but do you want me to then nix 2016, and instead pull 2012 and 2013? @karafecho

karafecho commented 5 years ago

@empfff : The 2016 files remain of interest, but the 2012 and 2013 files are probably higher priority, given that we have CMAQ estimates for those years. So, yes, please move forward with your plan and pull the 2012 and 2013 data.

karafecho commented 5 years ago

Received additional approval from CDWH Oversight Committee on 9/6/18 to: (1) work with UNC's IRB and the CDWH Operations Committee on any plans to develop new ICEES cohorts; (2) disseminate ICEES beyond the Translator program; and (3) allow a variety of programming techniques to query the ICEES API.

karafecho commented 5 years ago

@jameschump @xu-hao @empfff : I'm wondering where things stand with the above, in terms of integration of the new patient files with the other data sources, binning of variables, and subsequent update of the ICEES API.

jameschump commented 5 years ago

@karafecho @empfff @xu-hao, I am only seeing CMAQ data for 2010 and 2011 on Rockfish, so I can't move forward until 2012 and 2013 are available.

lstillwe commented 5 years ago

Sorry, guys - this is my fault. Kara asked me to get the CMAQ data to you. Where can I put it so you have access to it?

Sarav originally put it on Longleaf - can you access it there?


xu-hao commented 5 years ago

@lstillwe if you don't have access to Rockfish, I can help you transfer the files. Could you post the format information so James knows how to work with the new files?

jameschump commented 5 years ago

> @lstillwe if you don't have access to Rockfish, I can help you transfer the files. Could you post the format information so James knows how to work with the new files?

The format for the new files should be the same as for 2010 and 2011, right? I shouldn't have to change anything in order to run them?

lstillwe commented 5 years ago

Thanks Hao!

Here is the current location and format information:

[lisa@iren2 new_cmaq_data]$ pwd
/projects/datatrans/new_cmaq_data

[lisa@iren2 new_cmaq_data]$ ls merged*
merged_cmaq_2010.csv  merged_cmaq_2011.csv  merged_cmaq_2012.csv  merged_cmaq_2013.csv  merged_cmaq_2014.csv

[lisa@iren2 new_cmaq_data]$ more merged_cmaq_2014.csv
Date,FIPS,Longitude,Latitude,pm25_daily_average,pm25_daily_average_stderr,ozone_daily_8hour_maximum,ozone_daily_8hour_maximum_stderr
2014/01/01,1001020100,-86.49001,32.47718,21.444000000000003,21.3516,16.335,5.0983

Lisa
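A minimal sketch of loading one of these merged CMAQ files in the format above, assuming pandas and the Rockfish path Hao posts below:

```python
import pandas as pd

cmaq = pd.read_csv(
    "/opt/RENCI/merged_cmaq_2014.csv",
    dtype={"FIPS": str},    # read FIPS as text so leading zeros are not lost
    parse_dates=["Date"],   # "2014/01/01" parses cleanly
)
print(cmaq[["Date", "FIPS", "pm25_daily_average", "ozone_daily_8hour_maximum"]].head())
```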


xu-hao commented 5 years ago

@jameschump the files are uploaded to:

/opt/RENCI/merged_cmaq_201*

Please see Lisa's comment above for the format information.

xu-hao commented 5 years ago

ETA for integrating the FHIR data: two weeks from now (Oct 15).