stevencox opened this issue 6 years ago
Proposed architecture/approach to be shared with CDWH Oversight Committee. Will seek approval to move forward with action plan.
Clinical data needs are as follows (in descending priority):
1a. Clinical feature tables: James/Emily to load the identified clinical feature tables onto Rockfish. This is two years' worth of data (2010, 2011) on roughly 50,000 patients, for select fields/column headers only, plus PHI (geocodes). Hao/James to then integrate the data with socioenvironmental data (CMAQ output for now) for subsequent de-identification and binning of variables. REQUIRED FOR EBCR SERVICE.
1b. Approval from the CDWH Oversight Committee: Related to #1a, Emily/Ashok to present our plan for the Translator EBCR Service to the CDWH Oversight Committee for approval. REQUIRED FOR TRANSLATOR EBCR SERVICE.
TASKS #1a AND #1b SHOULD TAKE PLACE CONCURRENTLY.
@jameschump @xu-hao
Current CMAQ data needs:
@arunacs @lstillwe @xu-hao
@stevencox
@xu-hao @karafecho @lstillwe - issue #121 depends on this issue, so this one needs an earlier due date. Towards that end,
I'd like us to review development incrementally. Please help me map that out:
We're integrating:
I'd like to see a demonstration of
Are those dates workable?
Not sure about the Census data - I have not heard where that data is coming from. Maybe @karafecho can comment. We're going to discuss which road data to use on 4/4/18.
@lstillwe it would be good to have a look at Hao's code with a view to describing the data format you need from both ROW and Census data to our KFBS collaborators, though you're probably already doing that.
As long as the data is available, integrating each source takes at most one week. The format I would like for the data is CSV with a header row.
@stevencox @lstillwe @xu-hao @arunacs: In preparation for our meeting on 4/4 at 2:30 pm (Dogwood), I propose the following agenda:
(1) Sarav to provide an overview of the HPMS data;
(2) Sarav to provide an update on the status of the environmental ontology/chemical names;
(3) Sarav to provide an update on the missing CMAQ output from 2010, 2011 (see issues in #123);
(4) Lisa and Kara to provide an update on the US DOT, US Census, and Cedar Grove ROW data;
(5) General discussion on IE's proposal to expand the Exposures API/Service to include nationwide CMAQ outputs over a longer-term time-series, to include additional years and/or near-road exposures for specific additional states/metro areas;
(6) General discussion on access to additional CMAQ output (beyond PM2.5 and ozone) via Rockfish; and
(7) General discussion on a plan of action (including timeline) for moving forward with both IE and Cedar Grove (this relates to items #4 and #5), especially as the plans relate to our goals for the May hackathon, which we should define and agree to.
@stevencox @lstillwe @xu-hao @arunacs: Correction: meeting date is 4/11 at 2:30 pm (Dogwood). Regardless, if @stevencox, @lstillwe , and @xu-hao are free to meet tomorrow, 4/4 at 2:30 (Dogwood), I think that would be a good idea.
I met with @lstillwe this afternoon. I will be available tomorrow.
Received approval from CDWH Oversight Committee on April 5, 2018.
@jameschump @empfff : Please save a copy of the fully identified, integrated clinical feature table on Rockfish before you de-identify the data, just so we can add new clinical features that require geocodes/dates for integration as they become available. Emily and I discussed this earlier today, so please reach out to her for explanation (and I suspect she already contacted you). Thank you!
DDCR Service was approved by the CDWH Oversight Committee on 4/5/18.
Comments forwarded by Emily on 4/11/18:
"The Committee unanimously approves this request with the following contingencies:
The Committee will require the table be encrypted at rest.
As this dataset becomes more available for the public, this Committee would like to be updated. Of note is the suggestion to limit the number of queries per IP in a specified time period.
The Committee would like to review a copy of the click-through text users will see.
The Committee would like to be asked about any new data domains added to the set."
Additional comments from Emily:
Note that the # of queries/IP in a time period refers to putting rules on the server that don't allow a requesting system to hit the dataset, say, 1,000,000 times per second (which would suggest a bot, of course). That's what they meant there.
As regards the “click-through text”, they understand that that feature is not implemented yet, as there is no UI for the system. They are asking to be able to review the text once a UI is in place.
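The per-IP throttling Emily describes could be sketched as a simple fixed-window counter on the server. This is purely illustrative (the `RateLimiter` name, limits, and window are invented here, not the deployed implementation):

```python
import time
from collections import defaultdict

class RateLimiter:
    """Fixed-window per-IP request counter (illustrative sketch only)."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.counts = defaultdict(int)        # ip -> requests in current window
        self.window_start = defaultdict(float)  # ip -> start of current window

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        # Start a fresh window once the old one has elapsed
        if now - self.window_start[ip] >= self.window:
            self.window_start[ip] = now
            self.counts[ip] = 0
        if self.counts[ip] < self.max_requests:
            self.counts[ip] += 1
            return True
        return False

limiter = RateLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow("10.0.0.1", now=100.0) for _ in range(4)]
print(results)  # first 3 requests allowed, 4th rejected within the window
```

In practice this would sit in front of the DDCR API (e.g. in the web server or a middleware layer) rather than in application code.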
@stevencox: Please see above contingencies placed on the DDCR Service by the CDWH Oversight Committee.
4/21/18: Steve Appold delivered statewide ACS data.
4/26/2018: Kara developed a binning strategy, renamed column headers, and updated the template spreadsheets for the binned clinical feature tables. Also distributed the revised templates to Hao, Lisa, James, Steve, Emily, and Ashok.
Notes:
Next steps:
Next steps (continued):
WRT (1), (2), (3) above:
The integration steps should be relatively straightforward: (a) given a lat/lon, find the GEOID; (b) given a GEOID, find the corresponding ACS variable values; (c) add ACS variable values to each row of the binned clinical feature tables.
Unlike the CMAQ data, time is irrelevant for the current ACS data pull (we have data from the 2012-2016 survey sampling period, with one value per variable per lat/lon). Same with the US DOT data pull (we have data from the most recent year available, which I believe is 2017, true @lstillwe?).
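Steps (a)-(c) above can be sketched roughly as follows. Note the lat/lon-to-GEOID step is stubbed as a plain lookup table here (in reality it is a spatial point-in-tract query against Census geography), and all file/column names and values are hypothetical:

```python
import csv, io

# (a) lat/lon -> GEOID: in practice a spatial point-in-tract lookup
#     against Census tract polygons; stubbed here as a lookup table.
geoid_lookup = {
    (35.9049, -79.0469): "37135010700",  # hypothetical tract GEOID
}

# (b) GEOID -> ACS variable values (variable names are illustrative)
acs = {
    "37135010700": {"median_income": "61000", "pct_no_vehicle": "4.2"},
}

# (c) append the ACS values to each row of a binned clinical feature table
clinical_csv = io.StringIO(
    "patient_num,latitude,longitude,TotalEDInpatientVisits\n"
    "p001,35.9049,-79.0469,3\n"
)
integrated = []
for row in csv.DictReader(clinical_csv):
    geoid = geoid_lookup[(float(row["latitude"]), float(row["longitude"]))]
    row.update(acs[geoid])
    integrated.append(row)

print(integrated[0])
```

Because the ACS pull is time-invariant (one value per variable per location), no date matching is needed in step (c), unlike the daily CMAQ joins.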
Update on next steps:
@lstillwe @xu-hao @jameschump @stevencox @empfff @cbizon : Please confirm that the update above is accurate. Please also reply to any requests for status updates. Thanks!
Task (2) is complete. Task (3) is awaiting final confirmation from James that 'patient_num' is correct. Task (4) will be initiated the week of 5/7/2018; ETA is 5/11/2018. Task (5): initial development of the API is complete and awaiting the final integrated clinical feature tables. Data/server security issues were addressed by Chris Rutledge. Hao will work with Chris R. and other RCIADMIN staff to securely install a relational database (PostgreSQL, for example) on the encrypted partition of the server and create a service account on it for use by the DDCR API.
Update 5/09/2018:
Complete 05.16.18.
Integrated clinical feature tables can be found on Rockfish here:
/opt/RENCI/output/ClinicalFeatureVectors/1_0_0/PatientLevel/
/opt/RENCI/output/ClinicalFeatureVectors/1_0_0/VisitLevel/
Moved to ebcr0.edc.renci.org on 05.17.18.
Public access: ddcr.renci.org
Investigating an issue related to TotalEDInpatientVisits. James will double-check the code used to create the integrated files and verify that all column headers are accurate.
6/11/18: James fixed patient-level tables and sent them to Hao. Investigating missing data for visit-level tables.
6/15/2018: James generated new tables, double-checked column headers, conducted unit tests, and sent the tables to Hao. Hao will be loading the new tables to allow access through the API.
@jameschump @xu-hao @empfff @stevencox Given that we now have data on roadway exposures (2017 data) and a variety of socioenvironmental exposures (2012-2016, which is the most recent ACS sampling period), we probably should discuss plans for temporally expanding the integrated clinical feature tables beyond 2010 and 2011. This will require a new data pull for patients/visits in 2012-present.
We can do that, but I unfortunately will have to recode the whole thing b/c we'll be moving from legacy data to Epic data. I don't mind doing it--but what's the timeline for needing the new files?
Dependencies:
@empfff : Let's plan to move forward with this. Does Friday, 6/13 seem like a reasonable ETA? @jameschump : Did you have a chance to troubleshoot the average daily PM2.5 and ozone exposure estimates? @stevencox @xu-hao @empfff @jameschump : I've started exploring the data and will let you all know if I identify any additional anomalies. I am hoping to solicit some help with this and have reached out to Ashok to see if his summer student might be available to work on this project. If any of you feel the need to meet, please let me know.
@empfff : WRT a new data pull, we should consider adding select measures of pulmonary function (i.e., the ones that I highlighted in the doc you sent me a while back), as well as select feature variables of relevance to Fanconi anemia (i.e., metformin, WBC, HCT, PLT). The latter were identified during a discussion I had with Maureen Hoatlin, Miriam Udler, and Tyler Peryea during the May hackathon. Happy to discuss, if that would be helpful.
@karafecho when you say move forward with "this", can you clarify what the "this" is? is it a new data pull? If it's a new data pull, I can't do it by 7/13--i need more time than that. i have to re-code the whole DDCR logic for post-Epic data. good news though, PFTs are now in i2b2. should be able to add those to the data pull. i can also pull the labs you mentioned--is that list everything, or are there additional ones?
@empfff : No worries! Perhaps you can provide a more realistic ETA? Good news re PFTs. WRT the other labs, I think the list above, including the PFTs, is sufficient. For labs, is it possible to pull flags (e.g., abnormal)?
@karafecho let's say 7/27, but i will try to beat that date. Can you let me know exactly which years you want to see?
Also, for lab variables--how do you want to handle that in DDCR format? I'm assuming they'll go in the visit-level data only (doesn't make sense at the patient level)--but, there may be more than one instance of a given lab for a single visit. Do you want the "worst" value?
@empfff : For years, let's go with 2012, 2013, 2015, and 2016 for now, as we discussed during Monday's scrum call. If you have time to submit a modification so that we can pull data from 2017, great, but if not, no worries. For lab values, I could frame an argument for either the "worst" value or the most recent one. Perhaps we could capture both? For the visit-level tables, this should be pretty straightforward. For the patient-level tables, this may be more difficult but is of interest to Maureen Hoatlin and Miriam Udler. Perhaps we can capture the "worst" lab and the most recent one per patient per year? For all labs, is it possible to capture "flags"?
Please push back if any of my suggestions complicate things too much on your end.
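The "worst" vs. "most recent" lab capture discussed above can be sketched as a grouping pass per patient, lab, and year. The definition of "worst" here (prefer flagged values, then the largest absolute value) is an assumption for illustration only; the real rule would need to come from the clinical team, and the records are hypothetical:

```python
from collections import defaultdict
from datetime import date

# Hypothetical lab records: (patient, lab, date, value, flag)
records = [
    ("p001", "WBC", date(2012, 3, 1), 4.5, ""),
    ("p001", "WBC", date(2012, 7, 9), 14.2, "H"),   # flagged high
    ("p001", "WBC", date(2012, 11, 2), 6.1, ""),
]

# ASSUMED definition of "worst": flagged values first, then largest magnitude
def worst(vals):
    return max(vals, key=lambda r: (r[4] != "", abs(r[3])))

def most_recent(vals):
    return max(vals, key=lambda r: r[2])

by_key = defaultdict(list)
for r in records:
    by_key[(r[0], r[1], r[2].year)].append(r)   # group per patient/lab/year

summary = {
    k: {"worst": worst(v)[3], "worst_flag": worst(v)[4],
        "most_recent": most_recent(v)[3]}
    for k, v in by_key.items()
}
print(summary[("p001", "WBC", 2012)])
```

Capturing both values (plus the flag) per patient per year, as Kara suggests, is cheap once the grouping exists; the visit-level case is the same grouping keyed on visit instead of year.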
7/6/2018: James corrected the anomaly with the PM2.5 values in the patient-level tables. Hao updated the API.
@empfff @jameschump : I thought I'd send you a quick note re additional years of patient data. As we discussed during Monday's scrum call, the CMAQ data are available from 2010-2014. 2015 CMAQ data likely will be available by the end of the year. Beyond that, IE is uncertain whether they will be able to obtain additional CMAQ estimates. So, the integration will be with the ACS data and the roadway data. I think that's fine for now.
so, @jameschump is working on linking the 2015 files to ACS and roadway... but do you want me to then nix 2016, and instead pull 2012 and 2013? @karafecho
@empfff : The 2016 files remain of interest, but the 2012 and 2013 files are probably higher priority, given that we have CMAQ estimates for those years. So, yes, please move forward with your plan and pull the 2012 and 2013 data.
Received additional approval from CDWH Oversight Committee on 9/6/18 to: (1) work with UNC's IRB and the CDWH Operations Committee on any plans to develop new ICEES cohorts; (2) disseminate ICEES beyond the Translator program; and (3) allow a variety of programming techniques to query the ICEES API.
@jameschump @xu-hao @empfff : I'm wondering where things stand with the above, in terms of integration of the new patient files with the other data sources, binning of variables, and subsequent update of the ICEES API.
@karafecho @empfff @xu-hao, I am only seeing cmaq data for 2010 and 2011 on Rockfish so I can't move forward until 2012 and 2013 are available.
Sorry guys this is my fault - Kara asked to get the CMAQ data to you. Where can I put it so you have access to it?
Sarav - originally put it on longleaf - can you access it there?
@lstillwe if you don't have access to Rockfish, I can help you transfer the files. Could you post the format information so James knows how to work with the new files?
The new files should be in the same format as 2010 and 2011, right? I shouldn't have to change anything in order to run them?
Thanks Hao!
Here is the current location and format information:
[lisa@iren2 new_cmaq_data]$ pwd
/projects/datatrans/new_cmaq_data
[lisa@iren2 new_cmaq_data]$ ls merged*
merged_cmaq_2010.csv merged_cmaq_2011.csv merged_cmaq_2012.csv merged_cmaq_2013.csv merged_cmaq_2014.csv
[lisa@iren2 new_cmaq_data]$ more merged_cmaq_2014.csv
Date,FIPS,Longitude,Latitude,pm25_daily_average,pm25_daily_average_stderr,ozone_daily_8hour_maximum,ozone_daily_8hour_maximum_stderr
2014/01/01,1001020100,-86.49001,32.47718,21.444000000000003,21.3516,16.335,5.0983
Lisa
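For reference, a merged CMAQ file in the format Lisa shows can be read with the standard csv module; the row below is copied from her `more` output (this is just a reading sketch, not James's integration code):

```python
import csv, io

# Header and one data row, exactly as in merged_cmaq_2014.csv above
sample = io.StringIO(
    "Date,FIPS,Longitude,Latitude,pm25_daily_average,"
    "pm25_daily_average_stderr,ozone_daily_8hour_maximum,"
    "ozone_daily_8hour_maximum_stderr\n"
    "2014/01/01,1001020100,-86.49001,32.47718,"
    "21.444000000000003,21.3516,16.335,5.0983\n"
)

rows = list(csv.DictReader(sample))
row = rows[0]
pm25 = float(row["pm25_daily_average"])
year, month, day = row["Date"].split("/")   # Date is YYYY/MM/DD
print(year, row["FIPS"], round(pm25, 3))
```

Since the 2012-2014 files share this header with the 2010/2011 files, James's existing integration code should run on them unchanged, as he asks above.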
@jameschump the files are uploaded to:
/opt/RENCI/merged_cmaq_201*
Please see Lisa's comment above for the format information.
ETA for integrating FHIR data: two weeks from now (Oct 15).
Develop the tools to join clinical, chemical exposures, and socioenvironmental exposures data and the aggregation pipeline to create the data set underlying the EBCR service in support of the overall clinical feature vector hackathon goal.
@karafecho @lstillwe @cbizon