Open karafecho opened 6 years ago
Preliminary meetings have taken place and initial scientific and technical use cases have been developed. ROBOKOP code and data sources are being explored.
Meeting with Charles, Shepherd, and Christine scheduled for 1 pm, Wednesday, April 18.
Data usage agreement signed by NIEHS with redlining. Sent back to UNC; awaiting acceptance/modifications in order to obtain HuSH+ data.
Stood up a backend for the chemical similarity service. Now working on a Smart API for it, after which it will be added to Rosetta.
DUA accepted by NIEHS. HuSH+ dataset received.
Charles Schmitt's team is planning to develop a workflow to address toxic agent-centric questions such as those listed here. Relevant NIEHS data sources will be identified for eventual exposure as a Translator ChemTox Smart API. Resources are limited, however, so this is a low-priority project for NIEHS.
7/20/2018 - Alex, Steve, Chris, and I met with Charles' Tox group and Data Science group to provide an overview of the Translator program and Green/Gamma's role in the program. We requested their assistance with Module 3, MVP1.
@cpschmitt : Please provide an update.
08/02/18 - Several updates:
On the clinical side, we're submitting a request this week to the EPR contract team to pull genotype data and linking hashcodes. The genotype data will be limited for now to TLR4-related SNPs. This will allow us to include the genotype data in the UNC clinical service and validate prior findings on the association of TLR4 with asthma and distance to roadway (the paper on this from Shepherd was just published). The linking hashcodes are those that UNC has on their own paper through the work that David Borland has done with Tracs on linking UNC and EPR patients (algorithm from Ashok).
On the clinical side, we are developing a second use case around immune mediated diseases with Dr. Fred Miller at NIEHS. I'll work with Kara on the formulation of the use case. Presumably we would take the same approach as with Asthma to augment the UNC clinical service.
Also, the EPR contract team is starting to draft documents with UNC for a more general data sharing arrangement between UNC and NIEHS based on several prior conversations with UNC. Dave Peden has agreed to serve as the UNC PI on this.
On the tox side, Resham Kulkarni is taking a first pass at the chemical-to-tox-phenotype use case that Scott Auerbach had mentioned and how it relates to Translator and Module 3. She should have that this week. I'll work on it next, then have Scott review it. Then we'll run it by the Green team.
09/06/18 update
From Charles:
I'll have Sue Nolte get a RENCI account (if someone can tell me where we request one) and install Tox21 Enricher on a server at RENCI. We should also have a phone call to briefly walk through the database and source code.
I'll investigate which of the annotation classes within Enricher have linkages to other ontologies/terminologies and try to find more documentation on the annotation classes (in particular DrugMatrix).
One of us will take the lead on defining an API. I'm happy to start this if you want, but I'm equally happy to review one.
RENCI will implement the API.
Independent of this, I'll be looking more closely at AOPs for NTP's purposes and will let you know if there's a way to link those into Translator (although I'd encourage you to look as well if you haven't).
From Kara:
Consulted with Charles on (1) plans to generate a new ICEES cohort (i.e., new tables) using the EPR, which includes an asthma sub-cohort; and (2) a new EPR use case on immune-mediated disease
Update on Green/Gamma collaboration with NIEHS (Charles Schmitt), 11/2/2018:
Not discussed, but referred to during the meeting: Ilya Baldin’s NSF ImPACT award and ICEES as a use case for secure multiparty computation
Modification to approved Green Team IRB protocol on asthma-like patients:
Goal: We seek to add patient-level data elements to inform the existing research study. The new data elements will include a limited set of genotype calls and responses to two survey questions (EPR Health and Exposure Survey, EPR Exposome Survey) that relate to the current Asthma use case. The data elements are available from the Environmental Polymorphism Registry (EPR), a longitudinal research study being conducted by the NIEHS. The EPR has enrolled around 19,000 subjects in North Carolina, including approximately 5,000 subjects who are in the UNC CDW-H (as determined from a prior UNC-NIEHS study). We plan to add data elements from the EPR only for those subjects who are in both the EPR and the cohort for this study.
The genotype data is focused on a small set of variants (4 total) that relate to the TLR4 pathway. The survey questions include a broad range of questions related to environmental exposures and general health. In a prior EPR-based study, we found significant differences in susceptibility to Asthma for patients based on distance to roadways and based upon their genotype for these variants. Under this modification, we will extend the existing Asthma analysis to include these genotypes and attempt to confirm the prior findings. We also plan to incorporate the survey questions and genotype data into the existing study analysis in order to uncover potential relationships between Asthma outcomes, clinical data, environmental exposure measures, survey responses, and genotypes. We note that this work is only meant to extend the set of data elements in the existing study and not to pursue additional research goals.
Safeguards: In prior work between UNC and EPR, we have developed a method to link records from patients seen at both institutions based upon the use of one-way hash codes. The hash codes are derived from identifying information (age, sex, name, address), although once computed the identifying information is no longer needed and cannot be regenerated from the hash codes. We plan to use the hash codes to link the existing UNC study data with the EPR data elements. As UNC and EPR share identical hash codes, we can add the EPR data elements to existing de-identified study data at UNC. Thus the EPR data elements will not be added to identified patient records, and the NIEHS will not transfer identifying information to UNC. EPR will provide the data elements with the hash codes using secure file transfer.
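The linkage described under Safeguards can be sketched roughly as follows. This is an illustration only: the normalization rules, the use of keyed SHA-256, the shared key, and every field and function name here are assumptions, not the actual UNC/EPR algorithm.

```python
import hashlib
import hmac

# Hypothetical shared secret agreed between the two sites (assumption).
SHARED_KEY = b"example-key-not-real"

def link_hash(name, sex, age, address):
    """Return a one-way hash code for a person (illustrative only)."""
    # Normalize so trivial formatting differences do not change the hash.
    parts = [
        name.strip().lower(),
        sex.strip().lower(),
        str(age),
        " ".join(address.lower().split()),
    ]
    message = "|".join(parts).encode("utf-8")
    # Keyed SHA-256: the identifiers cannot be read back from the digest.
    return hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()

def link_records(unc_rows, epr_rows):
    """Attach EPR data elements to de-identified UNC rows by hash code."""
    epr_by_hash = {row["hash"]: row for row in epr_rows}
    linked = []
    for row in unc_rows:
        match = epr_by_hash.get(row["hash"])
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != "hash"})
            linked.append(merged)
    return linked

# Made-up example: each site computes the hash locally and shares only the hash.
h = link_hash("Jane Doe", "F", 42, "1 Main St")
unc = [{"hash": h, "study_id": "S1"}]
epr = [{"hash": h, "tlr4_variant": "example"}]
linked = link_records(unc, epr)
```

Only the hash column crosses institutional boundaries in this sketch; the identifying fields stay at the site that computed it.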
@schmittcp will work with @lstillwe and Sue Nolte to develop an API for Tox21 Enricher data. Charles, please send Sue's GitHub user name to @rayi113 so that he can add her to this repo.
@schmittcp @szcc @lstillwe : I'm hoping that the three of you can coordinate on the Tox21 Enricher API.
Update, 4/19/19:
Update on status of ongoing projects:
Replicate NIEHS EPR study using shared patients in EPR and Translator asthma-like cohort (Charles, Kara)
Add EPR data to new ICEES tables for years 2010-2016, to be generated using CAMP FHIR PCORnet-> FHIR data conversion pipeline and FHIR PIT data integration pipeline (Kara, Emily, Hao)
Add EPR flags to overlapping patients in UNC i2b2
Analyze agreement between UNC ICD9/10 codes and EPR self-reported diseases
Develop a standard procedure for cross UNC-NIEHS studies
ROBOKOP exploration of chemical-disease IMD clusters
ROBOKOP exploration of chemical-gene IMD clusters
Independent batch pull of exposures data, for integration with EPR data
EHP manuscript is in preparation
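Latest cross walk data for EPR data results in
- 213 matches with 2012 fhir data and
- 42 matches with icees table.
The difference between these two data sets include:
- rows with no lat, lon
- rows with age >= 90
which are excluded from icees table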
We have some Tox21 APIs, but currently they are behind a firewall. We'll make them publicly available ASAP.
@szcc : Sounds great.
Forgive my ignorance, but I don't recognize your user name. Perhaps you can remind me?
Sue Nolte from NIEHS
👍
Emily to resolve hashing issues the week of January 13, 2020.
@xu-hao : I checked with the EPR folks regarding the EPR "HE_COMPLETION_DATE" variable. This is the variable we should use for integration. So, we should examine exposures during the year prior to survey completion for the CMAQ airborne exposures data. For the ACS socio-economic data, we should use the 5-year block in which the survey was completed. For the roadway data, 2016 is the only option. Does this seem reasonable? Will this present any major challenges?
@xu-hao : To clarify the above plan, we currently have ICEES structured in a somewhat random manner. Specifically, for patient tables, ages are calculated with respect to January 1 of each one-year 'study' period, and exposures and outcomes are examined over the same one-year 'study' period. For visit tables, ages are calculated with respect to visit date, exposures are examined over the 24 hours prior to visit date, and outcomes are examined with respect to the visit itself.
The EPR data are structured differently. Specifically, ages are calculated with respect to the date of survey completion ("HE_COMPLETION_DATE"), and outcomes are reported at the time of survey completion (although some of the variables refer to lifetime metrics). I think it makes the most sense to examine exposures over the one-year period prior to survey completion. If this is too challenging, or will take too much time, then we can compromise, at least for the demonstration project, and examine exposure over the year in which the survey was completed. In other words, if a participant completed the survey on 6/1/2016, then we would examine exposures over the course of 2016.
Does this make sense? If so, which of the above plans is the most feasible?
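To make the two options concrete, here is a minimal sketch of the exposure window each plan would produce for a given HE_COMPLETION_DATE. The function names are hypothetical, and whether the first day of the prior-year window is included is an assumption (this version starts the day after the one-year anniversary).

```python
from datetime import date, timedelta
from typing import Tuple

def year_prior_window(completion: date) -> Tuple[date, date]:
    """Preferred plan: the one-year period ending on the completion date."""
    try:
        anniversary = completion.replace(year=completion.year - 1)
    except ValueError:
        # Completion fell on Feb 29 and the previous year has no such day.
        anniversary = completion.replace(year=completion.year - 1, day=28)
    return anniversary + timedelta(days=1), completion

def calendar_year_window(completion: date) -> Tuple[date, date]:
    """Compromise plan: the calendar year in which the survey was completed."""
    return date(completion.year, 1, 1), date(completion.year, 12, 31)

completion = date(2016, 6, 1)
print(year_prior_window(completion))     # 2015-06-02 through 2016-06-01
print(calendar_year_window(completion))  # 2016-01-01 through 2016-12-31
```

The compromise plan is simpler to compute, but for this example it includes exposures from the seven months after the survey was completed.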
@xu-hao : let's discuss this tomorrow (Tuesday, 1/14/20).
From Emily, new hash matching results, 1/10/2020:
UNCHCS denominator: 2,770,607 patients
NIEHS EPR denominator: 19,388 participants
Matched: 7,233 people (37% of all EPR participants)
From Emily, 1/14/2020:
Crosswalk is up on Rockfish, in /opt/RENCI/output/FHIR. Filename is UNC_NIEHS_XWalk_for_Hao.csv. The UNC identifier should match the patient IDs you have in the FHIR files. The hash is what will match with NIEHS.
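A rough sketch of how the crosswalk could be used: look up each FHIR patient ID to get its hash, then use the hash to find the NIEHS record. The column names `unc_id` and `hash` are assumptions (the actual headers in the CSV are not stated here), and the sample rows are made up.

```python
import csv
import io

def load_crosswalk(handle):
    """Map UNC patient ID -> hash code from a crosswalk CSV."""
    return {row["unc_id"]: row["hash"] for row in csv.DictReader(handle)}

def match_epr(fhir_patient_ids, crosswalk, epr_by_hash):
    """Yield (UNC patient ID, EPR record) for patients found on both sides."""
    for pid in fhir_patient_ids:
        h = crosswalk.get(pid)
        if h is not None and h in epr_by_hash:
            yield pid, epr_by_hash[h]

# Tiny in-memory stand-in for the real file; all values are invented.
sample = io.StringIO("unc_id,hash\nP1,aaa\nP2,bbb\n")
crosswalk = load_crosswalk(sample)
epr_by_hash = {"aaa": {"D28_Asthma": "1"}}
matches = list(match_epr(["P1", "P3"], crosswalk, epr_by_hash))
print(matches)
```

With the real file, `load_crosswalk` would be given an open handle to the CSV instead of the in-memory sample.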
Update from Kara on EPR asthma cohort data, 01.16.20:
N = 4129 total, all with SNP data
N = 2709 with HE_COMPLETION_DATE
- Of those 2709, 4 participants have HE_COMPLETION_DATE in 2014 and 5 have HE_COMPLETION_DATE in 2018; the remainder are all 2012 and 2013
N = 2637 with HE_COMPLETION_DATE and D28_Asthma = 0,1
- Of those 2637, 1 participant has HE_COMPLETION_DATE in 2014; the remainder are all 2012 and 2013
- If 5% overlap with UNC ICEES asthma, then N = 132
- If 37% overlap with UNC ICEES asthma, then N = 976
N = 928 of those 2637 with HE_COMPLETION_DATE and D28_Asthma = 1
- If 5% overlap with UNC ICEES asthma, then N = 46
- If 37% overlap with UNC ICEES asthma, then N = 343
N = 2593 with D28_Asthma = 0,1 and TLR4_DIST_1X
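The projected overlap counts above are simple proportions of the EPR cohort sizes, rounded to the nearest whole participant. As a quick sanity check (the function name is hypothetical; the 37% figure corresponds to the hash-match rate of 7,233 out of 19,388 reported on 1/10/2020):

```python
def projected_overlap(n: int, rate: float) -> int:
    """Expected number of participants shared with the UNC cohort."""
    return round(n * rate)

# Matched people divided by the EPR denominator, from the 1/10/2020 results.
match_rate = 7233 / 19388
print(f"hash match rate: {match_rate:.1%}")

for n in (2637, 928):
    low = projected_overlap(n, 0.05)
    high = projected_overlap(n, 0.37)
    print(f"N = {n}: 5% overlap -> {low}, 37% overlap -> {high}")
```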
Additional update from Kara, 01.16.20:
Hao has lat/longs and addresses for all 2709 participants with HE_COMPLETION_DATE, so we can integrate the exposures data for the 2705 participants with HE_COMPLETION_DATE in 2012 or 2013, excluding the 4 participants with HE_COMPLETION_DATE in 2014.
We will use the calendar year prior to HE_COMPLETION_DATE to determine airborne pollutant exposure estimates, using the same calculations for AVG and MAX exposure that we currently use for the ICEES integrated feature tables, but expanding from PM2.5 and ozone to include the eight additional airborne pollutant exposure estimates that we now have.
In other words, we'll calculate one-year exposures over the year prior to survey completion. So, if someone completed the survey on 7/1/2012, then we'll calculate exposures from 7/1/2011 - 7/1/2012 (or 7/2/2011 - 7/1/2012).
For the ACS data, we will use the 2012-2016 estimates.
For the roadway data, we do not really have a choice, as we only have 2016 data.
With respect to column headers for the EPR data, we'll create two sets: one for the UNC data and one for the EPR data.
We'll integrate the UNC and EPR data over all available years to date, i.e., 2010-2016.
In addition to the above plans for integration of UNC and NIEHS EPR asthma cohort data, we will stand up a private ICEES API at NIEHS.
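For the airborne pollutant plan above, a minimal sketch of an AVG/MAX summary over the one-year window might look like the following. The actual ICEES calculations are not spelled out in this thread, so the assumption here is one estimate per day per participant location; the function name and all values are made up.

```python
from datetime import date
from statistics import mean

def summarize_exposure(daily, start, end):
    """Return (AVG, MAX) of daily values whose date falls in [start, end]."""
    values = [v for d, v in daily.items() if start <= d <= end]
    if not values:
        return None, None
    return mean(values), max(values)

# Invented daily PM2.5 estimates for one location.
daily_pm25 = {
    date(2011, 7, 1): 9.0,    # before the window
    date(2011, 7, 2): 10.0,
    date(2012, 1, 15): 14.0,
    date(2012, 7, 1): 12.0,   # survey completion date
    date(2012, 7, 2): 30.0,   # after the window
}

# Window for a survey completed on 7/1/2012, using the 7/2/2011 start.
avg, peak = summarize_exposure(daily_pm25, date(2011, 7, 2), date(2012, 7, 1))
print(avg, peak)
```

The same function would be called once per pollutant, so extending from PM2.5 and ozone to the additional pollutants only changes the inputs.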
@charles.schmitt@nih.gov : Please let @xuhao@renci.org and me know how we can move this effort forward as quickly as possible. Thanks!
Clarification to comment from @xu-hao on November 28, 2019:
Original comment
Latest cross walk data for EPR data results in
213 matches with 2012 fhir data and
42 matches with icees table.
The difference between these two data sets include:
rows with no lat, lon
rows with age >= 90
which are excluded from icees table
Correction
The matches noted above represent a two-way join between the original cross-walk for UNC-EPR data (i.e., just the hash match) and both the UNC FHIR files for asthma cohort and the final ICEES integrated feature tables for asthma cohort, which are derived from the FHIR files. The hash codes for the original cross-walk between UNCHCS and NIEHS EPR did not align; i.e., the UNCHCS hash codes differed in format from the NIEHS EPR hash codes.
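As an illustration of why the FHIR match count can exceed the ICEES match count, the two exclusions quoted in the original comment (no lat/lon, age >= 90) can be applied after the join. Field names, function names, and rows below are hypothetical.

```python
def icees_eligible(row):
    """Apply the two quoted exclusions: missing lat/lon, or age >= 90."""
    has_location = row.get("lat") is not None and row.get("lon") is not None
    return has_location and row["age"] < 90

def count_matches(rows, crosswalk_ids):
    """Count crosswalk matches before and after the ICEES exclusions."""
    matched = [r for r in rows if r["patient_id"] in crosswalk_ids]
    return len(matched), sum(icees_eligible(r) for r in matched)

# Invented rows standing in for the FHIR-derived patient data.
rows = [
    {"patient_id": "P1", "lat": 35.9, "lon": -79.0, "age": 40},
    {"patient_id": "P2", "lat": None, "lon": None, "age": 55},
    {"patient_id": "P3", "lat": 35.8, "lon": -78.6, "age": 92},
    {"patient_id": "P4", "lat": 36.0, "lon": -78.9, "age": 30},
]
fhir_matches, icees_matches = count_matches(rows, {"P1", "P2", "P3"})
print(fhir_matches, icees_matches)
```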
This issue involves a new collaboration with Charles Schmitt and Shepherd Schurman of NIEHS to investigate the use of the Translator/Reasoner architecture in the context of NIEHS clinical data sources (i.e., EPR, CRU), knowledge sources (i.e., Tox21, DrugMatrix), and new toxicology use cases.