ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License

Green Team - NIEHS Collaboration #114

Open karafecho opened 6 years ago

karafecho commented 6 years ago

This issue involves a new collaboration with Charles Schmitt and Shepherd Schurman of NIEHS to investigate the use of the Translator/Reasoner architecture in the context of NIEHS clinical data sources (i.e., EPR, CRU), knowledge sources (i.e., Tox21, DrugMatrix), and new toxicology use cases.

karafecho commented 6 years ago

Preliminary meetings have taken place and initial scientific and technical use cases have been developed. ROBOKOP code and data sources are being explored.

karafecho commented 6 years ago

Meeting with Charles, Shepherd, and Christine scheduled for 1 pm, Wednesday, April 18.

karafecho commented 6 years ago

DUA accepted by NIEHS. HuSH+ dataset received.

karafecho commented 6 years ago

Charles Schmitt's team is planning to develop a workflow to address toxic agent-centric questions such as those listed here. Relevant NIEHS data sources will be identified for eventual exposure as a Translator ChemTox Smart API. Resources are limited, however, so this is a low-priority project for NIEHS.

karafecho commented 6 years ago

7/20/2018 - Alex, Steve, Chris, and I met with Charles' Tox group and Data Science group to provide an overview of the Translator program and Green/Gamma's role in the program. We requested their assistance with Module 3, MVP1.

@cpschmitt : Please provide an update.

schmittcp commented 6 years ago

08/02/18 - Several updates:

On the clinical side, we're submitting a request this week to the EPR contract team to pull genotype data and linking hashcodes. The genotype data will be limited for now to TLR4-related SNPs. This will allow us to include the genotype data in the UNC clinical service and validate prior findings on the TLR4 association with asthma and distance to roadway (the paper on this from Shepherd was just published). The linking hashcodes are those that UNC has on their own paper through the work that David Borland has done with Tracs on linking UNC and EPR patients (algorithm from Ashok).

On the clinical side, we are developing a second use case around immune-mediated diseases with Dr. Fred Miller at NIEHS. I'll work with Kara on the formulation of the use case. Presumably we would take the same approach as with asthma to augment the UNC clinical service.

Also, the EPR contract team is starting to draft documents with UNC for a more general data sharing arrangement between UNC and NIEHS based on several prior conversations with UNC. Dave Peden has agreed to serve as the UNC PI on this.

On the tox side, Resham Kulkarni is taking a first pass at the chemical-to-tox-phenotype use case that Scott Auerbach had mentioned and how it relates to the Translator and Module 3. She should have that this week. I'll work on it next, then have Scott review it. Then we'll run it by the Green team.

karafecho commented 6 years ago

09/06/18 update

From Charles:

Independent of this, I'll be looking at AOPs closer for NTP's purposes and will let you know if there's a way to link those into translator (although I'd encourage you to also look if you haven't).


From Kara:

Consulted with Charles on (1) plans to generate a new ICEES cohort (i.e., new tables) using the EPR, which includes an asthma sub-cohort; and (2) a new EPR use case on immune-mediated disease

karafecho commented 6 years ago

Update on Green/Gamma collaboration with NIEHS (Charles Schmitt), 11/2/2018:

  1. EPR asthma use case - tagged EPR participants in CDWH for incorporation into ICEES; goal is to replicate a prior study; genome-wide sequencing of all EPR participants is expected to be complete by Fall 2019; targeting overlapping population initially (roughly 5000 patients)
  2. EPR immune-mediated disease use case - common risk factors and biological processes across immune-mediated diseases; SMEs to determine risk factors and biological processes; EPR for validation; CTD for clustering on genes, chemicals, diseases/phenotypes; ROBOKOP and neo4j for identifying additional genes, pathway, diseases/phenotypes; ??? tool for visualizing and curating bipartite graphs
  3. Tox21 Enricher - associations between chemicals and Tox21 chemicals (10,000), assays (300); includes Drug Matrix, Leadscope; chemical structure similarity capability; toxicity predictions; need to map assays to ontologies; data to be stored at RENCI, NIEHS to develop API and send dictionary of desired data elements
  4. Not discussed, but referred to during the meeting: Ilya Baldin’s NSF ImPACT award and ICEES as a use case for secure multiparty computation

    Modification to approved Green Team IRB protocol on asthma-like patients:

Goal: We seek to add additional patient level data elements to inform the existing research study. The new data elements will include a limited set of genotype calls and responses to two survey questions (EPR Health and Exposure Survey, EPR Exposome Survey) that relate to the current Asthma use case. The data elements are available from the Environmental Polymorphism Registry (EPR), a longitudinal research study being conducted by the NIEHS. The EPR has enrolled around 19,000 subjects in North Carolina which includes approximately 5000 subjects who are in the UNC CDW-H (as determined from a prior UNC-NIEHS study). We plan to add data elements from the EPR to only those subjects that are in the EPR and are in the cohort for this study.

The genotype data is focused on a small set of variants (4 total) that relate to the TLR4 pathway. The survey questions include a broad range of questions related to environmental exposures and general health. In a prior EPR-based study, we found significant differences in susceptibility to asthma for patients based on distance to roadways and based upon their genotype for these variants. Under this modification, we will extend the existing asthma analysis to include these genotypes and attempt to confirm the prior findings. We also plan to incorporate the survey questions and genotype data into the existing study analysis in order to uncover potential relationships between asthma outcomes, clinical data, environmental exposure measures, survey responses, and genotypes. We note that this work is only meant to extend the set of data elements in the existing study and not to pursue additional research goals.

Safeguards: In prior work between UNC and EPR, we have developed a method to link records from patients seen at both institutions based upon the use of one-way hash codes. The hash codes are derived from identifying information (age, sex, name, address), although once computed the identifying information is no longer needed and cannot be regenerated from the hash codes. We plan to use the hash codes to link the existing UNC study data with the EPR data elements. As UNC and EPR share identical hash codes, we can add the EPR data elements to existing de-identified study data at UNC. Thus, the EPR data elements will not be added to identified patient records and the NIEHS will not transfer identifying information to UNC. EPR will provide the data elements with the hash codes using secure file transfer.
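
For illustration, the linkage described under Safeguards could look roughly like the sketch below. The actual hashing algorithm, the field normalization, and whether a salt is used are not described in this thread, so the SHA-256 digest, the field order, the `SHARED_SALT` value, and the sample records are all assumptions.

```python
import hashlib

# Hypothetical shared secret; the real scheme (and whether it uses a salt) is not described here.
SHARED_SALT = "example-salt"


def link_hash(name: str, sex: str, age: int, address: str, salt: str = SHARED_SALT) -> str:
    """Derive a one-way hash code from normalized identifying fields."""
    # Normalize so that trivial formatting differences do not change the hash.
    fields = [
        name.strip().lower(),
        sex.strip().lower(),
        str(age),
        " ".join(address.lower().split()),
    ]
    payload = "|".join([salt] + fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def link_records(unc_rows: dict, epr_rows: dict) -> dict:
    """Join de-identified records that share a hash code.

    Both inputs map hash code -> data elements, so no identifying
    information is needed once the hashes have been computed.
    """
    shared = unc_rows.keys() & epr_rows.keys()
    return {h: {**unc_rows[h], **epr_rows[h]} for h in shared}


# Made-up records: the same person entered with different formatting at each site.
h1 = link_hash("Jane Doe", "F", 42, "1 Main St,  Chapel Hill")
h2 = link_hash(" jane doe ", "f", 42, "1 main st, chapel hill")
unc = {h1: {"asthma_visits": 3}}
epr = {h2: {"tlr4_variant": "A/G"}, "other": {"tlr4_variant": "A/A"}}
linked = link_records(unc, epr)
print(len(linked))
```

Both sites have to apply exactly the same normalization and hash format for this to work; otherwise the codes will not align even for the same person.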

karafecho commented 5 years ago

@schmittcp will work with @lstillwe and Sue Nolte to develop an API for Tox21 Enricher data. Charles, please send Sue's GitHub user name to @rayi113 so that he can add her to this repo.

karafecho commented 5 years ago

@schmittcp @szcc @lstillwe : I'm hoping that the three of you can coordinate on the Tox21 Enricher API.

karafecho commented 5 years ago

Update, 4/19/19:

  1. Publishing this TIDBIT (see comments here).
  2. Moving EPR data to the CDWH in an effort to expand ICEES to include data on subjects in the EPR, including EPR-specific survey findings.
  3. Comparing results from (2) with those that your folks are generating outside of a firewall via integration of EPR data with data from Green Team's Exposures Services.
  4. Replicating published results on roadway exposures/genes/asthma by way of (3).

karafecho commented 5 years ago

Update on status of ongoing projects:

  1. NIEHS EPR data (survey data, SNP data) and RENCI Translator ICEES asthma-like cohort (Shepherd, Charles, Kara, Emily, Hao)
  2. NIEHS EPR data and UNC i2b2 data (Shepherd, Charles, Emily, David Borland, Ashok Krishnamurthy)
  3. IMD subset of NIEHS EPR data and RENCI Translator ROBOKOP (Fred, Shepherd, Charles, Kara); Kara to coordinate with (Chris Bizon, Kenny Morton, Alex Tropsha, Eugene Muratov, Vini Alves, Joyce Borba)
  4. NIEHS EPR and RENCI Translator Environmental Exposure Services/APIs (Shepherd, Charles, Kara); Kara to coordinate with (Sarav Arunachalam, Alex Valencia Arias)

xu-hao commented 4 years ago

Latest cross walk data for EPR data results in

  • 213 matches with 2012 fhir data and
  • 42 matches with icees table.

The differences between these two data sets include:

  1. rows with no lat, lon
  2. rows with age >= 90

which are excluded from icees table
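
The two exclusion rules can be expressed as a simple row filter. This is only a sketch: the field names `lat`, `lon`, and `age` and the sample rows are hypothetical and not taken from the ICEES code.

```python
def eligible_for_icees(row: dict) -> bool:
    """Return True if a row passes both exclusion rules listed above."""
    has_location = row.get("lat") is not None and row.get("lon") is not None
    under_90 = row.get("age") is not None and row["age"] < 90
    return has_location and under_90


# Made-up rows for illustration.
rows = [
    {"id": "a", "lat": 35.9, "lon": -79.0, "age": 45},
    {"id": "b", "lat": None, "lon": None, "age": 45},   # no lat, lon: excluded
    {"id": "c", "lat": 35.9, "lon": -79.0, "age": 90},  # age >= 90: excluded
]
kept = [r["id"] for r in rows if eligible_for_icees(r)]
print(kept)
```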

szcc commented 4 years ago

We have some Tox21 APIs, but they are currently behind a firewall. We'll make them publicly available ASAP.


karafecho commented 4 years ago

@szcc : Sounds great.

Forgive my ignorance, but I don't recognize your user name. Perhaps you can remind me?

szcc commented 4 years ago

Sue Nolte from NIEHS


karafecho commented 4 years ago

👍

karafecho commented 4 years ago

Emily to resolve hashing issues the week of January 13, 2020.

karafecho commented 4 years ago

@xu-hao : I checked with the EPR folks regarding the EPR "HE_COMPLETION_DATE" variable. This is the variable we should use for integration. So, we should examine exposures during the year prior to survey completion for the CMAQ airborne exposures data. For the ACS socio-economic data, we should use the 5-year block in which the survey was completed. For the roadway data, 2016 is the only option. Does this seem reasonable? Will this present any major challenges?

karafecho commented 4 years ago

@xu-hao : To clarify the above plan, we currently have ICEES structured in a somewhat random manner. Specifically, for patient tables, ages are calculated with respect to January 1 of each one-year 'study' period, and exposures and outcomes are examined over the same one-year 'study' period. For visit tables, ages are calculated with respect to visit date, exposures are examined over the 24 hours prior to visit date, and outcomes are examined with respect to the visit itself.

The EPR data are structured differently. Specifically, ages are calculated with respect to the date of survey completion ("HE_COMPLETION_DATE"), and outcomes are reported at the time of survey completion (although some of the variables refer to lifetime metrics). I think it makes the most sense to examine exposures over the one-year period prior to survey completion. If this is too challenging, or will take too much time, then we can compromise, at least for the demonstration project, and examine exposures over the year in which the survey was completed. In other words, if a participant completed the survey on 6/1/2016, then we would examine exposures over the course of 2016.
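
To make the two options concrete, here is a small sketch of both candidate windows. The function names are hypothetical, and ending the trailing window on the completion date itself is an assumption.

```python
from datetime import date


def trailing_year_window(completion: date) -> tuple[date, date]:
    """Preferred option: one-year window ending on the survey completion date."""
    try:
        start = completion.replace(year=completion.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in the previous year; fall back to Feb 28.
        start = completion.replace(year=completion.year - 1, day=28)
    return start, completion


def calendar_year_window(completion: date) -> tuple[date, date]:
    """Compromise option: the calendar year in which the survey was completed."""
    return date(completion.year, 1, 1), date(completion.year, 12, 31)


completed = date(2016, 6, 1)
print(trailing_year_window(completed))
print(calendar_year_window(completed))
```

For a survey completed on 6/1/2016, the first option covers 6/1/2015 through 6/1/2016, while the second covers all of 2016, including the months after the survey was completed.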

Does this make sense? If so, which of the above plans is the most feasible?

@xu-hao : let's discuss this tomorrow (Tuesday, 1/14/20).

karafecho commented 4 years ago

From Emily, new hash matching results, 1/10/2020:

UNCHCS denominator: 2,770,607 patients

NIEHS EPR denominator: 19,388 participants

Matched: 7,233 people (37% of all EPR participants)
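
As a quick check of the reported percentage, the match rate can be recomputed from the counts above (the variable names are illustrative only):

```python
unchcs_patients = 2_770_607
epr_participants = 19_388
matched = 7_233

# Share of each denominator that was matched by hash code.
epr_match_rate = matched / epr_participants
unchcs_match_rate = matched / unchcs_patients

print(f"{epr_match_rate:.1%} of EPR participants matched")
print(f"{unchcs_match_rate:.2%} of UNCHCS patients matched")
```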

karafecho commented 4 years ago

From Emily, 1/14/2020:

Crosswalk is up on Rockfish, in /opt/RENCI/output/FHIR. Filename is UNC_NIEHS_XWalk_for_Hao.csv. The UNC identifier should match the patient IDs you have in the FHIR files. The hash is what will match with NIEHS.
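
A minimal sketch of how the crosswalk could be used to link the two sides. The real header of the crosswalk file is not shown in this thread, so the column names `unc_id` and `hash`, the inline CSV stand-in, and the sample values are all assumptions.

```python
import csv
import io

# Stand-in for the crosswalk file; the real column names are not given in this thread.
crosswalk_csv = io.StringIO("unc_id,hash\nP001,aaa111\nP002,bbb222\nP003,ccc333\n")

# Made-up examples of the two sides to be linked.
fhir_patient_ids = {"P001", "P003", "P999"}  # patient IDs found in the FHIR files
epr_by_hash = {"aaa111": {"D28_Asthma": 1}, "bbb222": {"D28_Asthma": 0}}

linked = {}
for row in csv.DictReader(crosswalk_csv):
    # Keep a crosswalk row only if both sides are present.
    if row["unc_id"] in fhir_patient_ids and row["hash"] in epr_by_hash:
        linked[row["unc_id"]] = epr_by_hash[row["hash"]]

print(linked)
```

Rows drop out on either side of the join: here P002 has no FHIR record and P003 has no EPR record, so only P001 is linked.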

karafecho commented 4 years ago

Update from Kara on EPR asthma cohort data, 01.16.20:

N= 4129 total, all with SNP data

N = 2709 with HE_COMPLETION_DATE

  • Of those 2709, 4 participants have HE_COMPLETION_DATE in 2014, and 5 have HE_COMPLETION_DATE in 2018; the remainder are all 2012 and 2013

N = 2637 with HE_COMPLETION_DATE and D28_Asthma = 0,1

  • Of those 2637, 1 participant has HE_COMPLETION_DATE in 2014; the remainder are all 2012 and 2013
  • If 5% overlap with UNC ICEES asthma, then N = 132
  • If 37% overlap with UNC ICEES asthma, then N = 975

Of those 2637, 928 have HE_COMPLETION_DATE and D28_Asthma = 1

  • If 5% overlap with UNC ICEES asthma, then N = 46
  • If 37% overlap with UNC ICEES asthma, then N = 343

N = 2593 with D28_Asthma = 0,1 and TLR4_DIST_1X

karafecho commented 4 years ago

Additional update from Kara, 01.16.20:

Hao has lat/longs and addresses for all 2709 participants with HE_COMPLETION_DATE, so we can integrate the exposures data for the 2705 participants with HE_COMPLETION_DATE in 2012 or 2013, excluding the 4 participants with HE_COMPLETION_DATE in 2014.

We will use the year prior to HE_COMPLETION_DATE to determine airborne pollutant exposure estimates, using the same calculations for AVG and MAX exposure that we currently use for the ICEES integrated feature tables, but expanding from PM2.5 and ozone to include the eight additional airborne pollutant exposure estimates that we now have.

In other words, we'll calculate one-year exposures over the year prior to survey completion. So, if someone completed the survey on 7/1/2012, then we'll calculate exposures from 7/1/2011 - 7/1/2012 (or 7/2/2011 - 7/1/2012).
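
As a rough sketch of that calculation: the exact AVG and MAX calculations used for the ICEES integrated feature tables are not spelled out in this thread, so the plain mean and maximum of daily values, the function name, and the made-up daily series below are assumptions. The window follows the second variant above (7/2/2011 - 7/1/2012).

```python
from datetime import date, timedelta
from statistics import mean


def one_year_exposure(daily: dict[date, float], completion: date) -> dict[str, float]:
    """Summarize daily values over the year ending on the completion date."""
    # Sketch only: assumes the completion date is not Feb 29.
    start = completion.replace(year=completion.year - 1)
    # Window is (start, completion], e.g. 7/2/2011 through 7/1/2012.
    values = [v for d, v in daily.items() if start < d <= completion]
    return {"AVG": mean(values), "MAX": max(values)}


# Made-up daily PM2.5 values: constant 10.0 with a single spike.
first = date(2011, 1, 1)
daily_pm25 = {first + timedelta(days=i): 10.0 for i in range(730)}
daily_pm25[date(2012, 3, 15)] = 40.0

summary = one_year_exposure(daily_pm25, date(2012, 7, 1))
print(summary)
```

The same function would be applied to each of the airborne pollutant series in turn.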

For the ACS data, we will use the 2012-2016 estimates.

For the roadway data, we do not really have a choice, as we only have 2016 data.

With respect to column headers for the EPR data, we'll create two sets: one for the UNC data and one for the EPR data.

We'll integrate the UNC and EPR data over all available years to date, i.e., 2010-2016.

karafecho commented 4 years ago

In addition to the above plans for integration of UNC and NIEHS EPR asthma cohort data, we will stand up a private ICEES API at NIEHS.

@charles.schmitt@nih.gov : Please let @xuhao@renci.org and me know how we can move this effort forward as quickly as possible. Thanks!

karafecho commented 4 years ago

Clarification to comment from @xu-hao on November 28, 2019:

Original comment

Latest cross walk data for EPR data results in

  • 213 matches with 2012 fhir data and
  • 42 matches with icees table.

The difference between these two data sets include:

  1. rows with no lat, lon
  2. rows with age >= 90

which are excluded from icees table

Correction

The matches noted above represent a two-way join between the original cross-walk for UNC-EPR data (i.e., just the hash match) and both the UNC FHIR files for asthma cohort and the final ICEES integrated feature tables for asthma cohort, which are derived from the FHIR files. The hash codes for the original cross-walk between UNCHCS and NIEHS EPR did not align; i.e., the UNCHCS hash codes differed in format from the NIEHS EPR hash codes.