data2health / nlp-sandbox

Cloud-based sandbox for text analytics
MIT License

Review documents on DeID work #5

Closed tschaffter closed 4 years ago

tschaffter commented 4 years ago

DeID work is part of the Phase III projects related to "Cloud-based Sandboxes"

Initial documentation is located here:

Tasks

tschaffter commented 4 years ago

Notes

iii.02 | NLP Sandbox - CD2H Phase III Project Proposal

This project will establish a cloud-based sandbox environment in which CTSA hubs can develop, evaluate, and share tools and methods.

Our objectives in doing so are to: 1) reduce redundancies in such efforts and increase economies-of-scale across the CTSA network; 2) ensure the reproducibility and rigor of such assessment tools and methods; and 3) expedite access to “best-of-breed” tools and methods by all CTSA network participants and partners.

As such, the project has three specific aims:

  1. To create a cloud-based environment that can enable the systematic verification and validation of text analytics tools to solve specific tasks (e.g., the “text analytics sandbox”); a sketch of what such a verification step might look like follows this list;
  2. To populate the “text analytics sandbox” with necessary and appropriate reference data sets to be used in shared verification/validation tasks; and
  3. To demonstrate the “text analytics sandbox” by engaging a group of CTSA hubs to contribute tools and methods to the project and demonstrate their performance, reproducibility, and rigor in such a shared environment.
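A rough, hypothetical sketch of the kind of verification step referenced in Aim 1 (none of the names, fields, or example values below come from the proposal; a real harness would also handle partial span overlap and per-type reporting):

```python
# Hypothetical sketch: scoring a de-identification tool's output against a
# reference annotation set, as a sandbox verification step might do.
# The Annotation class and score_annotations function are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the PHI span begins
    length: int   # span length in characters
    type: str     # e.g. "DATE", "PERSON_NAME"

def score_annotations(predicted, gold):
    """Exact-match precision/recall/F1 between two annotation sets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = [Annotation(18, 10, "DATE"), Annotation(45, 8, "PERSON_NAME")]
predicted = [Annotation(18, 10, "DATE")]
print(score_annotations(predicted, gold))
# {'precision': 1.0, 'recall': 0.5, 'f1': 0.666...}
```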

The expected impacts of this work are to (1) improve data-driven recruitment to clinical trials and clinical research; (2) transition real-world data to real-world evidence; (3) create essential infrastructure for a learning health system; (4) create the phenotyping necessary for precision health; and (5) pave the way for AI in digital health.

iii.03 | ML Sandbox Proposal (final)

Timeline: Jan-Jun 2020

This proposal addresses four pressing roadblocks for clinical ML: (1) accessing clinical data and preparing it for ML; (2) implementing state-of-the-art ML algorithms for use with clinical data; (3) avoiding sources of bias specific to clinical data; and (4) ensuring the reproducibility, documentation, and transparency of ML tools.
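As a purely illustrative sketch of roadblock (4), the following shows one simple way a sandbox ML tool could record a machine-readable trace of a run so that a reported result can be traced back to code, data, and environment (the file name, fields, and metric values are assumptions, not part of the proposal):

```python
# Hypothetical sketch: persist a minimal record of a training/evaluation run.
import json
import platform
import sys
from datetime import datetime, timezone

def write_run_manifest(path, seed, dataset_id, metrics):
    """Write a small, machine-readable manifest describing a run."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "random_seed": seed,
        "dataset_id": dataset_id,   # e.g. an identifier for a sandbox dataset
        "metrics": metrics,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

write_run_manifest("run_manifest.json", seed=42,
                   dataset_id="example-dataset-v1",
                   metrics={"auroc": 0.87})  # placeholder values
```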

Motivation for requesting an NCATS EC2:

Datasets. We will ask for CTSA datasets that can be stored in the NCATS-CD2H cloud, and for which we can set up a standard and relatively straightforward data access procedure. We will also ask for other datasets that might be useful, such as MIMIC-III.

NLP de-identification methods are part of a larger repository of de-identification (deid) methods that CD2H is building:

Ideally, we will obtain a mix of data types including EHR data (e.g., i2b2, OMOP, FHIR), images (e.g., retinal fundus, chest radiograph, brain CT), genomic and multi-omic datasets from clinical cohorts, and, in the future, other data sources including electrophysiology and wearables.
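As a rough, hypothetical illustration of what one of the de-identification methods mentioned above might look like when applied to such EHR free text (a production tool would rely on trained NER models rather than regular expressions; the patterns, function name, and example note below are invented):

```python
# Hypothetical toy de-identification method: regex-based redaction of dates
# and medical record numbers in a clinical note.
import re

PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def deidentify(note: str) -> str:
    """Replace each matched PHI span with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

note = "Patient seen on 03/14/2019, MRN: 00123456, follow-up in two weeks."
print(deidentify(note))
# Patient seen on [DATE], [MRN], follow-up in two weeks.
```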

We will consider this project successful in its first 6 months if we have obtained requirements and/or datasets from at least half of the CTSA hubs.

In the section Participants:

Justin Guinney, DREAM challenge using data from CAPB platform

In the section Deliverables:

The deliverable at month 6 will be both a GitHub repository with Python code and a technical section in the white paper we will submit to NCATS, detailing the proposed software architecture of the final system.

iii.06 | CD2H Data Quality Sandbox

This project will establish a cloud-based sandbox environment in which CTSA hubs can develop, evaluate, and share tools and methods for data quality assessment.

Such outreach and engagement will include: 1) direct interaction with the CTSA network informatics community-of-practice (IEC), both in-person and via collaborative web-based tools; 2) the establishment of a series of interactive webinars to introduce the sandbox environment and its capabilities to the CTSA network; and 3) the adoption of FAIR principles to guide all aspects of the project so as to ensure the rapid shareability of all research products and knowledge generated therein.

Measures of success:

We plan to assess the success of this project using a combination of process measures (users, access to the sandbox, and the number of tools/methods and reference data sets contributed to the environment) and outcome measures (new tools that are verified/validated and subsequently adopted by other CTSA hubs, and new studies or research programs with demonstrable outcomes that utilized tools/methods developed and demonstrated in the sandbox environment).