UC: Cross-datasets discoverability and use within CFDE and a cloud-workspace environments

ericwenger-pm commented 1 year ago

Title: Cross-datasets discoverability and use within CFDE and a cloud-workspace environments

Scientific Question/Use Case: A key challenge in data discovery is coordinating and assembling datasets from across Common Fund Data Ecosystem (CFDE) Data Coordination Centers (DCCs) in an easy-to-use and meaningful manner that accelerates their use by researchers. This requires 1) letting end users find and assemble cohorts of interest, 2) letting systems authenticate users and authorize access to the data across resources, and 3) supporting cloud-based workspace analysis that limits the need to download data and lets an end user bring these data together in a common workspace. We have implemented a manifest-based import on our CAVATICA platform so that a user can create a cross-Common Fund dataset cohort, in order to accelerate platform-based discovery and clinical translation.

One-pager description: In this use case, a manifest covering GTEx and Kids First (KF) neuroblastoma RNA sequencing assays is created from the Common Fund Data Ecosystem portal and brought into a collaborative CAVATICA workspace. The user selects GTEx and Kids First data sets in the CFDE Portal, focusing on RNA sequencing assays, and creates a manifest that follows NCPI file manifest standards. The data are imported into a CAVATICA user project, where CAVATICA checks the user's dbGaP access; the data become accessible only if the user has proper authorization.
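The authorization gate described above can be sketched as follows. This is a minimal illustration, not CAVATICA's implementation: the column names (`drs_uri`, `file_name`, `study_accession`), the accession values, and the helper functions are all hypothetical placeholders and do not reproduce the actual NCPI file manifest schema.

```python
import csv
import io

# Hypothetical manifest: column names and values are placeholders chosen
# for illustration, not the actual NCPI file manifest schema.
MANIFEST_TSV = (
    "drs_uri\tfile_name\tstudy_accession\n"
    "drs://example.org/obj-001\tgtex_sample.bam\tphs-EXAMPLE-GTEX\n"
    "drs://example.org/obj-002\tkf_sample_1.bam\tphs-EXAMPLE-KF\n"
    "drs://example.org/obj-003\tkf_sample_2.bam\tphs-EXAMPLE-KF\n"
)


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def split_by_authorization(rows, authorized_studies):
    """Split DRS URIs by whether the user is authorized for their study."""
    accessible, blocked = [], []
    for row in rows:
        if row["study_accession"] in authorized_studies:
            accessible.append(row["drs_uri"])
        else:
            blocked.append(row["drs_uri"])
    return accessible, blocked


rows = read_manifest(MANIFEST_TSV)
# Assume the user holds approval for only one of the two placeholder studies.
accessible, blocked = split_by_authorization(rows, {"phs-EXAMPLE-KF"})
print("accessible:", accessible)
print("blocked:", blocked)
```

In this sketch a single manifest can mix files from several DCCs, and the per-file check decides what becomes usable in the project.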

The combination of interoperable metadata and a harmonized computational framework for RNAseq sets the stage for cross-DCC analyses and tool development. We used the study “Identification of GPC2 as an Oncoprotein and Candidate Immunotherapeutic Target in High-Risk Neuroblastoma” to drive a use-case-based demonstration of the manifest-based import of controlled-access data from different DCCs into a user-managed workspace. Importing the cross-cohort CFDE manifest via DRS into the CAVATICA workspace obviates the need to create a secondary, DCC-specific process for copying data, while providing a seamless authentication/authorization framework. As part of a CFDE assessment, the data from GTEx and Kids First were harmonized with the Kids First RNA-Seq Workflow public app.
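To illustrate why a DRS-based manifest avoids copying data, the sketch below shows how a hostname-based DRS URI maps to the HTTPS endpoint that returns the object's metadata, following the GA4GH DRS v1 convention (`drs://<hostname>/<id>` becomes `https://<hostname>/ga4gh/drs/v1/objects/<id>`). The function name and the example hostname are assumptions; compact-identifier URIs (of the form `drs://prefix:accession`) need a separate resolver lookup and are deliberately rejected here.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to its DRS v1 object endpoint.

    Compact-identifier URIs (drs://prefix:accession) are not handled;
    they have no path component and raise ValueError below.
    """
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object id: {drs_uri!r}")
    # The object id is percent-encoded so it is safe as a URL path segment.
    return (
        f"https://{parts.netloc}/ga4gh/drs/v1/objects/"
        f"{quote(object_id, safe='')}"
    )


# Hypothetical host and object id, used only for illustration.
print(drs_uri_to_url("drs://drs.example.org/314159"))
```

A platform that receives the manifest only needs these URIs; the file bytes are fetched from the access URLs the DRS server returns, subject to the user's authorization.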

To empower further discoverability of analyzed results, we used an Appyter specifically designed to run the downstream analyses highlighted in the Cancer Cell paper, which compares neuroblastoma data with GTEx data to support clinical translation of potential cancer-specific targets. Importantly, the Appyter interoperates with and connects to the CAVATICA workspace where data were processed, harmonized, and structured. We tested whether this new Appyter-based analysis, run on a larger cohort of neuroblastoma, recapitulates the original findings of the 2017 Cancer Cell paper. Indeed, many of the genes identified initially were identified again by the Appyter. Additional genes were also identified, including PHOX2A and PHOX2B, which were not in the list from the original paper.
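The general idea of screening for genes that are highly expressed in tumor relative to normal tissue can be sketched with a simple log2 fold-change ranking. This is not the Appyter's actual method, which is not described here; the function names, the pseudocount, the threshold, and the toy expression values with placeholder gene names are all assumptions made for illustration and are not results.

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def rank_candidates(tumor_expr, normal_expr, min_lfc=1.0):
    """Return (gene, log2FC) pairs at or above min_lfc, highest first."""
    hits = []
    for gene, tumor_values in tumor_expr.items():
        if gene not in normal_expr:
            continue  # only genes measured in both cohorts are comparable
        lfc = log2_fold_change(tumor_values, normal_expr[gene])
        if lfc >= min_lfc:
            hits.append((gene, lfc))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy values with placeholder gene names, not real expression data.
tumor = {"GENE_A": [31.0, 31.0], "GENE_B": [3.0, 3.0], "GENE_C": [7.0, 7.0]}
normal = {"GENE_A": [1.0, 1.0], "GENE_B": [3.0, 3.0], "GENE_C": [1.0, 1.0]}

ranked = rank_candidates(tumor, normal)
print(ranked)  # [('GENE_A', 4.0), ('GENE_C', 2.0)]
```

A comparison like this is only meaningful when both cohorts were processed with the same pipeline, which is why the harmonization step matters.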

Future Directions: While genomic data interoperability is well supported, integration of clinical/phenotypic data and associated metadata still faces limitations. To advance these efforts, CFDE is supporting the implementation of FHIR-based standards and the capacity to import clinical phenotypic data directly into a cloud workspace alongside genomic data. Importantly, such efforts are paired with a RAS/DRS implementation focusing on UDN data access and interoperability via dbGaP. As with the GTEx work assessing harmonization requirements and downstream analyses, UDN data are included in an ongoing RNAseq pipeline comparison and implementation workstream.

Platforms involved: Common Fund Data Ecosystem, Seven Bridges CAVATICA

Datasets involved: GTEx (controlled access) and Kids First RNA sequencing (open access) data sets

Which repositor(ies): Kids First Data Resource, AnVIL

Which search portal: Common Fund Data Ecosystem

Which compute platform(s): CAVATICA

Which workflows/tools/apps/code: Tumor Gene Target Screener appyter

Scientific lead/s and platform leads:

Platform contact: Michelle Mattioni, Surya Saha, Robert Carter

Researcher contact: Adam Resnick, Eric Wenger

Goals:

Define cohort in CFDE portal with GTEx and Kids First RNA sequencing data sets (Discoverability). GTEx and KF source data are controlled datasets.
Generate manifest file with DRS URIs in NCPI minimal metadata format and export to disk (Data access)
Import data for cohort via CFDE DRS server using the DRS manifest importer on CAVATICA (Interoperability)
Using RAS (eRA Commons) credentials (authentication/authorization)
Using DRS for cloud data access
NCPI manifest for populating metadata for files in CAVATICA

Presentation:

CAVATICA documentation: https://docs.cavatica.org/docs/import-from-a-drs-server
Slides: https://prezi.com/view/UBPyyMeP9cj1R3YUE6dT/
Video: https://www.youtube.com/watch?v=Z1hybV-V6ck
Poster: https://figshare.com/articles/poster/Manifest-based_DRS_import_A_practical_solution_for_cross-DCC_dataset_analysis_to_empower_translational_discovery_using_Kids_First_and_GTEx_data/21263148

ericwenger-pm commented 1 year ago

@linikujp we would like to be able to bring this research use case to completion, as I believe we have submitted all required details and documentation. Essentially this work was implemented and completed as documented.

Is there a need to translate this issue into a one pager to officially close this out?

ericwenger-pm commented 1 year ago

Based on our team's consultation with Dean Jackman and confirmation to him by Laura Biven (NIH program lead for NCPI), we will be following the direction to move the use case to completion.

Additional Details: Per Dean, Laura brought the use case to the attention of the recent weekly Steering Committee (SC), where this was discussed. Use case details were forwarded to relevant SC members by Laura, who confirmed to Dean the use case can be moved to completion.

Great collaboration in working with Dean Jackman and Rashonda Lewis and the NCPI ACC team, and with @linikujp Asiyah in initial setup.

NIH-NCPI / NCPI_use_case_tracker

UC: Cross-datasets discoverability and use within CFDE and a cloud-workspace environments #30