cancerDHC / tools

A repository for the work of the Tools workstream for CCDH

Identify minimum metadata currently used by CRDC nodes #16

Closed gaurav closed 3 years ago

gaurav commented 3 years ago

This was requested as part of our conversation with the Cancer Data Services (CDS) on Sep 25, 2020. There are two approaches we should take here:

Once we are done with this process:

llchristopherson commented 3 years ago

This is what I was trying to think of...

A lot of work has already been done on metadata for research data. We should encourage CDS to use it rather than reinventing the wheel. This would provide minimum metadata for the project and the dataset. (And I'm betting I will find that all the minimum metadata the nodes use is already available in this schema.) I still think data-element metadata should be harvested. For that, I have a contact we could speak with.

What I was thinking of: see DataCite: https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf https://schema.datacite.org/meta/kernel-4.3/ https://datacite.org/

Laura

gaurav commented 3 years ago

Melissa Cook, Wendy Ver Hoef and Smita Hastak have been working on documenting the current state of data submissions across the nodes, and are currently working on their deliverable. We should probably wait until they're done, and then attempt to identify the minimum metadata from across the nodes. We can then talk to the Data Model Harmonization team to add that minimum metadata to the CRDC-H.
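As a rough illustration of what "identifying the minimum metadata from across the nodes" could look like once each node's field list is available, here is a minimal Python sketch. The field names are hypothetical placeholders, not taken from any node's actual data dictionary, and `minimum_metadata` is a name made up for this example. It treats the minimum metadata as the fields every node has in common, which assumes field names have already been normalized across nodes.

```python
def minimum_metadata(node_fields):
    """Return the sorted field names shared by every node.

    node_fields maps a node name to an iterable of its field names.
    Field names are assumed to be normalized already (same spelling
    and case for the same concept across nodes).
    """
    if not node_fields:
        return []
    field_sets = [set(fields) for fields in node_fields.values()]
    # Intersect all per-node sets to keep only the fields common to all.
    common = set.intersection(*field_sets)
    return sorted(common)


# Hypothetical field lists, for illustration only.
example = {
    "GDC": ["case_id", "primary_site", "gender", "file_name"],
    "PDC": ["case_id", "primary_site", "gender", "instrument"],
    "ICDC": ["case_id", "primary_site", "breed"],
}

print(minimum_metadata(example))  # ['case_id', 'primary_site']
```

A strict intersection is only one possible definition. In practice the nodes are likely to name the same concept differently, so a mapping step would be needed before this comparison is meaningful.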

If we do want to collect the data submission protocols, here are some of them:

llchristopherson commented 3 years ago

This is the data dictionary for PDC (better than the page above because it lists out all the items in the metadata scheme): https://proteomic.datacommons.cancer.gov/data-dictionary/dictionary.html

https://proteomic.datacommons.cancer.gov/data-dictionary/

llchristopherson commented 3 years ago

This is the data dictionary for GDC (again better): https://docs.gdc.cancer.gov/Data_Dictionary/viewer/

llchristopherson commented 3 years ago

ICDC: https://caninecommons.cancer.gov/#/model https://github.com/CBIIT/icdc-model-tool SVG: https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model-tool.svg

llchristopherson commented 3 years ago

HTAN: Metadata and Clinical Data Standards

Multiple Working Groups made up of HTAN members, NCI staff, and participants in related consortia are in the process of developing metadata and clinical data standards for HTAN that will capture the full depth and complexity of atlas datasets and will facilitate dataset discovery and integration. Rather than develop new standards, HTAN Working Groups are aiming to adopt and, as needed, adapt and expand upon the efforts of related initiatives, such as the Human Cell Atlas and the Genomic Data Commons. The resulting HTAN standards will ensure that atlas datasets are extensively and consistently annotated regarding sample and disease characteristics as well as methodological approaches. At the same time, HTAN standards will ensure that these datasets are discoverable and accessible for scientific discovery within the larger cancer data ecosystem. All HTAN standards will be released to the public as they become available.

llchristopherson commented 3 years ago

IDC is using DICOM: https://docs.google.com/presentation/d/14i2yZa-0Xvv6N15a6Ko_ounx2g3SktBdMT8poFaW880/edit#slide=id.g62ee9d0e13_0_33 They say there are gaps. And BRIDG: https://bridgmodel.nci.nih.gov/ I can't find anything about how they are combining all this, and they do say that they are relying on some help from CCDH. So it is possible they haven't fully figured out what "their" data model will be.

All data hosted by IDC will be available publicly. Initial content of IDC will be populated using the radiology collections from The Cancer Imaging Archive (TCIA). In the subsequent stages IDC will be expanded to offer digital pathology images, and multispectral data from the Human Tumor Atlas Network (HTAN). IDC will accept data de-identified by TCIA or other Data Coordinating Centers.

IDC will provide access to the data standardized using the Digital Imaging and Communication in Medicine (DICOM) standard. IDC will work with projects generating the data to harmonize alternative formats into DICOM representation. Its content will include not only images, but also image annotations and analysis results, and will be linked using common identifiers to the other types of cancer data, such as proteomics and genomics datasets. Access to the data will be supported using standard interfaces. Analysis tools suggested for use on cloud systems will eventually be containerized and published in central repositories similar to other data coordinating centers. Given the IDC role as an imaging data coordinating center, there will be a major focus on establishing best practices for imaging research. In this regard, one of the goals of IDC is in preparing and adapting commonly used tools for image analysis to be run on cloud environments with the IDC hosted datasets. Summarized derived data from analyses previously run will be associated with imaging data on IDC for ease of use by the research community.

This resource is expected to launch in 2020.

gaurav commented 3 years ago

The CRDC Data Standards Team recently produced a draft report on the metadata used across CRDC nodes. Given that, I don't think this task is required any more. Closing.