bmir-radx / radx-project

This repo serves as the primary location for tracking issues that don't quite fit into our other dedicated repositories.

Advanced analysis of data set characteristics #61

Open marcosmro opened 5 months ago

marcosmro commented 5 months ago

This task corresponds to task XA4.1 in the SOM Mod13.

SOM title: "XA4.1 Perform advanced analysis of data set characteristics (e.g., data types, data formats) in Data Hub."

SOM description: "Using semantic classifications and detailed interviews with data providers, develop a model to understand what data sets RADx has, and what features are needed to work with those data sets."

Due date: By end of 2 calendar months after 3.0 soft launch. Assuming that the soft launch will happen on 2/21/24, the due date would be 4/21/24.

Additional details here.

jkyu commented 5 months ago

We got a metadata dump of the Data Hub from BAH that is current as of June 2023. We may need a more recent dump to complete this task (although this seems unlikely, for the reason below).

The metadata we have now is not conducive to accomplishing this task. The metadata contains minimal information, usually including only the title of the study, the PHS ID, the PI, and the organization responsible for the study. I tried to use the data dictionaries to find hints for categorizing studies, but the lack of harmonization outside of the boilerplate Tier-1 data elements makes a programmatic approach challenging.

To make progress, I am looking up each study by PHS ID on dbGaP, reading the study descriptions, and annotating each study with a vocabulary that I am making up.

jkyu commented 4 months ago

I put the ongoing work in this directory:

The "analysis" so far just tabulates studies by the code labels that the Data Hub currently uses for search filters. This will answer simple questions about the contents of the Data Hub. I would think that "Study Topic" is the most important filter. This was a topic of discussion in the DHP/NIH office hours yesterday -- curation needs to consider the study description to label the studies correctly.

jkyu commented 2 months ago

We decided to focus on meeting the deliverable as soon as possible by prioritizing a basic version of the reporter tool. It aggregates statistics on coded terms over searchable facets in the Data Hub.
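
As a rough illustration of what "aggregating statistics on coded terms over searchable facets" could look like, here is a minimal Python sketch. The record layout, the facet names other than "Study Topic", the term values, and the function name `count_terms_by_facet` are all made up for the example; this is not the reporter's actual code.

```python
from collections import Counter, defaultdict

# Hypothetical study records: each maps a facet name to the coded terms
# assigned to that study. The real Data Hub metadata may be shaped differently.
studies = [
    {"Study Topic": ["Testing", "Vaccination"], "Study Design": ["Cohort"]},
    {"Study Topic": ["Testing"], "Study Design": ["Cross-Sectional"]},
    {"Study Topic": ["Vaccination"], "Study Design": ["Cohort"]},
]


def count_terms_by_facet(records):
    """Return {facet: Counter({term: number of studies carrying that term})}."""
    counts = defaultdict(Counter)
    for record in records:
        for facet, terms in record.items():
            # set() so a term repeated within one study is counted only once
            counts[facet].update(set(terms))
    return counts


facet_counts = count_terms_by_facet(studies)
for facet, counter in facet_counts.items():
    for term, n in counter.most_common():
        print(f"{facet}\t{term}\t{n}")
```

Counting studies per term (rather than raw term occurrences) keeps the numbers comparable to what a faceted search would return for the same filter.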

An advanced tool that leverages semantic information (namely, data element ontologies) will come in the future to satisfy the "advanced" part of this task.

The repository can be found here.

During a breakout session at the DHP in-person meeting, we discussed integration approaches with BAH. We decided on a static report as tabular data. This is possibly just a download of the report spreadsheet. In the future, we would want some sort of dynamic reporting that involves data visualization and allows clicking through to a Data Hub search on the relevant facets.

jkyu commented 1 month ago

Outstanding items:

jkyu commented 1 month ago

I integrated the hierarchical statistics aggregation into the basic reporter. Next, we want to map the facets used by the Data Hub to terms in the global codebook ontology.
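
For readers unfamiliar with the idea, hierarchical aggregation can be sketched roughly as follows: each study counts toward the terms it is labeled with and toward every ancestor of those terms, but only once per term. The hierarchy, the term names, and the function names here are hypothetical, and the sketch assumes each term has a single parent, which may not hold for the real ontology.

```python
from collections import Counter

# Hypothetical child -> parent links; a single parent per term is assumed.
parent_of = {
    "Antigen Testing": "Testing",
    "PCR Testing": "Testing",
    "Testing": "Study Topic",
    "Vaccination": "Study Topic",
}


def ancestors(term, parent_of):
    """Yield every ancestor of `term`, walking up toward the root."""
    seen = set()  # guards against accidental cycles in the parent map
    while term in parent_of and term not in seen:
        seen.add(term)
        term = parent_of[term]
        yield term


def rollup_counts(study_terms, parent_of):
    """Count studies per term, crediting each study to all ancestor terms."""
    counts = Counter()
    for terms in study_terms:
        covered = set(terms)
        for term in terms:
            covered.update(ancestors(term, parent_of))
        # each study contributes at most 1 to any term, even via several children
        counts.update(covered)
    return counts


# Hypothetical labels for three studies
study_terms = [
    ["Antigen Testing"],
    ["PCR Testing", "Antigen Testing"],
    ["Vaccination"],
]
rolled = rollup_counts(study_terms, parent_of)
```

Collecting the covered terms per study before counting avoids double counting a study that carries two sibling terms under the same parent.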

jkyu commented 1 month ago

I'm working on an ontology that includes labels describing the contents of the Data Hub. The ontology can be found here. There is a lot of room for improvement, but the current form is enough to implement logic for reporting using the context of the ontology. Attached is a spreadsheet of the report in its current state.

This uses the LDTab tool developed by Christian to convert an ontology from RDF/XML to TSV (to be read in by the reporter).
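
A minimal sketch of how a reporter might read such a TSV, using only the Python standard library. The column names (`subject`, `predicate`, `object`), the `ex:` identifiers, and the function name `load_hierarchy` are assumptions for illustration; check them against the actual export before relying on this. Rows whose object is not a plain term identifier would need extra handling.

```python
import csv
import io

# Hypothetical sample of an exported TSV; the column layout is an assumption.
SAMPLE_TSV = (
    "subject\tpredicate\tobject\n"
    "ex:AntigenTesting\trdfs:subClassOf\tex:Testing\n"
    "ex:AntigenTesting\trdfs:label\tAntigen Testing\n"
    "ex:Testing\trdfs:label\tTesting\n"
)


def load_hierarchy(tsv_file):
    """Return (parents, labels) read from a tab-separated triple table."""
    parents = {}  # term id -> set of parent term ids
    labels = {}   # term id -> human-readable label
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if row["predicate"] == "rdfs:subClassOf":
            parents.setdefault(row["subject"], set()).add(row["object"])
        elif row["predicate"] == "rdfs:label":
            labels[row["subject"]] = row["object"]
    return parents, labels


parents, labels = load_hierarchy(io.StringIO(SAMPLE_TSV))
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.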

The "Original Facet" column notes whether the term is provided externally (e.g., from dbGaP), which makes it a searchable facet in the Data Hub. This column has a false value if the term is introduced by the ontology (in which case it is probably a term borrowed from another ontology, such as MeSH). When we have an improved ontology, we can probably add a column that provides the identifier of each term.
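
To make the column concrete, here is a small sketch of how the flag could be derived when writing report rows. Only the "Original Facet" column name comes from the report; the other column names, the counts, the set of externally provided terms, and the function name are hypothetical.

```python
import csv
import io

# Hypothetical inputs: per-term study counts and the set of terms that are
# supplied externally (e.g., from dbGaP) and so are searchable facets.
term_counts = {"Testing": 2, "Antigen Testing": 2, "Vaccination": 1}
external_terms = {"Antigen Testing", "Vaccination"}


def build_report_rows(counts, external):
    """One row per term; 'Original Facet' is False for ontology-only terms."""
    rows = []
    for term in sorted(counts):
        rows.append(
            {
                "Term": term,
                "Study Count": counts[term],
                "Original Facet": term in external,
            }
        )
    return rows


rows = build_report_rows(term_counts, external_terms)

# Write the rows as tab-separated text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Term", "Study Count", "Original Facet"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
report_tsv = buffer.getvalue()
```

A term identifier column, once available, would just be one more key in each row and one more entry in `fieldnames`.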

xa41_report.xlsx

jkyu commented 1 month ago

Revisions to the report:

Attached here: xa41_report.xlsx

jkyu commented 1 month ago

Feedback from DHP is that the report cannot be published if it features inconsistencies with the study explorer. To accommodate this, we revised the report to:

The reporter tool will now output two versions of the report, one that includes the expanded labeling effort and use of the ontology (previous version) and a new version that does not. The reports have been pushed to the repo here.

The technical report for the deliverable has been updated to reflect this.

There is still value in an ontology that describes study metadata, but further development of RADSAV should be part of another task. RADSAV should be built from scratch, since the current vocabulary list is too restrictive to build an ontology with meaningful structure. This ontology should be developed with purpose and with an agreement from the DHP to use the ontology to rework the study metadata. Short of this, development of this ontology will not be a productive use of time.