Build a workflow for integrating example IDC data against standardized vocabularies

gaurav commented 4 years ago

The IDC has several datasets imported from The Cancer Imaging Archive (TCIA) that include clinical data alongside image data. These data are stored in tabular formats (CSV, XLSX) with non-standardized column names that do not make clear how values should be interpreted. We would like to identify a workflow that can convert this data into datasets containing standardized column names that clearly indicate the meaning of values within that column.

IDC have identified three example datasets to start with, all publicly accessible from TCIA:

Clinical and survival data from non-small cell lung cancer (NSCLC) patients.
Clinical and survival data from patients with stage 2 or 3 breast cancer receiving neoadjuvant chemotherapy (NACT).
~~A study of Low Grade Gliomas.~~

Similar harmonization work was done in the Fedorov et al. PeerJ preprint, which is based on the Lung Image Database Consortium image collection (LIDC-IDRI).

Our goals are:

[x] To develop a rapid, semi-automated workflow for standardizing these three example datasets, with the goal of using them on other datasets in the future.
[ ] To choose a semantically-rich representation for the harmonized datasets that promotes reuse.
[x] To report on how far this process can be automated or semi-automated with software tools, and to identify any additional tools that could be developed to speed this process along.

There are three tools I know of that might be useful here:

CEDAR provides the ability to build templates or forms for metadata collection and validation. It does not appear to include metadata mapping tools, but it might be possible to do this externally. The CEDAR metadata instance format might be used to publish the final harmonized datasets.
Ptolemy is a closed-source tool that is designed for this exact use-case: standardizing values in datasets to valuesets in caDSR.
OpenRefine is an open-source tool for cleaning and reconciling tabular data. D2Refine includes some OpenRefine plugins for reconciling data against CTS-2 repositories.

fedorov commented 4 years ago

Adding David Clunie (@dclunie) from the IDC team.

fedorov commented 4 years ago

A study of Low Grade Gliomas (also used in the Fedorov et al PeerJ preprint).

Correction to the above - the preprint referenced is for a different study (this one: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041807/). But I think based on the discussion on the call, we can drop the glioma dataset, and focus on the remaining two.

gaurav commented 4 years ago

Hi @fedorov! I've been using the caDSR and the Ptolemy metadata mapping tool to work on the first IDC use case (NSCLC Radiomics), and I have two things to show you:

I started by mapping all the columns and their enumerated values. I mapped the values to caDSR Common Data Elements (CDEs), linked in the "caDSR Element Mapping" column. For example, I mapped clinical.T.Stage to CDE 3745011, v1.0. If you click through to the CDE browser and click on the "Value Domain" tab, you will find the eleven permissible values (PVs) for the clinical T stage. It was then pretty straightforward to map those PVs to the values in the NSCLC-Radiomics CSV file. Where I ran into issues doing the mapping, I put "???" in the "caDSR Label" field. PVs usually include a detailed description of the meaning of the term, and often are also associated with concepts in the NCI Thesaurus. For example, clinical T stage T1 has been mapped to the concept NCIt C88868.
I then wrote a program to map from the enumerated values to the caDSR PVs, which I then turned into a Google spreadsheet. This spreadsheet includes all the values from the original NSCLC-Radiomics spreadsheet, but with additional columns showing the PVs each value was mapped to, with a link to the NCI Thesaurus entity corresponding to that PV.
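A minimal sketch of this kind of value-mapping step, assuming a hand-built mapping table. It is not the actual program: only the CDE identifier (3745011, v1.0) and the T1 to NCIt C88868 pairing come from this thread. The raw value `"1"`, the patient IDs, the output column suffixes and the `harmonize_rows` helper are all hypothetical.

```python
import csv
import io

# Hypothetical mapping table: column name -> CDE plus raw value -> (PV, NCIt code).
# Only the CDE ID and the T1 -> C88868 pairing are taken from the thread;
# the raw value "1" is an assumed encoding used for illustration.
MAPPINGS = {
    "clinical.T.Stage": {
        "cde": "3745011v1.0",
        "values": {"1": ("T1", "C88868")},
    },
}

UNMAPPED = "???"  # same marker used in the mapping spreadsheet


def harmonize_rows(rows, mappings):
    """Yield each input row with extra columns for the CDE, PV and NCIt code."""
    for row in rows:
        out = dict(row)
        for column, mapping in mappings.items():
            if column not in row:
                continue
            raw = row[column].strip()
            pv, ncit = mapping["values"].get(raw, (UNMAPPED, ""))
            out[f"{column}_cde"] = mapping["cde"]
            out[f"{column}_pv"] = pv
            out[f"{column}_ncit"] = ncit
        yield out


# Made-up sample data; "9" has no mapping and is flagged with "???".
sample = io.StringIO("PatientID,clinical.T.Stage\nP001,1\nP002,9\n")
harmonized = list(harmonize_rows(csv.DictReader(sample), MAPPINGS))
for row in harmonized:
    print(row)
```

Keeping the original columns and appending the mapped ones means the harmonized file can always be checked against the source values.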

Was something like this what you were thinking of when you were thinking of harmonized datasets? Is there some additional information you would like in either of these spreadsheets that would help you? Let me know what you think!

fedorov commented 4 years ago

Thank you, let me look into this in detail!

fedorov commented 4 years ago

@gaurav thank you, this is super helpful, and exactly the kind of help I was looking for!

I have a couple of questions (may have more later!):

What is the process you used to map values to the specific data elements?
How did you decide between various options for a given data element? For example, if I search for "male gender" I get 137 hits in the CDE Browser. Of course, some of those hits are clearly not suitable, but not all of them - for example, how did you decide to choose CDE 62, v6.0 and not CDE 2453607, v1.0?
For the items marked with "???", did you investigate whether this is because the CDE value set is incomplete, or the value included in the source spreadsheet is invalid? Do you have any suggestions for how to resolve these conflicts?
How did you map individual CDE values to the NCIt codes?
It is NCIt codes that we would use to encode the harmonized values, correct?

gaurav commented 4 years ago

Hi @fedorov! Thanks so much for your questions, and please feel free to send me more!

I used a combination of a software tool called Ptolemy.V (that is currently available to a small group of people for pilot testing) and a small software tool I wrote myself. If this workflow is required, we can discuss with NCI (who is currently funding Ptolemy.V development) whether to make this tool available for your needs or if we will need to develop something ourselves.
I chose the CDEs to use in two ways: (1) Ptolemy.V has a feature that ranks possible CDEs for a column by comparing the column's name and values against each CDE's name and list of permissible values, and (2) I looked for a "good enough" match on the CDE website. I tried to find CDEs which were as specific as possible (e.g. while many CDEs were available for Clinical T Stage, I found that 3745011v1 was specific to lung cancer) and with as close a match from permissible values to values in the column as possible (e.g. while both CDE 62 and CDE 2453607 have gender mappings, the second has an "unknown" value that we don't need for this dataset). We're still discussing the best way to do this for CRDC-H, but I think we will either have our recommended permissible values for each field, or will ask that fields use permissible values from particular CDEs or ontologies.
I think we'll need experts to figure out the best way to match items marked "???" -- it may be that a particular permissible value already has the intended meaning of the value in the dataset, or that the CDE needs to be updated to include the new value, or that a new term is needed to represent this meaning.
CDE entries already include information on the NCI Thesaurus that each permissible value maps to. For example, if you look up CDE 3745011 and click on the "Value Domain" tab, you will find a list of permissible values, their "Meaning Concept Codes" (NCIt codes) and descriptions.
I think we can use any clearly defined identifiers to encode harmonized values. NCIt codes are convenient, because not only are they clearly defined, but also because they can be mapped to terms in other ontologies via the NCI Metathesaurus: for example, NCIt C4917 is mapped to NCImt C0149925, which maps this concept to concepts from a number of different sources, including SNOMED, which might be useful in understanding what this value means.
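The selection criteria described above (name match plus closeness of permissible values) can be sketched roughly as follows. This is an illustrative assumption, not how Ptolemy.V actually ranks CDEs: the equal weighting, the Jaccard overlap, the `rank_cdes` helper and the two placeholder candidates (`cde-A`, `cde-B`) are all made up for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation/whitespace so 'T.Stage' ~ 't stage'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def name_similarity(column_name, cde_name):
    """String similarity in [0, 1] between a column name and a CDE name."""
    return SequenceMatcher(None, normalize(column_name), normalize(cde_name)).ratio()


def value_overlap(column_values, permissible_values):
    """Jaccard overlap: penalizes both missing PVs and PVs the column never uses."""
    col = {normalize(v) for v in column_values}
    pvs = {normalize(v) for v in permissible_values}
    if not col or not pvs:
        return 0.0
    return len(col & pvs) / len(col | pvs)


def rank_cdes(column_name, column_values, candidates, name_weight=0.5):
    """Return (score, cde_id) pairs sorted from best to worst match."""
    scored = []
    for cde in candidates:
        score = (name_weight * name_similarity(column_name, cde["name"])
                 + (1 - name_weight) * value_overlap(column_values, cde["pvs"]))
        scored.append((score, cde["id"]))
    return sorted(scored, reverse=True)


# Placeholder candidates, not real caDSR entries.
candidates = [
    {"id": "cde-A", "name": "Gender", "pvs": ["Male", "Female"]},
    {"id": "cde-B", "name": "Gender", "pvs": ["Male", "Female", "Unknown"]},
]
ranking = rank_cdes("gender", ["male", "female"], candidates)
for score, cde_id in ranking:
    print(f"{cde_id}: {score:.3f}")
```

With an overlap measure like this, a candidate whose value set contains an extra value the column never uses scores lower than one whose value set matches exactly, which mirrors the reasoning in the gender example.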

Hope that helps! Let me know if any of that was unclear or if you'd like to discuss this via videochat.

fedorov commented 3 years ago

Thank you for the clarification @gaurav!

Given your explanation that the tool you used is not available to the general public, what is the path forward - should we work with you on each of the datasets that we need to harmonize?

I can try to reach out to the group that submitted the dataset to clarify the "???" entries.

Thank you for the clarification about NCIt - I did see the codes in the CDE, but did not realize those are NCIt codes.

Is there a recommendation or vision from CCDH on how the harmonized clinical data entities should be stored by the individual nodes, what kind of representation/container/format should be used to keep the results of harmonization?

gaurav commented 3 years ago

> Thank you for the clarification @gaurav!
>
> Given your explanation that the tool you used is not available to the general public, what is the path forward - should we work with you on each of the datasets that we need to harmonize?

Sorry for not replying sooner to your comment, Andrey -- I've been chatting with the Ptolemy developers to see if we can get you direct access to it to try to harmonize datasets yourself going forward. Would that be useful to you? Do you have a list of datasets you'd like to harmonize next?

> I can try to reach out to the group that submitted the dataset to clarify the "???" entries.

Yay, thanks for that! I'm not sure what the best way to validate mappings would be going forward. It might make sense to ask original authors to confirm that we've mapped all their columns correctly, but that might also be too much work. Maybe we could have node representatives who can check that their datasets have been harmonized correctly?

> Is there a recommendation or vision from CCDH on how the harmonized clinical data entities should be stored by the individual nodes, what kind of representation/container/format should be used to keep the results of harmonization?

Not yet, but I'm working on that (#15)! I think CEDAR instances or PFB will probably turn out to be the best format. I'm going to try converting the example harmonized datasets I've generated in this issue into those two formats and see how they turn out.

gaurav commented 3 years ago

Note that the Ptolemy developers already have some experience in working on TCIA datasets: I believe the Clinical Data XLSX file in the QIN-BREAST-02 dataset was harmonized using Ptolemy. Even if that is not the case, it's still a nice potential format for recording the CDEs against which each column has been harmonized.

gaurav commented 3 years ago

I've finished harmonizing the other dataset in this task (ISPY1), which consists of two subsets. I've written descriptions of the columns in both subsets in the columns from TCIA spreadsheet, and the harmonized datasets are available at ISPY1 patient clinical subset and ISPY1 TCIA outcomes subset. I've also added all of this information to the csv2caDSR GitHub repository.

The next step on this task is to write some sort of automated tests based on these examples for csv2caDSR and figure out with IDC and the Ptolemy.V developers what the next step should be for IDC's data harmonization needs. Once that's done, I'll close this issue.
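As a rough idea of what example-based tests could look like, here is a small sketch using Python's `unittest`. The `harmonize_value` function is a hypothetical stand-in written for this example and is not the csv2caDSR API; the raw value `"1"` is likewise an assumed encoding.

```python
import unittest


def harmonize_value(value, value_map, unmapped="???"):
    """Hypothetical stand-in for the step under test (not the csv2caDSR API)."""
    return value_map.get(value.strip(), unmapped)


class HarmonizeValueTest(unittest.TestCase):
    # Assumed raw encoding; a real test would load the worked examples instead.
    VALUE_MAP = {"1": "T1"}

    def test_known_value_is_mapped(self):
        self.assertEqual(harmonize_value("1", self.VALUE_MAP), "T1")

    def test_surrounding_whitespace_is_ignored(self):
        self.assertEqual(harmonize_value(" 1 ", self.VALUE_MAP), "T1")

    def test_unknown_value_is_flagged(self):
        self.assertEqual(harmonize_value("9", self.VALUE_MAP), "???")


# Run the suite explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(HarmonizeValueTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In practice the expected values would come from the harmonized example spreadsheets, so that any change in the mapping output shows up as a failing test.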

gaurav commented 3 years ago

I've created an issue for automated tests at https://github.com/gaurav/csv2caDSR/issues/5, and organized a meeting with IDC and the Ptolemy.V developers to discuss the next steps on October 2, 2020. Since this level of harmonization appears to meet IDC's current needs, I'll go ahead and close this issue.

Note that we will continue investigating formats to store the harmonized data in #15.

cancerDHC / tools

Build a workflow for integrating example IDC data against standardized vocabularies #4