cancerDHC / tools

A repository for the work of the Tools workstream for CCDH

Build a workflow for integrating example IDC data against standardized vocabularies #4

Closed: gaurav closed this issue 3 years ago

gaurav commented 4 years ago

The IDC has several datasets imported from The Cancer Imaging Archive (TCIA) that include clinical data alongside image data. These data are stored in tabular formats (CSV, XLSX) with non-standardized column names that do not make clear how values should be interpreted. We would like to identify a workflow that can convert these data into datasets with standardized column names that clearly indicate the meaning of the values within each column.
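To make the goal concrete, here is a minimal sketch of the kind of conversion described above, using only the Python standard library. Everything specific in it is an assumption made for illustration: the source column names, the standardized names, the `EXAMPLE-CDE-*` identifiers, and the value mappings are placeholders, not real caDSR entries and not the output of any tool mentioned in this thread.

```python
import csv
import io

# Hypothetical mapping from source column names to standardized names.
# The identifiers and permissible values below are placeholders only.
COLUMN_MAP = {
    "gender": {
        "name": "sex",
        "cde_id": "EXAMPLE-CDE-1",
        "values": {"m": "Male", "f": "Female"},
    },
    "age": {
        "name": "age_at_diagnosis_years",
        "cde_id": "EXAMPLE-CDE-2",
        "values": None,  # free-valued column: values are passed through
    },
}


def harmonize(csv_text, column_map):
    """Rename mapped columns and translate their values; keep the rest as-is."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        out = {}
        for source_col, value in row.items():
            spec = column_map.get(source_col)
            if spec is None:
                # Unmapped columns keep their original name in this sketch.
                out[source_col] = value
                continue
            values = spec["values"]
            if values is not None and value is not None:
                # Fall back to the raw value when no mapping is known.
                value = values.get(value.strip().lower(), value)
            out[spec["name"]] = value
        rows.append(out)
    return rows


example = "PatientID,gender,age\nP1,M,61\nP2,f,70\n"
harmonized = harmonize(example, COLUMN_MAP)
```

Keeping the mapping as plain data, separate from the code that applies it, means the same table could also serve as a record of which identifier each column was harmonized against.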

IDC have identified three example datasets to start with, all publicly accessible from TCIA:

Similar harmonization work was done in Fedorov et al PeerJ preprint, which is based on a Lung Image Database Consortium image collection (LIDC-IDRI).

Our goals are:

There are three tools I know of that might be useful here:

fedorov commented 4 years ago

Adding David Clunie (@dclunie) from the IDC team.

fedorov commented 4 years ago

> A study of Low Grade Gliomas (also used in the Fedorov et al PeerJ preprint).

Correction to the above - the preprint referenced is for a different study (this one: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041807/). But I think based on the discussion on the call, we can drop the glioma dataset, and focus on the remaining two.

gaurav commented 4 years ago

Hi @fedorov! I've been using the caDSR and the Ptolemy metadata mapping tool to work on the first IDC use case (NSCLC Radiomics), and I have two things to show you:

Was something like this what you were thinking of when you were thinking of harmonized datasets? Is there some additional information you would like in either of these spreadsheets that would help you? Let me know what you think!

fedorov commented 4 years ago

Thank you, let me look into this in detail!

fedorov commented 4 years ago

@gaurav thank you, this is super helpful, and exactly the kind of help I was looking for!

I have a couple of questions (may have more later!):

gaurav commented 4 years ago

Hi @fedorov! Thanks so much for your questions, and please feel free to send me more!

Hope that helps! Let me know if any of that was unclear or if you'd like to discuss this via videochat.

fedorov commented 3 years ago

Thank you for the clarification @gaurav!

Given your explanation that the tool you used is not available to the general public, what is the path forward? Should we work with you on each of the datasets that we need to harmonize?

I can try to reach out to the group that submitted the dataset to clarify the "???" entries.

Thank you for the clarification about NCIt - I did see the codes in the CDE, but did not realize those are NCIt codes.

Is there a recommendation or vision from CCDH on how the harmonized clinical data entities should be stored by the individual nodes, what kind of representation/container/format should be used to keep the results of harmonization?

gaurav commented 3 years ago

> Thank you for the clarification @gaurav!
>
> Given your explanation that the tool you used is not available to the general public, what is the path forward? Should we work with you on each of the datasets that we need to harmonize?

Sorry for not replying sooner to your comment, Andrey -- I've been chatting with the Ptolemy developers to see if we can get you direct access to it to try to harmonize datasets yourself going forward. Would that be useful to you? Do you have a list of datasets you'd like to harmonize next?

> I can try to reach out to the group that submitted the dataset to clarify the "???" entries.

Yay, thanks for that! I'm not sure what the best way to validate mappings would be going forward. It might make sense to ask the original authors to confirm that we've mapped all their columns correctly, but that might also be too much work. Maybe we could have node representatives who can check that their datasets have been harmonized correctly?
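One way to reduce the review burden could be to generate a short report that tells a reviewer which columns actually need attention. The sketch below is only an illustration under stated assumptions: the column names and `EXAMPLE-CDE-*` identifiers are made up, `review_mapping` is a hypothetical helper rather than part of Ptolemy or csv2caDSR, and it assumes that "???" marks a column whose meaning still needs to be clarified with the submitters.

```python
def review_mapping(source_columns, column_to_cde, ambiguous_marker="???"):
    """Sort source columns into mapped, unmapped, and needs-clarification."""
    report = {"mapped": [], "unmapped": [], "needs_clarification": []}
    for col in source_columns:
        cde = column_to_cde.get(col)
        if cde is None:
            # No mapping has been recorded for this column at all.
            report["unmapped"].append(col)
        elif ambiguous_marker in cde:
            # A mapping exists but was flagged as uncertain.
            report["needs_clarification"].append(col)
        else:
            report["mapped"].append(col)
    return report


# Hypothetical columns and identifiers, for illustration only.
columns = ["PatientID", "gender", "Histology", "Unknown.col"]
mapping = {
    "PatientID": "EXAMPLE-CDE-1",
    "gender": "EXAMPLE-CDE-2",
    "Histology": "???",
}
report = review_mapping(columns, mapping)
```

A reviewer would then only need to look at the `unmapped` and `needs_clarification` lists, rather than re-checking every column.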

> Is there a recommendation or vision from CCDH on how the harmonized clinical data entities should be stored by the individual nodes, what kind of representation/container/format should be used to keep the results of harmonization?

Not yet, but I'm working on that (#15)! I think CEDAR instances or PFB will probably turn out to be the best format. I'm going to try converting the example harmonized datasets I've generated in this issue into those two formats and see how they turn out.

gaurav commented 3 years ago

Note that the Ptolemy developers already have some experience in working on TCIA datasets: I believe the Clinical Data XLSX file in the QIN-BREAST-02 dataset was harmonized using Ptolemy. Even if that is not the case, it's still a nice potential format for recording the CDEs against which each column has been harmonized.

gaurav commented 3 years ago

I've finished harmonizing the other dataset in this task (ISPY1), which contained two datasets. I've written descriptions of the columns in both in the "columns from TCIA" spreadsheet, and the harmonized datasets are available as "ISPY1 patient clinical subset" and "ISPY1 TCIA outcomes subset". I've also added all of this information to the csv2caDSR GitHub repository.

The next step on this task is to write some sort of automated tests based on these examples for csv2caDSR and figure out with IDC and the Ptolemy.V developers what the next step should be for IDC's data harmonization needs. Once that's done, I'll close this issue.

gaurav commented 3 years ago

I've created an issue for automated tests at https://github.com/gaurav/csv2caDSR/issues/5, and organized a meeting with IDC and the Ptolemy.V developers to discuss the next steps on October 2, 2020. Since this level of harmonization appears to meet IDC's current needs, I'll go ahead and close this issue.

Note that we will continue investigating formats to store the harmonized data in #15.