Provide user with the selection of the analysis results collections under the Search Scope tab.

s-paquette commented 3 years ago

Broken out from #590

A few things we need to consider:

What is the value we're displaying for selection? The DOI, the name of the results collection, something else...?
How do we plan to keep this list of values up to date? Right now, every attribute used in filtering has a complete list of values, but this would imply a subset of values (analysis results DOIs, as opposed to ALL DOIs).
Counts on the derived tab only pertain to derived data, but a filter pertains to all of IDC. How does selecting an analysis result set (by selecting a DOI or the name, or however we plan to filter it) impact the overall cohort as it pertains to the original collection this may stem from?

As analysis results are effectively series-level, this will have to be a series-applicable attribute filter, so we'll need to support series level filtering in order for this to proceed.

fedorov commented 3 years ago

Here are the short names for the analysis results collections we currently include, as coordinated with TCIA (see this issue https://help.cancerimagingarchive.net/servicedesk/customer/portal/1/TH-47291):

PROSTATEx Zone Segmentations - PROSTATEx-Seg-Zones
High Resolution Prostate Segmentations for the ProstateX-Challenge - PROSTATEx-Seg-HiRes
DICOM SR of clinical data and measurement for breast cancer collections to TCIA - DICOM-SR-Breast-Clinical
Standardized representation of the TCIA LIDC-IDRI annotations using DICOM - DICOM-LIDC-IDRI-Nodules
RIDER Lung CT Segmentation Labels from: Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach - RIDER-LungCT-Seg
QIN multi-site collection of Lung CT data with Nodule Segmentations - QIN-LungCT-Seg

fedorov commented 3 years ago

Proposed idea: have analysis results listed and selectable under a panel that is sibling to the Collections panel in "Search scope".

fedorov commented 3 years ago

Per suggestion from @G-White-ISB, I prepared a DataStudio dashboard trying to demonstrate one idea of how this would work.

First, I expanded the instance-level metadata to include the short name of the collection identified by each instance's DOI:

-- Tag every DOI with its collection ID and type (analysis vs. original)
WITH
  with_collection_type AS (
  SELECT
    ID,
    DOI,
    "analysis" AS collection_type
  FROM
    `canceridc-data.idc_current.analysis_results_metadata`
  UNION ALL
  SELECT
    tcia_api_collection_id,
    DOI,
    "original" AS collection_type
  FROM
    `canceridc-data.idc_current.original_collections_metadata` )
-- Attach the collection ID and type to each instance via its source DOI
SELECT
  dicom_all.*,
  with_collection_type.ID AS doi_collection_id,
  with_collection_type.collection_type
FROM
  `canceridc-data.idc_current.dicom_all` AS dicom_all
JOIN
  with_collection_type
ON
  dicom_all.source_DOI = with_collection_type.DOI

Then I updated IDC cohort dashboard template here to: 1) show only the names of the collections corresponding to the non-analysis DOIs 2) add a selector that shows names of the analysis collections

It kind of works, except that the collections selector does not populate correctly when an analysis results collection is selected (the table underneath appears to be populated correctly). I have not been able to figure out why that is happening.

(screenshot: 2021-10-14_17-05-27)

fedorov commented 3 years ago

Thinking more about this, the behavior of the dashboard above is not how I would think the IDC webapp should behave, and I don't know immediately how to mimic that expected behavior in DataStudio.

Let's say we have a filter group that allows the user to select from the list of available analysis results, which in turn are identified by the DOIs in the analysis_results_collections tables, and the source_DOI in dicom_all. When the user selects a given analysis results collection, the filter should result in the counts reflecting the number of cases that have items from the selected analysis results collection(s). Does this make sense @G-White-ISB ?
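To make the expected counting behavior concrete, here is a minimal sketch, not the webapp's actual implementation. It assumes each instance row carries a `PatientID` and the `source_DOI` from dicom_all; the function name and the DOI values are hypothetical placeholders.

```python
from typing import Iterable, Mapping


def count_cases(rows: Iterable[Mapping[str, str]], selected_dois: Iterable[str]) -> int:
    """Count distinct cases with at least one instance from the selected analysis results.

    `rows` stands in for dicom_all records; only PatientID and source_DOI are used.
    """
    selected = set(selected_dois)
    # A case counts once, no matter how many of its instances match.
    return len({row["PatientID"] for row in rows if row["source_DOI"] in selected})


# Placeholder data: the DOIs below are made up for illustration only.
rows = [
    {"PatientID": "case-1", "source_DOI": "doi:original-X"},
    {"PatientID": "case-1", "source_DOI": "doi:analysis-A"},
    {"PatientID": "case-2", "source_DOI": "doi:analysis-A"},
    {"PatientID": "case-2", "source_DOI": "doi:analysis-A"},
    {"PatientID": "case-3", "source_DOI": "doi:original-Y"},
]
print(count_cases(rows, ["doi:analysis-A"]))
```

With this reading, a case from any original collection is counted as soon as one of its series belongs to a selected analysis results collection.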

G-White-ISB commented 3 years ago

So, as I understand it, you would like a separate 'analysis collections' list that is independently selectable, with the calculated stats reflecting both the analysis and image collections selected. @s-paquette would know more about how difficult the stats calculations then become. Right now it seems we have a 1-1 map between image and an analysis collections. If you select an image collection, why would you not want to select the analysis collection if it is present?

fedorov commented 3 years ago

Right now it seems we have a 1-1 map between image and an analysis collections.

No, that's not the case. We already have analysis collections that span cases from multiple original collections. And I am pretty sure we have analysis collections that cover a subset of cases from original collection(s).

G-White-ISB commented 3 years ago

Sorry, I did not check these analysis collections closely enough. I think we need an analysis collection field in the derived data solr node, and then it would be straightforward to handle it in the UI.

fedorov commented 3 years ago

Related to #222

fedorov commented 2 years ago

Per @s-paquette, we will need all-lowercase, underscore-separated versions of the IDs for the analysis results. In addition to this:

we will need something like this:

@bcli4d
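One possible way to derive the lowercase, underscore-style IDs from the short names listed earlier in this thread is sketched below. The function name and the exact normalization rule are assumptions for illustration, not the convention actually adopted.

```python
import re


def to_lowercase_id(short_name: str) -> str:
    """Lowercase the short name and collapse every non-alphanumeric run into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", short_name.lower()).strip("_")


for name in ("PROSTATEx-Seg-Zones", "DICOM-LIDC-IDRI-Nodules", "QIN-LungCT-Seg"):
    print(name, "->", to_lowercase_id(name))
```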

fedorov commented 2 years ago

We do not have the description. It is not available via the TCIA API. We agreed to use the long title instead of manually scraping a longer description from the wiki pages, and to include the DOI URL of the collection wiki page in the tooltip shown over the analysis collections in the portal UI.

fedorov commented 2 years ago

One issue we need to decide is whether we should use DOIs or something else to tag collections.

fedorov commented 2 years ago

@bcli4d can you please clarify the process for assigning DOIs to instances?

fedorov commented 2 years ago

from @bcli4d:

All the code is in the etl_flow repo: utilities/get_collections_dois.py: get_internal_series_ids() returns a list of IDs of all series in a collection or a patient in a collection. These are internal (to NBIA I guess) IDs; they are not, e.g. SeriesInstanceUIDs. The third_party param controls whether the series returned are in the original data collection (third_party = False), or in some analysis result (third_party=True).

utilities/get_collections_dois.py:get_data_collection_doi() "drills down" to convert those IDs to SeriesInstanceUIDs. Similarly get_analysis_collection_dois().

bcli4d commented 2 years ago

To clarify a bit more... At the start of ingesting a collection, I get the original collection DOI, and a list of (third party SeriesInstanceUID, third party DOI). Then when I add a series, if it is in the list of third party series, I use the associated third party DOI. If it is not in the third party series list, then it must be from the original collection, and I assign the original collection DOI. This, of course, assumes that there is a single DOI per original collection, but that is implicit in TCIA/NBIA.
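The assignment rule described above can be sketched as follows. This is an illustration only and not the code from the etl_flow repo: the function name, parameter names, and sample values are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def assign_series_doi(
    series_instance_uid: str,
    original_doi: str,
    third_party_series: Iterable[Tuple[str, str]],
) -> str:
    """Return the third party DOI if the series is listed, else the original collection DOI."""
    # Map each third party SeriesInstanceUID to its DOI.
    third_party_dois: Dict[str, str] = dict(third_party_series)
    return third_party_dois.get(series_instance_uid, original_doi)


# Placeholder values, made up for illustration.
third_party = [("1.2.3.100", "doi:analysis-A"), ("1.2.3.101", "doi:analysis-B")]
print(assign_series_doi("1.2.3.100", "doi:original-X", third_party))
print(assign_series_doi("1.2.3.999", "doi:original-X", third_party))
```

The fallback to the original collection DOI is what relies on there being a single DOI per original collection.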

fedorov commented 2 years ago

Relevant email communication from TCIA's Kirk Smith re one DOI per SOPInstanceUID assumption, from Feb 28, 2022:

Hi Andrey,

I’ll be giving more thought to this in the coming week, but wanted to reply as I have been out of office the last week.

“The key assumption for us right now is that every DICOM instance can correspond to one and only one entry in either original or analysis collection.”

I believe that statement to be true for Data within NBIA as I think there is a one to one relationship of DOI to a DICOM instance.

In practice that is our goal and it should mostly be true, however, I know there is at least one early case that could lead to confusion. The collection is HNSCC. We first received HNSCC data from a submitting site and called it a collection. The data had a corresponding manuscript. Later on another group from the same site used data from the same trial, had some overlapping subjects (no overlapping series) and had its own related manuscript.

At that point we decided we needed a parent collection that contained all data and two separate Analysis Results collections that contained the related data for each manuscript.

So the HNSCC Collection itself has a DOI and a DOI Landing Page with access to download all of the data. Each of the two Analysis Results pages have their own DOI and the download for them has download access to the DICOM for the portion of the parent collection related to the manuscript.

Our current policy on Analysis Results would not have allowed this, instead the original images would have been the parent collection and only related segmentations etc would have been part of the Analysis Results.

For HNSCC the DICOM images stored in NBIA only have the DOI of the Parent Collection and not of the Analysis Results.

I don’t know if there may be other anomalies, but in general your statement is correct and going forward will be correct per our current policy on Analysis Results.

Adding Justin and Scott Gustafson to the thread.

Thanks,

Kirk

fedorov commented 2 years ago

Per discussion today, we should proceed with the implementation and not wait for TCIA revisions to the data model.

s-paquette commented 2 years ago

Tooltips are now added to the analysis results section. They're not displaying quite right due to the attribute being under Original and not Search Scope; once it moves, they should display properly.

For now, the tooltip is simply the title of the Analysis Results. Anything more complex would need to be added in via the description field of the BigQuery table.

G-White-ISB commented 2 years ago

I see that in the dev portal there is a script with an id analysis_results_tooltips, similar to the collections_tooltips script, but it's empty. Running locally I don't see analysis_results_tooltips in the explorer page context

fedorov commented 2 years ago

Discussed and decided to add description column for analysis results collections, which will include DOI URL. I will create those and pass to @bcli4d

s-paquette commented 2 years ago

@G-White-ISB Sorry there, missed a few steps. You can now download a new database seed from idc-dev-files and pull from Common master, then refresh the database. That'll get you the new column plus the current analysis results tooltips. (Dev will be done building in a few minutes.)

fedorov commented 2 years ago

Analysis collections descriptions passed to @bcli4d here (3rd column of the table): https://docs.google.com/document/d/1JF1UmvMgvEUutmpXz_UAlpbPdFyEen3lEkBR_-CwXtc/edit?usp=sharing.

TCIA has been informed via this ticket: https://help.cancerimagingarchive.net/servicedesk/customer/portal/1/TH-49649.

bcli4d commented 2 years ago

I converted Andrey's doc to this spreadsheet. It is an easier form to load into BQ, PSQL, etc.

fedorov commented 2 years ago

Thanks Bill, I agree. I used text document for the sake of editing convenience.

pgundluru commented 2 years ago

Tooltips are now included for each analysis results collection.

ImagingDataCommons / IDC-WebApp

Provide user with the selection of the analysis results collections under the Search Scope tab. #593