Closed by s-paquette 2 years ago
Here are the short names for the analysis results collections we currently include, as coordinated with TCIA (see this issue https://help.cancerimagingarchive.net/servicedesk/customer/portal/1/TH-47291):
Proposed idea: have analysis results listed and selectable under a panel that is sibling to the Collections panel in "Search scope".
Per suggestion from @G-White-ISB, I prepared a DataStudio dashboard trying to demonstrate one idea of how this would work.
First, I expanded the instance-level metadata to include the short name of the collection that each instance's DOI corresponds to:
WITH
  with_collection_type AS (
    SELECT
      ID,
      DOI,
      "analysis" AS collection_type
    FROM
      `canceridc-data.idc_current.analysis_results_metadata`
    UNION ALL
    SELECT
      tcia_api_collection_id,
      DOI,
      "original" AS collection_type
    FROM
      `canceridc-data.idc_current.original_collections_metadata` )
SELECT
  dicom_all.*,
  with_collection_type.ID AS doi_collection_id,
  with_collection_type.collection_type
FROM
  `canceridc-data.idc_current.dicom_all` AS dicom_all
JOIN
  with_collection_type
ON
  dicom_all.source_DOI = with_collection_type.DOI
Then I updated the IDC cohort dashboard template here to: 1) show only the names of the collections corresponding to the non-analysis DOIs, and 2) add a selector that shows the names of the analysis collections.
It kind of works, except that the collections selector does not populate correctly when an analysis results collection is selected (the table underneath appears to be populated correctly). I have not been able to figure out why that is happening.
Thinking more about this, the behavior of the dashboard above is not how I would think the IDC webapp should behave, and I don't know immediately how to mimic that expected behavior in DataStudio.
Let's say we have a filter group that allows the user to select from the list of available analysis results, which in turn are identified by the DOIs in the analysis_results_collections tables, and the source_DOI in dicom_all. When the user selects a given analysis results collection, the filter should result in the counts reflecting the number of cases that have items from the selected analysis results collection(s). Does this make sense @G-White-ISB ?
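To make the expected counting behavior concrete, here is a minimal Python sketch. It assumes the data is available as (PatientID, source_DOI) pairs, roughly as they could be pulled from dicom_all; the function name, the case IDs, and the DOIs are all hypothetical placeholders, not actual IDC values.

```python
def count_cases(rows, selected_dois):
    """Count distinct cases that have at least one item whose source DOI
    belongs to the selected analysis results collection(s).

    rows: iterable of (patient_id, source_doi) pairs (assumed shape).
    selected_dois: DOIs of the selected analysis results collections.
    """
    selected = set(selected_dois)
    return len({patient_id for patient_id, source_doi in rows
                if source_doi in selected})


# Hypothetical sample data; the DOIs are placeholders.
rows = [
    ("case-1", "10.0000/original-x"),
    ("case-1", "10.0000/analysis-a"),
    ("case-2", "10.0000/analysis-a"),
    ("case-2", "10.0000/analysis-a"),
    ("case-3", "10.0000/original-x"),
]
print(count_cases(rows, ["10.0000/analysis-a"]))  # 2
```

Note that a case is counted once no matter how many of its items come from the selected analysis results, which is the behavior I would expect from the webapp counts.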
So, as I understand it, you would like to have a separate 'analysis collections' list that is separately selectable, and the stats calculated would reflect both the analysis and image collections selected. @s-paquette would know more about how difficult the stats calculations then become. Right now it seems we have a 1-1 map between image and analysis collections. If you select an image collection, why would you not want to select the analysis collection if it is present?
Right now it seems we have a 1-1 map between image and analysis collections.
No, that's not the case. We already have analysis collections that span cases from multiple original collections. And I am pretty sure we have analysis collections that cover a subset of cases from original collection(s).
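One way to check this, sketched in Python: group the (original collection, analysis collection) pairs observed across cases and list the analysis collections that touch more than one original collection. The input shape, the function name, and the IDs below are assumptions for illustration only.

```python
from collections import defaultdict


def analysis_spanning_multiple_originals(pairs):
    """Return the analysis collection IDs that include cases from more than
    one original collection, sorted alphabetically.

    pairs: iterable of (original_collection_id, analysis_collection_id),
    one pair per case that has analysis results (assumed shape).
    """
    originals_by_analysis = defaultdict(set)
    for original_id, analysis_id in pairs:
        originals_by_analysis[analysis_id].add(original_id)
    return sorted(analysis_id
                  for analysis_id, originals in originals_by_analysis.items()
                  if len(originals) > 1)


# Hypothetical IDs, not actual IDC collection names.
pairs = [
    ("original_x", "analysis_a"),
    ("original_y", "analysis_a"),
    ("original_x", "analysis_b"),
]
print(analysis_spanning_multiple_originals(pairs))  # ['analysis_a']
```

Any non-empty result means the mapping is not 1-1.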
Sorry, I did not check these analysis collections closely enough. I think we need an analysis collection field in the derived data Solr node, and then it would be straightforward to handle it in the UI.
Related to #222
Per @s-paquette, we will need all-lowercase, underscore-separated versions of the IDs for the analysis results. In addition to this:
we will need something like this:
@bcli4d
We do not have the description. It is not available via the TCIA API. We agreed to use the long title instead of manually scraping a longer description from the wiki pages, and also to include the DOI URL to the collection wiki page in the tooltip over the analysis collections in the portal UI.
One issue we need to decide is whether we should use DOIs or something else to tag collections.
@bcli4d can you please clarify the process for assigning DOIs to instances?
from @bcli4d:
All the code is in the etl_flow repo: utilities/get_collections_dois.py: get_internal_series_ids() returns a list of IDs of all series in a collection or a patient in a collection. These are internal (to NBIA I guess) IDs; they are not, e.g. SeriesInstanceUIDs. The third_party param controls whether the series returned are in the original data collection (third_party = False), or in some analysis result (third_party=True).
utilities/get_collections_dois.py:get_data_collection_doi() "drills down" to convert those IDs to SeriesInstanceUIDs. Similarly get_analysis_collection_dois().
To clarify a bit more... At the start of ingesting a collection, I get the original collection DOI, and a list of (third party SeriesInstanceUID, third party DOI). Then when I add a series, if it is in the list of third party series, I use the associated third party DOI. If it is not in the third party series list, then it must be from the original collection, and I assign the original collection DOI. This, of course, assumes that there is a single DOI per original collection, but that is implicit in TCIA/NBIA.
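The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the actual etl_flow code; the function name, argument names, series UIDs, and DOIs are all hypothetical.

```python
def assign_series_dois(series_uids, original_doi, third_party_series):
    """Assign one DOI to each series in a collection being ingested.

    series_uids: SeriesInstanceUIDs of all series in the collection.
    original_doi: the single DOI of the original collection.
    third_party_series: iterable of (SeriesInstanceUID, third party DOI)
        pairs for series that belong to some analysis result.
    """
    third_party_doi = dict(third_party_series)
    # A series listed as third party gets its analysis result DOI;
    # every other series is assumed to belong to the original collection.
    return {uid: third_party_doi.get(uid, original_doi) for uid in series_uids}


# Hypothetical UIDs and DOIs for illustration.
assigned = assign_series_dois(
    series_uids=["1.2.3.1", "1.2.3.2", "1.2.3.3"],
    original_doi="10.0000/original-x",
    third_party_series=[("1.2.3.2", "10.0000/analysis-a")],
)
print(assigned["1.2.3.2"])  # 10.0000/analysis-a
print(assigned["1.2.3.1"])  # 10.0000/original-x
```

The sketch makes the key assumption explicit: each series ends up with exactly one DOI, and the fallback only works if the original collection has a single DOI.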
Relevant email communication from TCIA's Kirk Smith re one DOI per SOPInstanceUID assumption, from Feb 28, 2022:
Hi Andrey,
I’ll be giving more thought to this in the coming week, but wanted to reply as I have been out of office the last week.
“The key assumption for us right now is that every DICOM instance can correspond to one and only one entry in either original or analysis collection.”
I believe that statement to be true for Data within NBIA as I think there is a one to one relationship of DOI to a DICOM instance.
In practice that is our goal and it should mostly be true, however, I know there is at least one early case that could lead to confusion. The collection is HNSCC. We first received HNSCC data from a submitting site and called it a collection. The data had a corresponding manuscript. Later on another group from the same site used data from the same trial, had some overlapping subjects (no overlapping series) and had its own related manuscript.
At that point we decided we needed a parent collection that contained all data and two separate Analysis Results collections that contained the related data for each manuscript.
So the HNSCC Collection itself has a DOI and a DOI Landing Page with access to download all of the data. Each of the two Analysis Results pages have their own DOI and the download for them has download access to the DICOM for the portion of the parent collection related to the manuscript.
Our current policy on Analysis Results would not have allowed this, instead the original images would have been the parent collection and only related segmentations etc would have been part of the Analysis Results.
For HNSCC the DICOM images stored in NBIA only have the DOI of the Parent Collection and not of the Analysis Results.
I don’t know if there may be other anomalies, but in general your statement is correct and going forward will be correct per our current policy on Analysis Results.
Adding Justin and Scott Gustafson to the thread.
Thanks,
Kirk
Per discussion today, we should proceed with the implementation and not wait for TCIA revisions to the data model.
Tooltips are now added to the analysis results section. They're not displaying quite right due to the attribute being under Original and not Search Scope; once it moves, they should display properly.
For now, the tooltip is simply the title of the Analysis Results. Anything more complex would need to be added in via the description field of the BigQuery table.
I see that in the dev portal there is a script with the id analysis_results_tooltips, similar to the collections_tooltips script, but it's empty. Running locally, I don't see analysis_results_tooltips in the explorer page context.
We discussed this and decided to add a description column for analysis results collections, which will include the DOI URL. I will create those and pass them to @bcli4d.
@G-White-ISB Sorry there, missed a few steps. You can now download a new database see from idc-dev-files and pull from Common master, then refresh the database. That'll get you the new column plus the current analysis results tooltips. (Dev will be done building in a few minutes.)
Analysis collections descriptions passed to @bcli4d here (3rd column of the table): https://docs.google.com/document/d/1JF1UmvMgvEUutmpXz_UAlpbPdFyEen3lEkBR_-CwXtc/edit?usp=sharing.
TCIA has been informed via this ticket: https://help.cancerimagingarchive.net/servicedesk/customer/portal/1/TH-49649.
I converted Andrey's doc to this spreadsheet. It is an easier form to load into BQ, PSQL, etc.
Thanks Bill, I agree. I used text document for the sake of editing convenience.
Tooltips are included for each analysis result.
Broken out from #590
A few things we need to consider:
As analysis results are effectively series-level, this will have to be a series-applicable attribute filter, so we'll need to support series-level filtering in order for this to proceed.