Closed lassoan closed 1 year ago
@lassoan thank you for the contribution! Is this urgent? If yes, I don't have objections to merge it right away - I most definitely trust you with any code modifications related to Slicer, but if not - I would need a bit more time to review.
Do you have example data sets that I could use to test the examineFilesMultiseries, examineFilesIPPAcqTime, and examineFilesIPPInstanceNumber strategies with "allFiles" (all series merged)?
I will see if I can search and if I can find any such data in IDC. All the data that I can share has been shared on TCIA and is now in IDC, and we might be able to query specific series that have the properties we need.
It is not clear for me why the MultiVolumeImporter and MultiVolumeExporter modules are hosted in separate repositories. What do you think about moving them into the Slicer core?
I don't recall the reasons why they are in different repositories. Maybe because one was initiated before the other, and I didn't think about renaming, or maybe Slicer infrastructure back in the days could not accommodate DICOM plugins and scripted modules in the same repo? The history of this repo dates back to 2012!
I do not have any objection merging this into Slicer core and deprecating these separate repos. If this happens, I would prefer for the commit history to be preserved, but I am not sure this would be possible.
It is not urgent, so it is no problem to wait for your review. However, I've asked for test data, as I think as we add more and more nuanced logic to support new data sets, it is becoming increasingly difficult to ensure that there are no regressions just via code reviews.
Probably your time would be better spent on identifying a few typical data sets and the known tricky data sets and I could set up some automatic tests to ensure that they are all still loaded correctly.
I'll look into porting these modules into Slicer with preserving the history. I've added this ticket to track this: https://github.com/Slicer/Slicer/issues/5935
Here's the first pass on a query that identifies all series that have more than one instance per ImagePositionPatient
(converted to string). There are 272929 series that match this query! So need to add more restrictions to select those that match specific strategies, and test how loading works in Slicer. WIP...
WITH
series_counts AS (
SELECT
STRING_AGG(DISTINCT(collection_id),",") AS collection_id,
STRING_AGG(DISTINCT(Modality),",") AS Modality,
COUNT(DISTINCT(SOPInstanceUID)) AS instance_count,
COUNT(DISTINCT(ARRAY_TO_STRING(ImagePositionPatient,"/"))) AS position_count,
STRING_AGG(DISTINCT(SeriesDescription),",") AS SeriesDescription,
FROM
`bigquery-public-data.idc_current.dicom_all`
GROUP BY
SeriesInstanceUID)
SELECT
*,
instance_count/position_count AS instances_per_position
FROM
series_counts
where position_count > 1
ORDER BY
instances_per_position DESC
Another case has come up that improvements in this PR fix. So, if you don't have any objections, it would be great if it could be integrated.
@lassoan I merged the PR. I am sorry, I completely forgot about this issue. So I revisited the query, and now I select one series that has more than one instance per ImagePositionPatient
, and take a sample of one series for each combination of collection, modality and body part examined.
WITH
series_counts AS (
SELECT
SeriesInstanceUID,
ANY_VALUE(collection_id) AS collection_id,
ANY_VALUE(Modality) AS Modality,
ANY_VALUE(BodyPartExamined) AS BodyPartExamined,
ANY_VALUE(CONCAT(Manufacturer," ",ManufacturerModelName)) AS manufacturerWithModelName,
COUNT(DISTINCT(SOPInstanceUID)) AS instance_count,
COUNT(DISTINCT(ARRAY_TO_STRING(ImagePositionPatient,"/"))) AS position_count,
STRING_AGG(DISTINCT(SeriesDescription),",") AS SeriesDescription,
FROM
`bigquery-public-data.idc_current.dicom_all`
GROUP BY
SeriesInstanceUID
HAVING
position_count>1
AND instance_count/position_count>1),
combinations_numbered AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY collection_id, Modality, BodyPartExamined) AS rn
FROM
series_counts)
SELECT
*
FROM
combinations_numbered
WHERE
rn = 1
Here's the initial rows of the result (complete list is attached):
bquxjob_4d53151_189b7d571dc.csv
If you are interested to use those samples for testing, let me know and I can give pointers how to download the underlying files. I can also refine the selection, if you tell me what criteria you would like to use, and I can give instructions how you can play with those queries to define selection yourself.
examineFilesIPPInstanceNumber strategy is now attempted for each series, not just the "allFiles" merged data set, because this allows loading of some Philips 4D cardiac cine-MRI.
Also made a few small fixes:
@fedorov Two questions: