ImagingDataCommons / idc-index-data

Python package providing the index to query and download data hosted by the NCI Imaging Data Commons
MIT License

Add pathology index #15

Open fedorov opened 5 months ago

fedorov commented 5 months ago

As discussed, we should add a separate index to support pathology applications.

Anticipating potential future similar application/modality-specific indices, maybe we should call it idc_sm_index for SM modality.

@vkt1414 will add the notebook with the initial exploration of this.

We should keep in mind that we are bound by the package limits of PyPI, which are 100 MB per upload and 10 GB for the overall project size. This probably means we should host the SM index outside of the package and download it on demand at runtime. It is not perfect, but at the moment I do not see a better option.
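A rough sketch of the download-on-demand idea, assuming the SM index is published as a parquet asset attached to a GitHub release; the URL, tag, and file names below are placeholders, not a final layout:

from pathlib import Path
import urllib.request

# Placeholder release-asset URL; the actual hosting location is still to be decided.
SM_INDEX_URL = "https://github.com/ImagingDataCommons/idc-index-data/releases/download/<tag>/idc_sm_index.parquet"

def get_sm_index(cache_dir: Path = Path.home() / ".idc-index-data") -> Path:
    """Download the SM index once and reuse the cached copy afterwards."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / "idc_sm_index.parquet"
    if not local_path.exists():
        urllib.request.urlretrieve(SM_INDEX_URL, local_path)
    return local_path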

vkt1414 commented 5 months ago

@fedorov here you go! The nested pathology index (independent of idc-index) is just ~11 MB, as opposed to the ~880 MB flattened pathology index (dependent on the regular idc-index), both with zstd level 22 compression thanks to @jcfr. The flattened index enjoys sub-second query performance, while queries against the nested index take about 45 s to 1 min, which is reasonable.

https://colab.research.google.com/drive/1NTC0uc7_58wj9496VvRB-varYmfCz5QD?usp=sharing
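For reference, a minimal sketch of writing a dataframe with the zstd level 22 compression mentioned above, assuming pyarrow as the parquet engine; the columns are illustrative only, not the actual index schema:

import pandas as pd

# Illustrative frame; in practice this is the BigQuery query result.
df = pd.DataFrame({"SeriesInstanceUID": ["1.2.3.4"], "PixelSpacing": [0.00025]})

# pandas forwards compression_level to pyarrow's parquet writer.
df.to_parquet(
    "idc_pathology_index_nested.parquet",
    engine="pyarrow",
    compression="zstd",
    compression_level=22,
)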

@DanielaSchacherer the latest version of the pathology index includes crdc_instance_uuid, so it is possible to get instance-level URLs now. I included FrameOfReferenceUID as well, which does not seem to be unique within a series, so it is also stored at the instance level. Give it a try whenever you can. As soon as v18 comes out, I should be able to replicate your queries from the PW40 notebook.

vkt1414 commented 4 months ago

After looking around a bit to see how optional packages are handled, as done in dask (https://github.com/dask/dask/blob/main/pyproject.toml#L47-L84), my understanding is that we will either need another PyPI package for the pathology index or we need to package them all together (the regular index and any indices in the future).

JC's comments also suggest my understanding is correct: https://github.com/ImagingDataCommons/idc-index-data/pull/5#discussion_r1527621138

In the future, if these "index-data" archives have a different release pace, you could even consider having wheels like idc-index-data-slide-microscopy or idc-index-data[slide-microscopy]

vkt1414 commented 4 months ago

I tried packaging the 11.2 MB nested pathology index, but the runner's 16 GB RAM is not sufficient to write the query results to a dataframe and then to parquet.

https://github.com/ImagingDataCommons/idc-index-data/actions/runs/8730112883

The fully flattened table did not work on Colab with 12.7 GB RAM either.
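For context, the packaging step described above roughly follows this pattern (a simplified sketch, assuming google-cloud-bigquery and pandas; not the exact CI code): the full result set is materialized as an in-memory dataframe before being written to parquet, which is what exhausts the runner's RAM.

from google.cloud import bigquery

client = bigquery.Client()
sql = open("scripts/sql/pathology_index.sql").read()
df = client.query(sql).to_dataframe()  # the whole result set is held in RAM here
df.to_parquet("idc_pathology_index.parquet", compression="zstd")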

I can think of two solutions:

  1. We can use GitHub's larger runners, but they are not free:

     https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners
     https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#per-minute-rates

  2. We can use Jetstream2 VMs as self-hosted GitHub Actions runners whenever we are updating the indices:

     https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners

fedorov commented 4 months ago

@vkt1414 before reviewing all columns in detail, I would think that we need to drop all columns that are already present in the main index - they can be recovered with a join on SeriesInstanceUID.
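For illustration, a minimal sketch of such a join against the main index with duckdb; the parquet file names are placeholders, and the selected columns are just examples of series-level attributes that would live only in the main index:

import duckdb

query = """
SELECT p.*, i.PatientID, i.collection_id
FROM 'idc_pathology_index.parquet' AS p
JOIN 'idc_index.parquet' AS i USING (SeriesInstanceUID)
"""
df = duckdb.query(query).to_df()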

vkt1414 commented 4 months ago

If you take a look at the notebook, that's exactly what I do for the flattened table, just extending the index we have. But for the nested version, I was pulling all the columns as the final size was only 11 MB.

fedorov commented 4 months ago

What is the purpose of this query then, which contains all those columns: https://github.com/ImagingDataCommons/idc-index-data/blob/feat-add-pathology-index/scripts/sql/pathology_index.sql ?

I am missing the basics here: I don't understand why I should look at the notebook, since the branch and idc-index-data should stand on their own.

vkt1414 commented 4 months ago

Sorry for the confusion. Let me try to clarify.

I tried packaging the 11.2MB nested pathology index

As I mentioned here, the nested index was only 11 MB including all the columns, i.e. even after duplicating the columns in the current index. I understand you want to save even more by excluding the existing columns, but I was minimizing the complexity of writing queries with the nested index (two subqueries vs three). It may not matter if I hide that complexity, so sure, I can exclude the existing columns.

I did not submit a PR or request a review of the branch. I was only referring to the GHA run I attempted with the nested index. It did not work, and after that I was curious about the flattened index and ran the query that EXCLUDES the columns already in the current index; as I mentioned above, it was crashing on Colab on a 12.7 GB RAM instance and needed a high-RAM instance. So I did not run the flattened-table version of the query on GHA.

I don't understand why I should look at the notebook

this issue was created mentioning that I would add the notebook, so I was referring to it for a more detailed explanation of the choices made while writing those queries:

@vkt1414 will add the notebook with the initial exploration of this.

fedorov commented 3 months ago

For reference, various options for the instance-level index are listed in https://github.com/vkt1414/idc-index/releases/tag/pathology-index.

DanielaSchacherer commented 2 months ago

Regarding the attributes that we discussed adding at the instance level, I summarize the queries and the result sizes here:

  1. pixel spacing
SELECT 
SOPInstanceUID, 
-- pixel_spacing: image pixel spacing in mm/px
CAST(SharedFunctionalGroupsSequence[SAFE_OFFSET(0)].
      PixelMeasuresSequence[SAFE_OFFSET(0)].
      PixelSpacing[SAFE_OFFSET(0)] AS FLOAT64) AS pixel_spacing,
FROM bigquery-public-data.idc_current.dicom_all
WHERE Modality='SM' 

Result size: 22.2 MB

  2. pixel spacing, image type
SELECT 
SOPInstanceUID, 
-- pixel_spacing: image pixel spacing in mm/px
CAST(SharedFunctionalGroupsSequence[SAFE_OFFSET(0)].
      PixelMeasuresSequence[SAFE_OFFSET(0)].
      PixelSpacing[SAFE_OFFSET(0)] AS FLOAT64) AS pixel_spacing,
ImageType[2] as image_type
FROM bigquery-public-data.idc_current.dicom_all
WHERE Modality='SM' 

Result size: 26.7 MB

  3. pixel spacing, image type, image compression
SELECT 
SOPInstanceUID, 
-- pixel_spacing: image pixel spacing in mm/px
CAST(SharedFunctionalGroupsSequence[SAFE_OFFSET(0)].
      PixelMeasuresSequence[SAFE_OFFSET(0)].
      PixelSpacing[SAFE_OFFSET(0)] AS FLOAT64) AS pixel_spacing,
ImageType[2] as image_type,
 -- compression: compression type
CASE TransferSyntaxUID
      WHEN '1.2.840.10008.1.2.4.50' THEN 'jpeg'
      WHEN '1.2.840.10008.1.2.4.91' THEN 'jpeg2000'
      ELSE 'other'
END AS compression,
FROM bigquery-public-data.idc_current.dicom_all
WHERE Modality='SM' 

Result size: 31 MB

vkt1414 commented 2 months ago

All of these requested attributes are in the nested pathology index I created. Unfortunately, no one wants to take a look at it. It is an open challenge to anyone who can come up with a more efficient way to retrieve instance-level info, unnesting in a foolproof way, writing a query in a more understandable way, and resulting in an index smaller than 12 MB. It's a failure on my part, trying to pitch this any number of times.

fedorov commented 2 months ago

@vkt1414 the earlier discussion was about creating a series-level index, which is now handled by the query we already include in the IDC BigQuery releases (sm_index + the modifications currently under discussion in #30).

The instance-level index is something we should add as a separate table, and is a relatively new discussion.

The queries that you shared need modifications. I have not had time to work on them, and every time I look at them I have more questions.

  1. Right now, if I understand correctly, path_nested_index_v2.parquet in https://github.com/vkt1414/idc-index/releases/tag/pathology-index is the final/recommended query, is this right? If yes, I am confused why it is called nested - what exactly is nested? It does not have aggregation, based on the query provided in the release notes.
  2. That query has one row per distinct combination of attributes. It is not instance-level. If you add SOPInstanceUID, it will have multiple rows for at least some instances.

With the code sequences, multiple of them per instance, we cannot avoid having array columns. You are welcome to rewrite your recommended query to have one row per instance, and we can start from there.

vkt1414 commented 2 months ago

If yes, I am confused why it is called nested - what exactly is nested? It does not have aggregation, based on the query provided in the release notes.

I looked at the v2 nested query and I apologize: I had pasted the flat query instead of the nested one. It should be the one below, which is already present in the Colab notebook you requested above.

SELECT
  collection_id,
  source_DOI,
  PatientID,
  PatientAge,
  PatientSex,
  StudyInstanceUID,
  StudyDate,
  StudyDescription,
  BodyPartExamined,
  SeriesInstanceUID,
  Modality,
  Manufacturer,
  ManufacturerModelName,
  SeriesDate,
  SeriesDescription,
  SeriesNumber,
  license_short_name,
  CONCAT(series_aws_url, '*') AS series_aws_url,
  SUM(ROUND((SAFE_CAST(instance_size AS FLOAT64))/1000000, 2)) AS series_size_MB,
  ARRAY_AGG(
  STRUCT(
  FrameOfReferenceUID,
  crdc_instance_uuid,
  ContainerIdentifier,
  pms.PixelSpacing[0] AS PixelSpacing,
  `Rows`,
  `Columns`,
  TotalPixelMatrixRows,
  TotalPixelMatrixColumns,
  it AS ImageType,
  TransferSyntaxUID,
  pass.CodeValue AS PrimaryAnatomicStructureSequence_CodeValue,
  pass.CodeMeaning AS PrimaryAnatomicStructureSequence_CodeMeaning,
  pass.CodingSchemeDesignator AS PrimaryAnatomicStructureSequence_CodingSchemeDesignator,
  pasms.CodeValue AS PrimaryAnatomicStructureModifierSequence_CodeValue,
  pasms.CodeMeaning AS PrimaryAnatomicStructureModifierSequence_CodeMeaning,
  pasms.CodingSchemeDesignator AS PrimaryAnatomicStructureModifierSequence_CodingSchemeDesignator,
  sds.SpecimenUID,
  spscis.ValueType AS SpecimenPreparationStepContentItemSequence_ValueType,
  spscis_cncs.CodeValue AS SpecimenPreparationStepContentItemSequence_ConceptNameCodeSequence_CodeValue,
  spscis_cncs.CodeMeaning AS SpecimenPreparationStepContentItemSequence_ConceptNameCodeSequence_CodeMeaning,
  spscis_cncs.CodingSchemeDesignator AS SpecimenPreparationStepContentItemSequence_ConceptNameCodeSequence_CodingSchemeDesignator,
  spscis_ccs.CodeValue AS SpecimenPreparationStepContentItemSequence_ConceptCodeSequence_CodeValue,
  spscis_ccs.CodeMeaning AS SpecimenPreparationStepContentItemSequence_ConceptCodeSequence_CodeMeaning,
  spscis_ccs.CodingSchemeDesignator AS SpecimenPreparationStepContentItemSequence_ConceptCodeSequence_CodingSchemeDesignator,
  ops.LightPathFilterPassThroughWavelength,
  ops.IlluminationWavelength,
  ops_itcs.CodeValue AS OpticalPathSequence_IlluminationTypeCodeSequence_CodeValue,
  ops_itcs.CodeMeaning AS OpticalPathSequence_IlluminationTypeCodeSequence_CodeMeaning,
  ops_itcs.CodingSchemeDesignator AS OpticalPathSequence_IlluminationTypeCodeSequence_CodingSchemeDesignator,
  ops_iccs.CodeValue AS OpticalPathSequence_IlluminationColorCodeSequence_CodeValue,
  ops_iccs.CodeMeaning AS OpticalPathSequence_IlluminationColorCodeSequence_CodeMeaning,
  ops_iccs.CodingSchemeDesignator AS OpticalPathSequence_IlluminationColorCodeSequence_CodingSchemeDesignator)) AS Attributes
FROM
  `bigquery-public-data.idc_current.dicom_all` idc
LEFT JOIN
  unnest (ImageType) AS it
LEFT JOIN
  unnest (idc.SharedFunctionalGroupsSequence) AS sfgs
LEFT JOIN
  unnest (sfgs.PixelMeasuresSequence) AS pms
LEFT JOIN
  unnest (SpecimenDescriptionSequence) AS sds
LEFT JOIN
  unnest (sds.PrimaryAnatomicStructureSequence) AS pass
LEFT JOIN
  unnest (pass.PrimaryAnatomicStructureModifierSequence) AS pasms
LEFT JOIN
  unnest (sds.SpecimenPreparationSequence) AS sps
LEFT JOIN
  unnest (sps.SpecimenPreparationStepContentItemSequence) AS spscis
LEFT JOIN
  unnest (spscis.ConceptNameCodeSequence) AS spscis_cncs
LEFT JOIN
  unnest (spscis.ConceptCodeSequence) AS spscis_ccs
LEFT JOIN
  unnest (OpticalPathSequence) AS ops
LEFT JOIN
  unnest (ops.IlluminationTypeCodeSequence) AS ops_itcs
LEFT JOIN
  unnest (ops.IlluminationColorCodeSequence) AS ops_iccs
WHERE
  Modality in ('SM')
GROUP BY
  collection_id,
  source_DOI,
  PatientID,
  PatientAge,
  PatientSex,
  StudyInstanceUID,
  StudyDate,
  StudyDescription,
  BodyPartExamined,
  SeriesInstanceUID,
  Modality,
  Manufacturer,
  ManufacturerModelName,
  SeriesDate,
  SeriesDescription,
  SeriesNumber,
  license_short_name,
  series_aws_url

That query has one row per distinct combination of attributes. It is not instance-level. If you add SOPInstanceUID, it will have multiple rows for at least some instances.

The query I pasted now (again, present in the notebook) has crdc_instance_uuid, as I noticed pathology people never seem to use SOPInstanceUID. Why crdc_instance_uuid? Because with it and the series URLs we can build the SOP instance URLs and save space.

With the code sequences, multiple of them per instance, we cannot avoid having array columns. You are welcome to rewrite your recommended query to have one row per instance, and we can start from there.

Once I flatten the table in path_nested_index_v2.parquet, there is no need for any arrays. The unnesting process will look like the screenshot attached in the original comment. And yes, there may be duplicate rows per SOPInstanceUID (or, in my case, crdc_instance_uuid); that is exactly what is expected when we flatten code sequences, right?

And I did take a look at the sm_index query, and as you found, LEFT JOIN UNNESTs are the only foolproof way to unnest complex, varying data where some fields can be NULL, which I have been warning about since I discovered them. But I'll paste again the reference about the pitfalls of CROSS JOIN UNNESTs.

https://archive.is/HfoZS
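To make the pitfall concrete, here is a minimal pair of queries (BigQuery SQL in Python strings, column names taken from the index query above): with CROSS JOIN, an instance whose OpticalPathSequence is NULL or empty disappears from the result, while LEFT JOIN keeps it with NULL optical-path fields.

cross_join_query = """
SELECT SOPInstanceUID, ops.IlluminationWavelength
FROM `bigquery-public-data.idc_current.dicom_all`
CROSS JOIN UNNEST(OpticalPathSequence) AS ops  -- drops instances with no optical path
WHERE Modality = 'SM'
"""

left_join_query = """
SELECT SOPInstanceUID, ops.IlluminationWavelength
FROM `bigquery-public-data.idc_current.dicom_all`
LEFT JOIN UNNEST(OpticalPathSequence) AS ops   -- keeps every instance
WHERE Modality = 'SM'
"""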

Now let me know how my query does not address everything pathology people need.

DanielaSchacherer commented 2 months ago

Hi Vamsi, sorry, I should have had another look at your queries. Regarding what I posted above: there was the question about the exact size of only these prioritized attributes, which is why I posted it here. It was not meant to replace your queries. My understanding was that we wanted to add the most important attributes step by step to avoid an explosion in size.

Another question @vkt1414: why do you think pathology people prefer crdc_instance_uuid over SOPInstanceUID? Just because of my and Andrés' notebook?

vkt1414 commented 2 months ago

No problem. I just feel you are all trying to solve a problem that I think was solved several months ago. The instance-level table in nested format is just 11.2 MB. I repeat, 11.2 MB! Including series-level attributes! Again, it is an open challenge from me to anyone to come up with a more efficient index that captures all attributes at the instance level. If only my notebook had been reviewed, we would have some of this functionality in idc-index by now. That is my concern.

Regarding crdc_instance_uuid vs SOPInstanceUID: I did not see your exploration make use of SOPInstanceUID. crdc_instance_uuid can be thought of as similar to SOPInstanceUID and is actually useful, as we can build an instance URL by concatenating series_aws_url and crdc_instance_uuid. That being said, we can always include SOPInstanceUID as well.
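A hedged sketch of that concatenation, assuming the series URL ends with a trailing "/*" wildcard (as produced by the CONCAT in the index query above) and that instance files are stored as "<crdc_instance_uuid>.dcm" under the series prefix; the exact bucket layout should be verified against dicom_all:

def instance_url(series_aws_url: str, crdc_instance_uuid: str) -> str:
    # Drop the trailing "*" wildcard and append the instance file name.
    prefix = series_aws_url.rstrip("*")
    return f"{prefix}{crdc_instance_uuid}.dcm"

# instance_url("s3://<bucket>/<series_uuid>/*", "<instance_uuid>")
# -> "s3://<bucket>/<series_uuid>/<instance_uuid>.dcm"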

DanielaSchacherer commented 2 months ago

Maybe we are :D. I wasn't aware that you were waiting for a review of the notebook above, sorry! What specific feedback do you need from me? Honestly, I was under the impression that, due to other tasks, the pathology idc-index was kind of put on hold and that - in general - you and Andrey only wanted to add as few attributes as possible (and not all of those we discussed at Project Week in January). And as I currently have a use case that would really benefit from the pathology index, I picked this up again now.

As long as crdc_instance_uuid is the only way to build instance-level URLs, I strongly vote for including it, especially because I really want to add functionality to idc-index to download single instances instead of a whole series. I think it should not be a problem, and I could also just go ahead as soon as we have all the necessary attributes and make a pull request with the respective changes.

vkt1414 commented 2 months ago

I can only control what I can control. I'm told there is no urgency for this.

All I personally need to know from you is if

  1. a minute is too long to return a query, while showing duckdb's progress bar. If so, I can still tackle it. My idea would be to ship the nested index, then unnest it and write a flattened version during initialization of IDCClient (see the sketch after this list). All subsequent queries would then run against the flat table with query run times under 10 s.

  2. Regarding the choice of attributes: if I'm able to show that I can hold every single attribute the pathology group requested in 11.2 MB, why exactly is there a reason to compromise? I emphasize, I will not expect a user to unnest anything or work with arrays or structs; I'll do that for them.
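A rough sketch of the initialization flow described in point 1, with duckdb doing the unnesting; the file names are placeholders, and Attributes is the array-of-struct column from the nested query above:

import duckdb

# Unnest the shipped nested index once and cache a flat copy; subsequent
# queries run against the flat parquet.
duckdb.execute("""
COPY (
    SELECT * EXCLUDE (Attributes),
           UNNEST(Attributes, recursive := true)
    FROM 'idc_pathology_index_nested.parquet'
) TO 'idc_pathology_index_flat.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")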

vkt1414 commented 2 months ago

@DanielaSchacherer I do have an interim solution for your use case. We recently made all BQ tables publicly available in AWS buckets. Paste your query here and I'll adjust it to duckdb SQL. A user still does not require BQ, but may have to compromise a bit on performance.

DanielaSchacherer commented 2 months ago

a minute is too long to return a query, while showing duckdb's progress bar. If so, I can still tackle it. My idea would be to ship the nested index, then unnest it and write a flattened version during initialization of IDCClient. All subsequent queries would then run against the flat table with query run times under 10 s.

yes, I would think a minute is a bit too long. Run times under 10s are fine.

2. Regarding the choice of attributes: if I'm able to show that I can hold every single attribute the pathology group requested in 11.2 MB, why exactly is there a reason to compromise? I emphasize, I will not expect a user to unnest anything or work with arrays or structs; I'll do that for them.

Yes, true. However, we might not yet have encountered problems or ambiguities that come with some attributes (as we haven't had a specific use case to work on yet). So for the user it would be easier to see more attributes being added than to see some of them removed when we encounter problems. But I am easily convinced otherwise by you and Andrey, as long as I get my pixel spacing...

fedorov commented 1 month ago

@vkt1414 I understand you are disappointed, but I feel your disappointment is not quite justified. The request for an instance-level index is relatively recent: https://github.com/ImagingDataCommons/idc-index/issues/97. The earlier needs of the pathology use case were addressed by the series-level sm_index, which was added in time for the ISBI'24 conference. If I missed the urgency about the instance-level index, I am indeed sorry; I did not sense that urgency.

Here is the one question we need to consider: should we include attributes that are available in the primary index table, or should we instead let the user get those by JOINs? Looking forward, I think it is not a sustainable approach to duplicate columns. The number of supporting tables will only grow over time.

Some other issues that need to be addressed:

  1. The GROUP BY operation is not necessary - for the instance-level index, we should just use the existing row-per-instance keyed by SOPInstanceUID.

My proposal - same as before - is to start with the minimum number of columns necessary to address the specific use case we have at hand. I would think at this point we definitely need SOPInstanceUID (because it may be used in references), crdc_instance_uuid (needed for download), aws_bucket (needed to build the URL), and per-instance image attributes: size, transfer syntax, staining. @DanielaSchacherer can you please complete the list based on the columns you actually need for your current use case?
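A hedged sketch of what such a pared-down instance-level query could look like (BigQuery SQL in a Python string); staining lives in nested specimen-preparation sequences and is deliberately left out here, and the column names should be double-checked against dicom_all:

minimal_instance_query = """
SELECT
  SeriesInstanceUID,   -- join key back to the main index
  SOPInstanceUID,      -- may be used in references
  crdc_instance_uuid,  -- needed for download
  aws_bucket,          -- needed to build the URL
  instance_size,
  TransferSyntaxUID
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'SM'
"""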

It is not just about the file size. We need to manage the overall complexity of the data we expose to the user here, and we need to be frugal to save space for more rows and more columns in the future to accommodate the growth of data and complexity of future use cases. Neither is expected to be static over time!

Let's discuss the topics above and then proceed with the PR. I understand we have a use case for instance-level index, I have no intent to stall this feature.

DanielaSchacherer commented 1 month ago

I think I texted this to Andrey on Slack but never put it on GitHub. For the current use case (and I would say this applies to a lot of use cases), these are the most important attributes.

Pathology attributes on series level:

  1. Tissue type (healthy, cancerous,...)
  2. Preparation, i.e. Tissue fixative and embedding medium
  3. Staining

Pathology attributes on instance level in decreasing importance in my opinion:

  1. Pixel spacing
  2. Image Type
  3. Image compression

vkt1414 commented 1 month ago

The request for instance-level index is relatively recent in https://github.com/ImagingDataCommons/idc-index/issues/97.

I do not think this is the case. As far as I understand, the pathology group has always mentioned that their attributes vary and that a series-level index would not suffice. And more often than not, they need to download individual slides, as Daniela has mentioned several times too.

Here is the one question we need to consider: should we include attributes that are available in the primary index table, or should we instead let the user get those by JOINs? Looking forward, I think it is not a sustainable approach to duplicate columns

I agree that there is no need to duplicate series-level attributes. We can remove every attribute above the SOP instance level other than SeriesInstanceUID.

The GROUP BY operation is not necessary - for the instance-level index, we should just use the existing row-per-instance keyed by SOPInstanceUID.

I disagree. First of all, SeriesInstanceUID is our key to SOPInstanceUIDs. Using an array of structs ensures that we do not need SeriesInstanceUID on every row, thereby saving a lot of space and keeping the index small. Again, a user will never have to do any unnesting. We will unnest to create a flat table during client initialization. Imagine unzipping a zip file.

It is not just about the file size. We need to manage the overall complexity of the data we expose to the user here, and we need to be frugal to save space for more rows and more columns in the future to accommodate the growth of data and complexity of future use cases. Neither is expected to be static over time!

Can you elaborate on this please? If a user is looking at a flat table, how is that complex? They will simply need to know which attributes are necessary for them.

fedorov commented 1 month ago

Using an array of structs ensures that we do not need SeriesInstanceUID on every row, thereby saving a lot of space and keeping the index small.

I am not sure. I need to experiment with this. It might be that the only column where we are saving by aggregating is SeriesInstanceUID. Others can be joined from the main index. If this is the case, I am not sure it is worth it.

Can you elaborate on this please? If a user is looking at a flat table, how is that complex? They will simply need to know which attributes are necessary for them.

The more columns we have, the more complexity there will be. Complexity is not just about the data types of the columns, but also about how easy it is for a human to comprehend what the table contains. I am sure it is possible to flatten the entire dicom_all; I do not think the result would be comprehensible.

vkt1414 commented 1 month ago

I am not sure. I need to experiment with this. It might be that the only column where we are saving by aggregating is SeriesInstanceUID. Others can be joined from the main index. If this is the case, I am not sure it is worth it.

Let me explain again. We package the nested version, so we are not downloading an 880 MB file (once again, the comparison is in the notebook). But the 11 MB file will explode to ~1 GB after unnesting. How would we justify downloading 1 GB over 11 MB, when the end result is the same and the cost is only a few more seconds at initialization? In my opinion, it is absolutely worth it.

The more columns we have, the more complexity there will be. Complexity is not just about the data types of the columns, but also about how easy it is for a human to comprehend what the table contains. I am sure it is possible to flatten the entire dicom_all; I do not think the result would be comprehensible.

I did not come up with the columns I chose. The pathology group (namely, in an in-person discussion with Chris, Daniela and David Clunie) wanted them. And the query you are working on is pulling pretty much all the columns I have in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796#issuecomment-2186526198, except for the optical path sequence.

DanielaSchacherer commented 1 month ago

I did not come up with the columns I chose. The pathology group (namely, in an in-person discussion with Chris, Daniela and David Clunie) wanted them. And the query you are working on is pulling pretty much all the columns I have in ImagingDataCommons/IDC-ProjectManagement#1796 (comment), except for the optical path sequence.

That's true, I remember the discussion and that we tried to take into account everything potential users might need. Still, personally I am fine with starting with the most important attributes and then adding more step by step. We might have missed some "real-world complexity", like the fact that we have multiple attributes describing the embedding procedure for H&E. Things like that can easily be overlooked at first, and from a development perspective it might make more sense to add attributes step by step once we are sure about them, instead of taking all of them at once and then having to remove some because there is more complexity to them than we knew beforehand. The good thing is that how to access all those attributes technically is already solved by Vamsi.

But in general I feel we should have an in-person discussion about that, because we might be just talking past each other.

fedorov commented 1 month ago

Pathology attributes on series level:

  1. Tissue type (healthy, cancerous, ...)
  2. Preparation, i.e. Tissue fixative and embedding medium
  3. Staining

Pathology attributes on instance level in decreasing importance in my opinion:

  1. Pixel spacing
  2. Image Type
  3. Image compression

@DanielaSchacherer please take a look at the query in the PR above, which captures the attributes above at the instance level. We can meet to discuss it next week.