ImagingDataCommons / idc-index-data

Python package providing the index to query and download data hosted by the NCI Imaging Data Commons
MIT License
1 stars 4 forks source link

ENH: add embedding, fixative and staining to sm_index #30

Closed fedorov closed 1 month ago

fedorov commented 2 months ago

Since there are multiple items within a series, the use of arrays becomes unavoidable - unless we switch to instance-level index for SM

Related to https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796.

@DanielaSchacherer please take a look!

DanielaSchacherer commented 2 months ago

Copy from the message that I sent you in Slack regarding the query in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1796:

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

fedorov commented 2 months ago

If we consider only the unique codes across all levels, I don't think this would be problematic. Also, this would allow us to handle those cases where we have different stains for different channels in the base layer.

fedorov commented 2 months ago

I tested it with HTAN, and it does not give me the results I expected to see. Need to spend more time on this!

fedorov commented 2 months ago

The above was due to the use of inner join and invalid assumption that all processing codes we are using are available for all slides! This is fixed now.

fedorov commented 2 months ago

I select the base level to make sure that I am consistently taking the staining & embedding information from a defined instance. These attributes are supposed to be the same across all instances (at least VOLUME instances, I am not 100% sure about the LABEL instances) in a Series. If they are not, then it's a mistake. And to somehow be deterministic André and I decided to always use the base level.

@DanielaSchacherer can you please review the query and the results it is producing?

I think aggregating distinct values over all instances in a series is the right way to do it.

fedorov commented 2 months ago

I will revisit the names for the newly added columns, and add separate columns for designator:code and codeMeaning, to be consistent with what we did in other similar columns already included.

fedorov commented 2 months ago

@DanielaSchacherer how about we call the new columns as follows:

the *_code_designator_value_str columns would have values of form DCM:111744, and *_CodeMeaning is self-explanatory.

This way we would be consistent with the conventions used for illuminationType, primaryAnatomicStructure etc, which I believe we all discussed and agreed upon earlier when we added the table to IDC BQ dataset.

@dclunie do you have any comments?

DanielaSchacherer commented 2 months ago

I know, it's a deviation from the convention, but what about Staining_usingSubstance_code_designator_value_str Staining_usingSubstance_CodeMeaning instead of usingSubstance_code_designator_value_str usingSubstance_CodeMeaning ?

dclunie commented 2 months ago

Coupling the SpecimenPreparationProcedure (of SCT:127790008 "staining") with the "Using substance" value makes sense to me, because in future there could be other substances used for other purposes.

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

DanielaSchacherer commented 1 month ago

yes, true. We should make sure that this is distinguishable from the cases where there is also multiple values per series, but because there are different values for the different instances instead of multiple (but the same) values for all instances. This is not currently not apparent from the query result, do you have an idea how we could achieve it?

@fedorov If we are decided about the naming, we could maybe close this pull request and make another one for the question above.

fedorov commented 1 month ago

I will update the current PR to account for this feedback.

fedorov commented 1 month ago

Also, note that there may be multiple values (and this is common, e.g., H&E uses two "Using substance" entries, one for the H and one for the E.

Yes, and this is already covered by the query, since it will aggregate all distinct values for each of the concepts selected. In this specific case, the result will look like the following (see the last 2 columns).

image

Similarly, for the HTAN multiplexed images, all of the stains encountered across the series instances will be captured.

image

Given this organization, we can select all series that were processed using any given stain or embedding.

fedorov commented 1 month ago

@DanielaSchacherer please take a look at the query, run it in BQ studio to look at the results, and let me know if you have any concerns.

fedorov commented 1 month ago

One concern I have is that you can not deduce from the query's result, if you have multiple values for e.g. staining if they originate from different instances or from the same. Of course, in case of H&E stain people would know that they belong together, but that might not always be true. Right now, I am not sure though about a good solution.

This is expected, as we discussed today. Series-level index has the union of all stains across all resolution levels and instances within the series. Instance-level index in #31 keeps track of this at the instance level.

Another thing which confuses me a little is that for most Series we don't have any information about embedding/tissue fixative. Is that expected?

I was somewhat surprised by this as well. @dclunie is this expected?

fedorov commented 1 month ago

Per @dclunie, fixation and embedding should be non-empty except for CPTAC and HTAN. I will look and report.

fedorov commented 1 month ago

Only the following CPTAC and HTAN collections do not have embedding codes defined.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl

Many TCGA collections do not have fixative code defined, but I guess this is also expected, since some samples were frozen and I assume no fixative was applied.

cptac_aml
cptac_brca
cptac_ccrcc
cptac_cm
cptac_coad
cptac_gbm
cptac_hnscc
cptac_lscc
cptac_luad
cptac_ov
cptac_pda
cptac_sar
cptac_ucec
htan_hms
htan_ohsu
htan_wustl
tcga_acc
tcga_blca
tcga_brca
tcga_cesc
tcga_chol
tcga_coad
tcga_dlbc
tcga_esca
tcga_gbm
tcga_hnsc
tcga_kich
tcga_kirc
tcga_kirp
tcga_lgg
tcga_lihc
tcga_luad
tcga_lusc
tcga_meso
tcga_ov
tcga_paad
tcga_pcpg
tcga_prad
tcga_read
tcga_sarc
tcga_skcm
tcga_stad
tcga_tgct
tcga_thca
tcga_thym
tcga_ucec
tcga_ucs
tcga_uvm
DanielaSchacherer commented 1 month ago

While checking for myself, I saw that temp_table extracts collection_id, but it is not used further. Do we want to output it as well as part of the final result?

fedorov commented 1 month ago

I don't know ... it's a tradeoff between redundancy and duplication of columns available in the main index and convenience of not having to do joins. I am not sure there is a perfect solution to this.

DanielaSchacherer commented 1 month ago

collection_id helps you to avoid joins?