DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Add imaging-specific facets #885

Closed hannes-ucsc closed 5 years ago

hannes-ucsc commented 5 years ago

Entities of interest are imaged_specimen and image_file.

┆Issue is synchronized with this Jira Story ┆Project Name: azul ┆Issue Number: AZUL-549 ┆Epic: Imaging support

zperova commented 5 years ago

No organ is displayed for the 1 FOV BaristaSeq mouse SpaceTx dataset even though it is filled out in the spreadsheet. This is likely because the organ is assigned via the cell suspension, which is not relevant to imaging datasets. For imaging datasets the relevant biomaterial is "imaged_specimen". Adding it to the algorithm that determines Organ (organ part) should solve the problem.

zperova commented 5 years ago

The library construction method is irrelevant for imaging datasets; the imaging method should be displayed instead (imaging_protocol.target.assay_type.text or imaging_protocol.target.assay_type.ontology_label).

hannes-ucsc commented 5 years ago

I've asked @danielsotirhos to reindex the samples deployment against DSS staging because I think it will address the organ issue. Once reindexing is done, this link

https://service.samples.dev.explore.data.humancellatlas.org/repository/projects?filters={'file':{'projectId':{'is':['ae5237b4-633f-403a-afc6-cb87e6f90db1']}}}

should 1) have a hit and 2) list the organ "brain" under specimens.
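For anyone who wants to build such a link for another project, here is a minimal sketch. The base URL and the shape of the filters object are taken from the link above; the helper name `projects_url` is made up for illustration, and encoding the filter as URL-quoted JSON with double quotes (rather than the single quotes in the link) is an assumption about what the service accepts.

```python
import json
from urllib.parse import urlencode

# Base URL copied from the link above.
BASE = "https://service.samples.dev.explore.data.humancellatlas.org/repository/projects"


def projects_url(project_id):
    """Hypothetical helper: build a Projects query filtered by one project ID."""
    # Same structure as the filters object in the link above.
    filters = {"file": {"projectId": {"is": [project_id]}}}
    # Assumption: the service accepts the filter as URL-encoded JSON.
    return BASE + "?" + urlencode({"filters": json.dumps(filters)})


url = projects_url("ae5237b4-633f-403a-afc6-cb87e6f90db1")
print(url)
```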

We also need to add support for imaging_protocol.target[*].assay_type. However, @zperova, in the imaging bundle I am looking at, there are dozens of targets under target. Each one has its own assay_type field. In that bundle these fields all have the same text value ("in situ sequencing") but I fear that there might also be many different values. The data browser can't display an indeterminate number of values in one column. How should we handle that?

zperova commented 5 years ago

@hannes-ucsc thanks, I can see brain now. For the second part: at the moment the assay_type is the same for all targets, but with more complex datasets it might change. Am I correct in thinking that putting an assay_type field in the Imaging Protocol would solve this issue?

hannes-ucsc commented 5 years ago

@zperova, if the assay type could potentially be different between targets, pulling that property up into the parent imaging_protocol entity wouldn't work. In that case we'll just have to accumulate all imaging_protocol.target[*].assay_type values into a weighted and bounded set. We would index and display the, say, 10 most frequently used assay types for each protocol. Would that work?
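To make the "weighted and bounded set" idea concrete, here is a rough sketch, not Azul's actual code. It assumes a protocol document shaped like `{"target": [{"assay_type": {"text": ...}}]}`, which follows the property path mentioned earlier in this thread; the function name `top_assay_types`, the `limit` parameter and the second assay type value are invented for the example.

```python
from collections import Counter


def top_assay_types(imaging_protocol, limit=10):
    """Count assay types over all targets and keep the `limit` most frequent."""
    counts = Counter(
        target["assay_type"]["text"]
        for target in imaging_protocol.get("target", [])
    )
    # most_common() sorts by frequency, so truncating drops the rarest values.
    return dict(counts.most_common(limit))


# Hypothetical protocol with three targets; the second value is made up.
protocol = {
    "target": [
        {"assay_type": {"text": "in situ sequencing"}},
        {"assay_type": {"text": "in situ sequencing"}},
        {"assay_type": {"text": "some other assay"}},
    ]
}

print(top_assay_types(protocol, limit=1))  # {'in situ sequencing': 2}
```

The counts act as the weights, and `limit` is the bound on how many distinct values get indexed and displayed.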

zperova commented 5 years ago

@hannes-ucsc why wouldn't it work to have the assay_type at the imaging_(preparation)_protocol level? The imaging_(preparation)_protocol describes each of the protocols used, so if there are two different assay_types used in the experiment, these can be pulled from there to display without any need of accumulation of values (which I think is a more complicated task). I am thinking along the lines of what is done with the library_construction_method. These are pulled from library_preparation_protocol.library_construction_method.ontology_label so I propose to do the same to assay_type. Or am I wrong?

hannes-ucsc commented 5 years ago

I guess I don't understand. There must be a reason why assay_type is a property of imaging_protocol.target rather than just imaging_protocol. If we pull it up into imaging_protocol then we'd have to make imaging_protocol.assay_type an array and we'd lose the association with target. I have no idea what the right way is. You tell me.

I'm just saying that if 1) there can be many targets and 2) each target could potentially use a different assay type, that would imply that there could be many distinct assay types. We'd then have to apply some sort of upper bound on the number of assay types because the data browser can't display an arbitrarily large number of values in a single table cell.

Likewise, if we pulled assay type up into imaging protocol, we could still have many assay types (yes?) and we would also need to apply an upper bound.

hannes-ucsc commented 5 years ago

@zperova we are moving forward with indexing the N most frequent assay types among the targets in each imaging_protocol.target. When we aggregate multiple imaging_protocol instances, for example to summarize them per project in the Projects tab, we'll take the top M most frequent assay types. Interestingly, this is an approximate process. I can elaborate on why if needed. To jog my memory:

N=1, M=2

{x,x,a} => {x:2}
{y,y,a} => {y:2}
{y,y,a} => {y:2}

[{x:2}, {y:2}, {y:2}] becomes {y:4, x:2} but {y:4, a:3} would be more accurate.
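The example above can be replayed in a few lines of Python. This is only an illustration of why truncating before merging is approximate, not the indexer's implementation; the helper `top` and the variable names are made up.

```python
from collections import Counter


def top(counts, k):
    """Keep only the k most frequent entries of a Counter."""
    return Counter(dict(counts.most_common(k)))


N, M = 1, 2

# Assay types of the targets in three imaging_protocol instances.
protocols = [["x", "x", "a"], ["y", "y", "a"], ["y", "y", "a"]]

# Step 1: per protocol, index only the N most frequent assay types.
indexed = [top(Counter(p), N) for p in protocols]

# Step 2: merge the truncated counts and keep the top M.
approximate = top(sum(indexed, Counter()), M)

# For comparison: count everything first, then keep the top M.
exact = top(Counter(t for p in protocols for t in p), M)

print(dict(approximate))  # {'y': 4, 'x': 2}
print(dict(exact))        # {'y': 4, 'a': 3}
```

The value `a` never makes it into any per-protocol top N, so its three occurrences are lost before the merge even though it outranks `x` overall.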

hannes-ucsc commented 5 years ago

We've also decided not to index other properties from imaged_specimen and image_file until we are explicitly asked to expose particular properties.

The imaging dataset is still displayed with the wrong organ, but that is due to https://github.com/HumanCellAtlas/data-browser/issues/640. Azul already indexes the organ correctly, going up in the graph through imaged_specimen (instead of cell_suspension) to specimen_from_organism.

zperova commented 5 years ago

> I guess I don't understand. There must be a reason why assay_type is a property of imaging_protocol.target rather than just imaging_protocol. If we pull it up into imaging_protocol then we'd have to make imaging_protocol.assay_type an array and we'd lose the association with target. I have no idea what the right way is. You tell me.

You are right - that's the reason we had the target module in the first place. The important part is for a dataset to be identified in the Browser when someone searches a particular assay type. It is my understanding that your approach will accomplish that.

hannes-ucsc commented 5 years ago

> The important part is for a dataset to be identified in the Browser when someone searches a particular assay type. It is my understanding that your approach will accomplish that.

Not exactly. If there are 101 distinct assay types spread over, say, 1000 target objects, we will discard one assay type (the least frequently used one) and the user will not be able to find the dataset by that assay type. However, we can change the thresholds. If you think 100 is too low, let me know.
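A small sketch of that scenario, with made-up assay type names and an assumed distribution of 1000 targets over 101 distinct types (the real data will look different):

```python
from collections import Counter

LIMIT = 100  # the threshold discussed above

# Hypothetical data: one rare type used once, "type 0" used 9 times and
# "type 1" through "type 99" used 10 times each, for 1000 targets in total.
assay_types = ["rare type"]
for i in range(100):
    assay_types += [f"type {i}"] * (9 if i == 0 else 10)

counts = Counter(assay_types)
kept = dict(counts.most_common(LIMIT))

print(len(assay_types), len(counts), len(kept))  # 1000 101 100
print("rare type" in kept)                       # False
```

With `LIMIT = 101` the rare type would be kept, so raising the threshold trades a larger indexed set for complete searchability.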

zperova commented 5 years ago

@hannes-ucsc since it is very unlikely that there will be a large number of assay types per dataset, they should all be represented in the Browser with the threshold of 100.