drighelli / SpatialExperiment

54 stars 18 forks source link

Spot and fiducial diameters should be stored somewhere #25

Open LTLA opened 3 years ago

LTLA commented 3 years ago

Probably would go nicely in the metadata of the imgData DF, provided this is well-documented in ?read10xVisium.

HelenaLC commented 3 years ago

Sorry, but what is this? Are these specific to each image, or the whole sample? Is this explained somewhere on the 10x website? I'm guessing the diameter (in some distance unit) of the spots on the slides... so it would be experiment-wide (int_)metadata, no?

LTLA commented 3 years ago

Floating around in scalefactors_json.json. Yes, probably somewhere in int_metadata.

HelenaLC commented 3 years ago

Thinking about this more... the scalefactors_json.json file is unique to each sample. So just thinking out loud...

Thoughts?

LTLA commented 3 years ago

Agree that redundancy is fine; there aren't going to be any changes to the scale factor in the course of the analysis, so we don't have to worry about keeping things in sync. And it also ensures that, e.g., if you replace a image with one of different resolution, you have the ability to change the scale factor for that image only, rather than having to worry about rematching.

You could also store the fiducials in some per-image metadata. Maybe even inside the SpatialImage() itself, if we make the base class extend from Annotated; then we get a metadata() field for free.

drighelli commented 3 years ago

I agree with keeping the whole scale factors associated with each image.

Since you are working on the SpatialImage, do you want to do this @HelenaLC ?

HelenaLC commented 3 years ago

Okay, maybe I wasn't all too clear with what I meant...

  1. Storing the scale factor inside the SpatialImage would of course be cleaner, but essentially the same as what we have now, where we store it "next" to the image.
  2. This doesn't solve the "issue" that we have 4 scale factors per sample, but only 1 scale factor per image, and 2 scale factors that are sample-, not image-specific.

Proposal: An intuitive way of handling this for me is to follow along with how the data comes from Cell Ranger... E.g. imgData could look like this, which is fairly compact IMO:

         images  scaleFactors
sample1  <list>  <list>
sample2  <list>  <list>

where each row is a sample, and images a list of SpatialImages, which could in turn store the scale factor specific to that image (to omit having to match this downstream), i.e. 1 of the entries in scaleFactors.

height & width are gone (we get this from dim(SpatialImage)), sample_ids become rownames, and image_ids could be simply the names of the images list.

LTLA commented 3 years ago

2. This doesn't solve the "issue" that we have 4 scale factors per sample, but only 1 scale factor per image, and 2 scale factors that are sample-, not image-specific.

I don't understand the problem here. Why is the scaling factor sample-specific? Surely it's image-specific; the fact that you have multiple images of the same resolution for a given sample is irrelevant. I assume you're talking about the 1 high + 3 low-res images (or maybe the other way around, I don't remember) associated with each sample.

An intuitive way of handling this for me is to follow along with how the data comes from Cell Ranger... E.g. imgData could look like this, which is fairly compact IMO:

This makes the images harder to search for and get to, because now you need to go inside each list to, e.g., find all tissue section images. You can also imagine future applications where extra image metadata fields (e.g., z-index, strain method) could be added to a DF easily, while relying on names would be a pain. And I also don't understand why scaleFactors is a separate list, and not stored with the SpatialImage.

HelenaLC commented 3 years ago

I meant that the scaling factor file (4 scaling factors) is sample-specific. Agree that each image has 1 scaling factor associated with it; this can go into the SpatialImage metadata if we inherit from Annotated. However, I'm still not sure where to put the spot/fiducial diameter - these are sample-specific, not image-specific. If it were always the case that all samples from from the sample tech generation, these could go into (int_)metadata. But to be general, each sample could come from a different slide and have different diameters. Where do you suppose these should go? Maybe we could have something general like...

sample_id image_id images metadata

where images: list of SpatialImages with one scaleFactor each, and metadata: a list of anything else we/the user wants to put in there?

Anyways, for now, I can put drop scaleFactor column from the imgData and store the scale factors inside the SpatialImage?

LTLA commented 3 years ago

However, I'm still not sure where to put the spot/fiducial diameter - these are sample-specific, not image-specific.

I wouldn't worry about it. As we discussed, redundancy is fine; these things won't change during the course of the analysis.

Anyways, for now, I can put drop scaleFactor column from the imgData and store the scale factors inside the SpatialImage?

I think that's fine.

lmweber commented 3 years ago

Will close this issue for now. This issue may come up again in the future if we have different spot diameters in a new version of the Visium platform (e.g. Visium HD), but this is unlikely to be an issue within the next year or longer.

lmweber commented 2 years ago

As discussed in our meeting just now -- we need to check a few datasets to figure out if the fiducial_diameter_fullres and spot_diameter_fullres values in the scalefactors_json.json file are always the same for all samples in an experiment, or if they can vary slightly between samples / images.

If they are always the same for all samples / images then this is an experiment-wide value, but if they vary slightly between images then it is a sample-specific value.

(E.g. I could imagine slight variation between images if the microscope is a few micrometers higher up or the tissue slide is a few micrometers thinner, then this could affect the image resolution and slightly change these values per sample.)

drighelli commented 2 years ago

A possible discussed implementation:

create a dedicated element inside the metadata to store the entire scalefactors_json values and check them during the cbind process across all the samples. Possibly, indexing the read scalefactors by sample_id

giovp commented 2 years ago

jumping on this conversation cross-referenced in @theislab/zellkonverter#61 .

None of the scalefactors is the same across experiments, and that's because it depends on the pixel resolution of the image. This can be quickly checked e.g. by loading 2 different datasets (hne and fluorescent) from 10x genomics data portal (here parsed in processed format from squidpy).

adata = sq.datasets.visium_hne_adata_crop()
adata.uns["spatial"]["V1_Adult_Mouse_Brain"]["scalefactors"]
{'fiducial_diameter_fullres': 144.48769000651953,
 'spot_diameter_fullres': 89.44476048022638,
 'tissue_hires_scalef': 0.17011142,
 'tissue_lowres_scalef': 0.051033426}
adata = sq.datasets.visium_fluo_adata_crop()
adata.uns["spatial"]["V1_Adult_Mouse_Brain_Coronal_Section_2"]["scalefactors"]
{'fiducial_diameter_fullres': 288.25050000000005,
 'spot_diameter_fullres': 178.44077999999996,
 'tissue_hires_scalef': 0.08250825,
 'tissue_lowres_scalef': 0.024752475}

couple of other reasons why it'd be a good option to keep track of them:

HTH

also, congrats on paper and really excited to have conversions between anndata and SpaExperiment, even if only partial, via zellkonverter! think it'd benefit interoperability/community a ton!

p.s. I see mention of SpatialImage in conversation but couldn't find any reference, is that a new spatial data format of the BioC ecoystem?

HelenaLC commented 2 years ago

@lmweber @drighelli trying to work off some issues... have we come to a conclusion regarding this? Here some options for discussion:

sample_id image_id data metadata
# where 'data' has the 'SpatialImage' and
# 'metadata' everything that's in the .json

sample_id image_id data ...
# where ... is anything else people would like to store
# (so adding columns vs. sticking stuff in a list)

# OR store the scale factor in the SPI?
# if so: extend the class with a numeric slot?

sample_id image_id data scaleFactor metadata
# same as above, but a separate slot for 'scaleFactor'
lmweber commented 2 years ago

Thanks for the reminder @HelenaLC . I feel like storing the whole .json in a metadata column (option 1) is a good idea for long-term reproducibility. At the same time, also calculating the spot/fiducial diameters for each given image (i.e. scaled for the image size) would be more user-friendly for users, e.g. for plotting functions to easily access it.

What about both, e.g. the following columns in imgData (where metadata has the whole .json)?

sample_id  image_id  data  scaleFactor  spotDiameter  fiducialDiameter  metadata
HelenaLC commented 2 years ago

Could you clarify, in this layout, what would go in metadata? I think those columns already cover the whole .json, no?

I see the point in this, which I believe Aaron also raised above, to have separate columns for all metadata to facilitate accessibility. Then again, I don't like how it widens the imgData so much and feel that imgData(spe)$x isn't so much more work than imgData(spe)$metadata$x... So my personal preference would be to keep all in a single metadata column with separate scale factor column OR inside metadata OR stored in the SPI being open for discussion.

Regarding calculating diameters: I personally would prefer offering simple wrappers to do this (i.e., extract resolution and diameters and compute), which could be called by Visium-specific plotting functions. The reason being that these are highly tech-specific parameters that wouldn't exist for e.g. STEREO-seq or ISH-based platforms. But that's just my preference in order to not additionally clutter the imgData (as I think we'd want to keep both the original as well as the resolution-dependent values).

lmweber commented 2 years ago

Yes, these are good points. spotDiameter and fiducialDiameter would be highly specific to the Visium platform, so then imgData does start to get cluttered, especially as new platforms become popular (e.g. MERSCOPE, which is now commercially available).

For the metadata column, I was thinking to put the whole scalefactors_json.json file as a named list, mainly for long-term reproducibility. Some of the values in the .json have a lot of decimals, so it would be good to keep all these full values, regardless of what else we do with them (e.g. calculating scaled spot sizes).

But you are right that it then becomes somewhat redundant to also store calculated values in additional columns (spotDiameter, fiducialDiameter, etc, and in fact also scaleFactor). Separate wrapper functions that are called by Visium-specific plotting functions that access values via $metadata$... would also work, as you suggested.

On the other hand, Visium is by far the most popular platform right now (I'm thinking of Fig 3a in Lambda Moses's "Museum of Spatial Transcriptomics" paper), so maybe it is ok to have some additional customizations just for Visium. I'm not sure. This would also mean we would need to be open to including similar customized columns for other platforms in the future. Maybe one way to think about it would be that for any platform where we include a standard loading function (e.g. read10xVisium(), maybe readMERSCOPE() in the future), we would provide customized columns in imgData as necessary, but for other platforms people need to add them / calculate them themselves. I think there are arguments both ways here -- I don't mind if imgData is quite wide, but we want to make sure we are not committing to adding too many new customizations in the future, which will create more maintenance work.

LTLA commented 2 years ago

Having considered the options, most sustainable way to go seems to be a metadata column containing a List. This can be treated as a giant dump of image-specific miscellany, which means that readers for other technologies can easily store whatever they want in there. So at the very least, we're not losing any technology-specific information.

The problem is when we want to provide guarantees to downstream methods about the existence of certain fields in this metadata column. We wouldn't be able to do that with a loose metadata column, but perhaps we can provide some recommendations, e.g., "metadata should contain fidicual_diameter and spot_diameter for Visium data".

If we want to give hard guarantees, then we could create VisiumExperiment subclass that guarantees the existence of a fiducial_diameter column, etc. This seems like overkill and increases fragmentation of the class hierarchy.

LTLA commented 1 year ago

Nudging this. Putting these parameters somewhere would be helpful for downstream applications.

A very simple solution would be to have the virtual SpatialImage inherit from Annotated, then we can tuck any miscellaneous image-specific gunk into the SpatialImage's metadata() field. This avoids the need to create a dedicated column in the imageData.

lmweber commented 1 year ago

Thanks for the reminder.

Yes, maybe putting it all in metadata() in this way, along with some recommended guidelines for what should be included for specific platforms (especially Visium) is a good compromise between generality and platform-specificity.

drighelli commented 1 year ago

Yes, I also think that the best way is to store them inside the metadata by indexing per sample_id

lmweber commented 1 year ago

Ok thanks, I'll try this out with an example

lmweber commented 1 year ago

Rebasing SpatialImage classes on Annotated in PR #122 now enables storing this in metadata.

To do: update read10xVisium() and add examples as discussed in PR #122 .