Part 2 benchmarking spatial transcriptomics data

allyhawkins commented 2 years ago

Here I am breaking out the second half of the notebook originally filed in #149, which compares quantifying spatial transcriptomics libraries with Alevin-fry + Spaceranger vs. Spaceranger alone, into its own notebook. I have also pared down the analysis to compare only one sample across both tools, to decrease the complexity and make things easier to understand. I am looking at the same metrics as in the previous PR: UMI/spot, genes detected/spot, mean gene expression, and overlapping genes. I have also included plots showing the distribution of UMI/spot and genes detected/spot for both tools, with and without the underlying image.
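For readers unfamiliar with these metrics, here is a minimal base R sketch of how they can be computed from a genes-by-spots counts matrix. The toy matrix, the gene and spot names, and the second tool's gene list are all made up for illustration; this is not the code used in the notebook.

```r
# Toy genes x spots counts matrix (random values, for illustration only)
set.seed(42)
counts <- matrix(
  rpois(5 * 4, lambda = 2),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("spot", 1:4))
)

# UMI/spot: total counts in each column (spot)
umi_per_spot <- colSums(counts)

# Genes detected/spot: number of genes with at least one count in each spot
genes_per_spot <- colSums(counts > 0)

# Mean gene expression: average count of each gene across all spots
mean_gene_expr <- rowMeans(counts)

# Overlapping genes: genes detected by both tools
# (genes_tool_b is a hypothetical gene list standing in for the second tool)
genes_tool_a <- rownames(counts)[rowSums(counts) > 0]
genes_tool_b <- c("gene2", "gene3", "gene6")
shared_genes <- intersect(genes_tool_a, genes_tool_b)
```

Running the same calculations on the matrix from each tool gives per-spot and per-gene values that can be compared directly.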

In doing this, I added a few custom functions to the benchmarking-functions/R folder: a function to read in the spaceranger output as a SpatialExperiment object, and two functions to grab the colData and rowData from an spe and convert them to a data.frame. I'm hoping that breaking these out will make the notebook less confusing. In the future I anticipate moving some of these functions from here to scpcaTools.

The one piece that I did not break out was the plotting code, since I was using it specifically in this notebook, but I can break it out if reviewers think that is necessary. I had to create two plotting functions, one for spes created with alevin-fry + spaceranger and one for spaceranger-only spes, because of the bug identified in SpatialExperiment::read10XVisium() where the axes are flipped.

Note that this is stacked on #149.

allyhawkins commented 2 years ago

I believe I addressed the majority of the comments you left, @jashapiro. I modified the function that we use to import the spaceranger-only files so that it no longer uses read10XVisium; it now works similarly to the way I import the fry + spaceranger files, but without subsetting to only spots found in both tools.

I also removed the function that grabs the rowData and converts it to a data frame, and switched to using the function I already had in scpcaTools. I did not remove the colData function because the new function I wrote here also merges the colData with the spatialData to get the in_tissue column for each spe. If needed, I can modify the function in scpcaTools in a separate PR to incorporate the spatialData, or we can wait until they do the reorganization and then, if we need these functions again, use them as they are in scpcaTools.

When going back through and examining things more closely, I did notice that Alevin-fry + Spaceranger tends to have higher counts and genes detected per spot. You can see that in the spatial plots as well: the patterning, although similar, looks denser. I am not quite sure which one is "correct" or what to make of this, but it is definitely food for thought that these don't match up nearly as nicely as with the single-cell data.

I also looked into the genes that don't have as high a correlation between their mean expression values. After playing around with it a bit, I kept the calculation of the fold change between mean gene expression in alevin-fry vs. spaceranger and highlighted genes on the graph with high or low fold change. I also ran an over-representation analysis, and no significant GO terms came up. Scrolling through the gene names, you do see a lot of ribosomal and translation-associated genes, plus some HOX and developmental genes, on the outer edges of the correlation scatter plot, but I wasn't sure of the best way to do this analysis and am open to other ideas. I also tried a linear regression and plotted the genes with high residuals; those are the same genes that show up in the fold change analysis. If you had a particular idea of what you were looking for here, please let me know.
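As a rough illustration of the two approaches described above (fold change and regression residuals), here is a minimal base R sketch on simulated data. The pseudocount, both cutoffs, and all column names are assumptions chosen for the example and are not taken from the notebook.

```r
# Simulated mean expression per gene for each tool (values are made up)
set.seed(1)
n_genes <- 200
mean_expr <- data.frame(
  gene = paste0("gene", seq_len(n_genes)),
  spaceranger = rexp(n_genes, rate = 1)
)
mean_expr$alevin_fry <- mean_expr$spaceranger * exp(rnorm(n_genes, sd = 0.3))

# log2 fold change; the pseudocount (an arbitrary choice) avoids log of zero
pseudocount <- 0.01
mean_expr$log2_fc <- log2(
  (mean_expr$alevin_fry + pseudocount) / (mean_expr$spaceranger + pseudocount)
)

# Flag genes with high or low fold change (cutoff is an assumption)
fc_cutoff <- 1
mean_expr$fc_outlier <- abs(mean_expr$log2_fc) > fc_cutoff

# Linear regression of one tool on the other, on the log scale
fit <- lm(
  log2(alevin_fry + pseudocount) ~ log2(spaceranger + pseudocount),
  data = mean_expr
)
mean_expr$residual <- residuals(fit)

# Flag genes with large residuals (two standard deviations, an assumption)
resid_cutoff <- 2 * sd(mean_expr$residual)
mean_expr$resid_outlier <- abs(mean_expr$residual) > resid_cutoff

# Cross-tabulate to see how much the two sets of flagged genes overlap
table(fold_change = mean_expr$fc_outlier, residual = mean_expr$resid_outlier)
```

When the regression slope is close to 1 and the intercept close to 0, the residuals are nearly the same quantity as the log fold change, which would be consistent with both methods flagging the same genes.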

Here is a link to the updated html file for reference.

jashapiro commented 2 years ago

I haven't had time for a full review here, but I was looking at the outputs, and I think some error might have been introduced in the latest changes: the images for spaceranger don't look correct now:

I feel like the distributions are odd too?

allyhawkins commented 2 years ago

Thanks for catching this, @jashapiro! After taking another look, it appears that the read10XVisium function does something slightly different from how we were creating the object manually. I went back through and couldn't quite figure out why. The main difference I found is that the function appears to read in the HDF5 files by default, but I don't think that should explain it. I removed the create_spaceranger_spe function that I had made and have gone back to using read10XVisium, since that has been fixed, and the images now all look as they should. Interestingly, Spaceranger shows higher counts, genes detected, and mean gene expression than Alevin-fry.

Here's the updated html.

jashapiro commented 2 years ago

It looks like you are using the SpatialExperiment version from GitHub, which is fine, but I think that should be explicitly noted in the notebook (it is in the sessioninfo, but I might not expect people to look there right away). (I had a thought finally about the reason for the plots above: the coordinates and cells were presumably not getting read in the same order, so there was some re-sorting that needed to be done. That would make the density plots include spots that were not really in tissue, resulting in the bimodal distribution)
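To make the ordering explanation concrete, here is a minimal base R sketch with toy data. The barcodes, counts, and in_tissue values are invented for illustration, and this is only a guess at the kind of mismatch that could have occurred, not a reconstruction of the actual bug.

```r
# Toy per-spot UMI totals, in the order the counts matrix stores the barcodes
counts <- data.frame(
  barcode = c("AAAC-1", "AAAG-1", "AACC-1", "AACT-1"),
  umi = c(5200, 30, 4800, 12)
)

# Toy spot metadata in a different order, as a positions file might be
positions <- data.frame(
  barcode = c("AACT-1", "AAAC-1", "AACC-1", "AAAG-1"),
  in_tissue = c(0, 1, 1, 0)
)

# Wrong: binding by row position assumes both tables share the same order
naive <- cbind(counts, in_tissue = positions$in_tissue)

# Right: look up each barcode explicitly before combining
idx <- match(counts$barcode, positions$barcode)
stopifnot(!anyNA(idx))  # every barcode must be found in the metadata
aligned <- cbind(counts, in_tissue = positions$in_tissue[idx])

# "In tissue" UMI totals under each approach
naive$umi[naive$in_tissue == 1]      # includes a low-count background spot
aligned$umi[aligned$in_tissue == 1]  # only the true in-tissue spots
```

With the naive join, a background spot with very few UMIs is labeled as in tissue, which is how a second low-count mode could appear in the density plots.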

Finally getting to thinking about the results here: I am a bit surprised by the fact that alevin-fry is performing so much worse. I assume we are using the same settings as with the single-cell data for the mapping and quantification? I looked back at earlier results and it does seem that sometimes there is a slight difference between AF & cellranger, but we haven't seen it to this degree before?

I know I had asked you to focus on one sample to reduce confusion, but in this case maybe we will need to look at more than one to know how to proceed!

allyhawkins commented 2 years ago

> It looks like you are using the SpatialExperiment version from GitHub, which is fine, but I think that should be explicitly noted in the notebook (it is in the sessioninfo, but I might not expect people to look there right away).

I added in a note about this at the beginning of the notebook.

> I am a bit surprised by the fact that alevin-fry is performing so much worse. I assume we are using the same settings as with the single-cell data for the mapping and quantification? I looked back at earlier results and it does seem that sometimes there is a slight difference between AF & cellranger, but we haven't seen it to this degree before?
>
> I know I had asked you to focus on one sample to reduce confusion, but in this case maybe we will need to look at more than one to know how to proceed!

I was also surprised by the difference here, since with the single-cell and single-nuclei data we saw a lot more overlap between cellranger and alevin-fry. I added the second sample back in and I am still seeing the same trend. I am not inclined to use Alevin-fry here and don't see a particular reason to include it, as Spaceranger seems to be capturing more information.

AlexsLemonade / alsf-scpca

Part 2 benchmarking spatial transcriptomics data #151