constantAmateur / SoupX

R package to quantify and remove cell free mRNAs from droplet based scRNA-seq data
253 stars 34 forks source link

Presence of ambient RNA after correction ? #47

Closed akramdi closed 3 years ago

akramdi commented 4 years ago

Hello,

Thank you for the package and the detailed vignette!

I followed the steps to estimate and correct for ambient RNA using a set of genes (manual estimation). I work with tumor samples (tumor cells + normal cells from the microenv).

I chose to use 3 sets of genes to manually estimate the contamination. These genes should be expressed exclusively in a subset of the cells:

HBgenes=c("HBB", "HBA2") #-> expressed in erythrocytes
NORgenes=c("PHOX2B") #-> a transcription factor involved in the development of noradrenergic neuron populations
ENDOgenes=c("ENG", "PTPRB") #->expressed in endothelial cells

Here's a look at the data before running SoupX, I just plot a tsne + the expression of my favourite genes (I used Seurat).

image1 tsne geneExp TD2

We can clearly see the contamination, especially with HBB which seems to be expressed within the PHOX2B+ population.

First of all, here's the output of plotMarkerMap to check the expression of contaminating genes:

image2 plotMarkerMap TD2

I have two questions here:

  1. Expression of PHOX2B seems to be slightly different from the background (light reddish colour) and only a few cells are outlined in green indicating a significant difference. Does this mean we are probably dealing with a strong PHOX2B contamination from the "soup"?

  2. For ENG gene, log2Ratio is very high everywhere but only a few cells show statistically significant difference (outlined in green). I wonder how we should interpret the cells that don't cluster together (no green outline + distributed in the middle). They may be be doublets? This is perhaps beyond the scope of SoupX usage, but if you have any thoughts on this, I'd be glad to hear them!

Finally, here's the output of plotMarkerMap to show the cells on which the estimations are done:

image3 useToEst TD2

and the reported contamination fraction: Estimated global contamination fraction of 9.53%

After running SoupX, I carry on with the analysis using the corrected counts (dimensionally reduction, normalization, clustering..). The first thing I noticed is the presence of background expression of HBB gene away from, what seems to be, the true cluster of erythrocytes (cluster 4). So I am still suspecting the presence of contamination.

image6 umap TD2 image4 HBB umap TD2 image5 HBB exp TD2

  1. Am I correct to suspect the persistence of contamination in some clusters? or is this level of expression too low to be worried about?

Many thanks in advance for the help! Best, Amira

constantAmateur commented 4 years ago

Hi Amira,

Thank you for your very detailed issue. It looks to me like you've done everything right here. You might want to run autoEstCont on your data to check that this gives a similar contamination fraction. But using HB genes to estimate contamination will work well on your data I'm all but certain and 10% is about what I'd expect for reasonable quality tumour data.

Your interpretation of the plotMarkerMap output is not quite right. Anything that is even a little red likely represents true expression of that gene in that cell. It's just that the test used to circle things in green is very strict and so will only mark very red cells. I would be confident that PHOX2B is truly expressed in the majority of your cells.

As to the residual expression of HBB post correction, this is to be expected. SoupX does the best it can, but no method can perfectly identify and remove contamination. So the package has to make a tradeoff between removing as much contamination as possible without removing too much true signal. Our tests indicate this usually removes something like 80-90% of contamination, hence the residual expression of HBB. This is almost certainly still contamination, but to remove it algorithmically we would have to remove true counts at the same time.

You can explore this by using setContaminationFraction to set the contamination fraction to 20% or 30% and then looking at the HBB expression. This can be a useful thing to do when you care about removing contamination above all else. But for most people the default will be what you want and you absolutely expect to see a little contamination left over afterwards. But the level of this contamination will now be about an order of magnitude lower, such that it has little effect on your analysis.

akramdi commented 4 years ago

Hi Matthew,

Thank you for your response. Actually, I realised that I was doing a couple of things wrong (I'll leave this here for future users):

  1. Using the same set of genes for all my samples (I have 10 samples)
  2. Not checking the soup's profile before picking a set of gene

Thanks to plotMarkerDistribution function I was able to decide on some genes (either HBB or immunoglobulin genes for some samples). Very useful! PHOX2B was never among these genes for all 10 samples likely because it's truly expressed as you said.

I am not very comfortable with setting setContaminationFraction so I decided to use the automatic estimation for a couple of samples whose ambient profile contained ubiquitously expressed genes or if I was not able to run plotMarkerDistribution (because the contamination was too high).

About the high contamination error: I have a sample that was flagged very early by CellRanger as having a low fraction reads in cells (25%), indicating a possible high contamination in ambient RNA. I was not able to use plotMarkerDistribution. So I ran SoupX estimation using the automatic mode and I got a very high estimation:

6545 genes passed tf-idf cut-off and 2678 soup quantile filter.
Using 22827 independent estimates of rho.
Estimated global rho of 0.65
Extremely high contamination estimated (0.65).  This likely represents a failure in estimating the contamination fraction.  Set forceAccept=TRUE to proceed with this value.
Subtracting contaminating counts
Expanding counts from 12 clusters to 3304 cells.
  1. Is 65% still reasonable considering that we only have 25% fraction reads in cells to begin with? (fraction reads being reads associated to a cell-containing droplets)

  2. Can you consider making this a warning instead of an error in order for plotMarkerDistribution to still return a plot?

Thanks again, Amira

constantAmateur commented 4 years ago

Hi Amira,

65% is an excessively high contamination fraction. There's no way to say if this is realistic or not without looking at your data in detail. If you have the latest release from CRAN installed you should be able to proceed to obtain a plot by setting forceAccept=TRUE when you call plotMarkerDistribution.

akramdi commented 4 years ago

Thanks! I updated SoupX to the latest version and settingforceAccept=TRUE in plotMarkerDistribution does not seem to work for me (I explicitly use plotMarkerDistribution with no geneset):

#SoupX version
> packageVersion("SoupX")
[1] ‘1.4.5’

> pdf(sprintf("../check_ambient_profile/check_ambient_RNA_.%s.pdf", sampleName))
>  plotMarkerDistribution(sc, forceAccept=TRUE)
No gene lists provided, attempting to find and plot cluster marker genes.
Found 6545 marker genes
Error in estimateNonExpressingCells(sc, nonExpressedGeneList, ...) : 
  unused argument (forceAccept = TRUE)
>  dev.off()  
null device 
          1 
> 
constantAmateur commented 3 years ago

Sorry for the incorrect advice. This should work in 1.4.5 without the need to set anything (i.e., don't set forceAccept=TRUE). Please reopen this issue if the problem persists.

nouhaylaeraysy commented 1 year ago

Hello i'm using soupx on my data i run this line :soup.channel <- autoEstCont(soup.channel,priorRhoStdDev = 20) i get this :5437 genes passed tf-idf cut-off and 1032 soup quantile filter. Taking the top 100. Using 1430 independent estimates of rho. Estimated global rho of 0.40 Message d'avis : Dans setContaminationFraction(sc, contEst, forceAccept = forceAccept) : Estimated contamination is very high (0.4). how i solve this? best