constantAmateur / SoupX

R package to quantify and remove cell-free mRNAs from droplet-based scRNA-seq data

I don't see the SoupX correction in the Seurat object and in VlnPlot #72

Closed: abhisheksinghnl closed this issue 3 years ago

abhisheksinghnl commented 3 years ago

Dear All,

I am trying to get rid of ambient RNA from my data using SoupX.

I am using the following code:

> sc = load10X("/Samples.19.01.2021/MuC/outs/")
Loading raw count data
Loading cell-only count data
Loading extra analysis data where available

> sc = autoEstCont(
+   sc,
+   topMarkers = NULL,
+   tfidfMin = 0.95,
+   soupQuantile = 0.9,
+   maxMarkers = 10,
+   contaminationRange = c(0.01, 0.9),
+   rhoMaxFDR = 0.01,
+   priorRho = 0.001,
+   priorRhoStdDev = 0.1,
+   doPlot = TRUE,
+   forceAccept = FALSE,
+   verbose = TRUE
+ )
764 genes passed tf-idf cut-off and 166 soup quantile filter.  Taking the top 10.
Using 41 independent estimates of rho.
Estimated global rho of 0.11
> out = adjustCounts(sc)
Expanding counts from 13 clusters to 10676 cells.
> plotChangeMap(sc, out, "TTN")

This is the first plot: [Image 1: plotChangeMap output for TTN]

Now I load the corrected matrix (out) into Seurat:

library(Seurat)
library(tidyverse)  # for %>%, ggplot2 and tibble::enframe used below
srat = CreateSeuratObject(out)
pbmc <- srat

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc@meta.data %>% head()
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 10)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2), ncol = 1)

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

# raw counts, same as pbmc@assays$RNA@counts[1:10, 1:10]
pbmc[["RNA"]]@counts[1:10, 1:10]

# library size normalized and log transformed data
pbmc[["RNA"]]@data[1:6, 1:6]

# scaled data
pbmc[["RNA"]]@scale.data[1:6, 1:6]

pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
pbmc[["RNA"]]@scale.data[1:6, 1:6]

## raw counts and log transformed matrix 
dim(pbmc[["RNA"]]@counts)

dim(pbmc[["RNA"]]@data)

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE)

p1<- DimPlot(pbmc, reduction = "pca")
p1

pbmc@reductions$pca

#the same as 
pbmc[["pca"]]

## methods that work with the DimReduc object
utils::methods(class = 'DimReduc')

# the same as pbmc@reductions$pca@cell.embeddings
# the same as pbmc[["pca"]]@cell.embeddings
Embeddings(pbmc, reduction = "pca") %>% head()

p2<- Embeddings(pbmc, reduction = "pca") %>% 
  as.data.frame()%>% 
  ggplot(aes(x = PC_1, y = PC_2)) +
  geom_point(color = "red", size = 0.5) +
  theme_classic()
p2

CombinePlots(plots = list(p1, p2))

pbmc <- JackStraw(pbmc, num.replicate = 100, dims = 50)
pbmc <- ScoreJackStraw(pbmc, dims = 1:50)

JackStrawPlot(pbmc, dims = 1:30)

ElbowPlot(pbmc, ndims = 50)

mat <- pbmc[["RNA"]]@scale.data 
pca <- pbmc[["pca"]]

# Get the total variance:
total_variance <- sum(matrixStats::rowVars(mat))

eigValues = (pca@stdev)^2  ## EigenValues
varExplained = eigValues / total_variance

varExplained %>% enframe(name = "PC", value = "varExplained" ) %>%
  ggplot(aes(x = PC, y = varExplained)) + 
  geom_bar(stat = "identity") +
  theme_classic() +
  ggtitle("scree plot")

### this is what Seurat is plotting: standard deviation
pca@stdev %>% enframe(name = "PC", value = "Standard Deviation" ) %>%
  ggplot(aes(x = PC, y = `Standard Deviation`)) + 
  geom_point() +
  theme_classic()

pbmc <- FindNeighbors(pbmc, dims = 1:20)
pbmc <- FindClusters(pbmc, resolution = 0.4)

pbmc <- RunUMAP(pbmc, dims = 1:20)

VlnPlot(pbmc, features = c("TTN"))

[Image 2: VlnPlot of TTN after SoupX correction]

I don't see any difference after the SoupX correction. Either I am missing something in the fine-tuning or I am not setting up my Seurat section properly.
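A minimal way to check whether adjustCounts changed anything for TTN, assuming the sc and out objects created above, is to compare the total counts before and after correction:

# sc$toc is the original cell count matrix, out is the corrected one
sum(sc$toc["TTN", ])  # total TTN counts before correction
sum(out["TTN", ])     # total TTN counts after correction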

Could anyone help me?

Many thanks in advance.

constantAmateur commented 3 years ago

See the FAQ in the readme (pasted below).

There's a trade-off between removing more contamination and accidentally removing real signal in the process. The defaults are tuned to err on the side of keeping true counts. If you want to remove contamination more aggressively, such as TTN here, manually set the contamination fraction to something higher, like 20%.
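For example, a minimal sketch (reusing the same data loaded above, not code from this thread) that skips the automatic estimate and fixes the contamination fraction at 20% before correcting:

sc = load10X("/Samples.19.01.2021/MuC/outs/")  # reload the SoupChannel (clustering is picked up automatically)
sc = setContaminationFraction(sc, 0.2)         # force rho to 20% instead of the estimated 11%
out = adjustCounts(sc)                         # regenerate the corrected matrix with the higher rho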

My data still looks contaminated. Why didn't SoupX work?

The first thing to do is check that you are providing clustering information, either by doing clustering yourself and running setClusters before adjustCounts or by loading it automatically from load10X. Cluster information allows far more contamination to be identified and safely removed.
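As a sketch of the setClusters route, assuming the cluster labels come from a Seurat object that still contains every cell in the SoupChannel (i.e. clustering was run before any filtering, otherwise setClusters will complain about cells with no cluster assigned):

# named vector: cluster label per cell barcode
cl  = setNames(as.character(pbmc$seurat_clusters), colnames(pbmc))
sc  = setClusters(sc, cl)
out = adjustCounts(sc)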

The second thing to consider is whether the contamination rate estimate looks plausible. As estimating the contamination rate is the part of the method that requires the most user input, it can be prone to errors. Generally, a contamination rate of 2% or less is low, 5% is typical, 10% is moderate, and 20% or above is very high. Of course your experience may vary, and these expectations are based on fresh tissue experiments on the 10X 3' platform.
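For instance, assuming the per-cell estimate is stored in sc$metaData$rho (as in recent SoupX versions), you can inspect it directly after running autoEstCont:

mean(sc$metaData$rho)     # global contamination estimate, ~0.11 in the run above
summary(sc$metaData$rho)  # per-cell values (a single global value when autoEstCont is used)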

Finally, note that SoupX has been designed to err on the side of not throwing out real counts. In some cases it is more important to remove contamination than to be sure you've retained all the true counts. This is particularly true as "over-removal" will not remove all the expression of a truly expressed gene unless you set the contamination fraction to something extreme. If this describes your situation, you may want to try manually increasing the contamination rate with setContaminationFraction and seeing if this improves your results.