Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
GNU General Public License v3.0
213 stars 56 forks source link

splatPop (unknown correlation between gene expression level in different cell types) #147

Closed Fatima-Zare closed 2 years ago

Fatima-Zare commented 2 years ago

Hi,

I'm using Splatter to generate single-cell simulated data.

I'm using these parameters:

###generate single-cell simulated object
K=5### number of Cell types
Ng=10 ###number of genes
Ns=20###number of samples
vcf <- mockVCF(n.samples = Ns)
gff <- mockGFF(n.genes = Ng)
params.group <- newSplatPopParams(batchCells =100,#Number of cells in each batch.
                                  similarity.scale =1,
                                  de.downProb = c(0.1, 0.4, 0.3, 0.6, 0.5),
                                  de.prob = c(0.3, 0.1, 0.2, 0.01, 0.4),
                                  de.facLoc = c(0.6, 5, 0.1, 0.01, 2), 
                                  de.facScale = c(0.1, 0.4, 2, 0.5, 0.4),
                                  group.prob = rep(1/K,K),
                               )
sim.means <- splatPopSimulateMeans(vcf = vcf, gff = gff,
                                   params = params.group)
sim.sc.gr <- splatPopSimulateSC(params=params.group, 
                                key = sim.means$key,
                                sim.means=sim.means$means,
                                sparsify = FALSE)

sce= sim.sc.gr
sce=logNormCounts(sce)
SCcount=assays(sce)$logcounts

Then you can see the heatmap of the normalized count matrix (SCcount) in figure 1. NormalizedCountMatrix

Then I aggregated the SCcount matrix across cluster-sample groups.

###Aggregate count matrix across cluster-sample groups
pb <- aggregate.Matrix(t(assays(sce)$logcounts), 
                       groupings = groups, fun = "mean") 

annot_cols = data.frame(
  Group = rep(apply(expand.grid(c("CellA","CellB",'CellC',"CellD","CellE")), 1, paste, collapse="."), each=Ns), 
  row.names = colnames(t(pb))
)
g=pheatmap::pheatmap(t(pb),annotation_col = annot_cols,cluster_rows = F,cluster_cols = F,show_colnames = F)
ggsave(g,file=paste(path,'sample-Celltypeheatmap.pdf',sep=""),width = 30,height = 10,limitsize = FALSE)

you can see the aggregated Sample-Celltype matrix in figure 2: sample-Celltypeheatmap

Then, from the aggregated Sample-Celltype matrix, I made another matrix to show gene expression levels in each CellType across all samples like:

###gene expression levels in each CellType across all samples
gen=c(7,9,4,10,3,1,2,8,5,6)
cellg=matrix(c(as.matrix(pb)),nrow=Ng*K,ncol=Ns,byrow=TRUE)

rownames(cellg)=apply(expand.grid(c(paste('celltype',LETTERS[1:K],sep='')),c(paste('g',gen,sep = ''))), 1, paste, collapse=".")
colnames(cellg)=apply(expand.grid(c(paste('S',1:Ns,sep = ''))), 1, paste, collapse=".")

annot_rows= data.frame(
  Group = rep(apply(expand.grid(c(paste('g',gen,sep = ''))), 1, paste, collapse="."), each=5), 
  row.names = rownames(cellg)
)
g=pheatmap::pheatmap(cellg,cluster_rows = F,cluster_cols = F,fontsize = 10,annotation_row =annot_rows,show_rownames = F )
ggsave(g,file=paste(path,'cellgheatmap.pdf',sep=''),width = 10,height = 5,limitsize = FALSE)

you can see the heatmap of the cellg matrix in figure 3:

cellgheatmap

Then I calculate the correlation between rows of Matric cells.

###correlation between each rows of cellg matrix
mm=data.frame(t(cellg))
cor(mm)
g= pheatmap::pheatmap(cor(mm),cluster_rows = F,cluster_cols = F,display_numbers = F,fontsize = 20)
ggsave(g,file=paste(path,"Corelation.pdf",sep=""),dpi = 1000,width = 50,height = 50,limitsize = FALSE)

you can see the heatmap of this correlation in figure 4.

Corelation

Now I have some questions:

In figure 3, we can see an unknown structure (the blue blocks). It seems some genes are off in all of the cell types. This leads to a high correlation between rows of the matrix cellg (As you can see in figure 4).

1- How can I get rid of the blue and red blocks in figure 3 and also the red blocks in the Correlation matrix (figure4). 2- How are samples generated in the Splatter object? 3-Is it a multiplicative factor to create the samples? Is each sample just a multiplicative version of all the others?

I appreciate your help.

lazappi commented 2 years ago

Duplicate Bioconductor forum issue https://support.bioconductor.org/p/9143809/

azodichr commented 2 years ago

Hi @Fatima-Zare, Unfortunately I am having a difficult time understanding what you are trying to achieve in terms of "getting rid of the blue and red blocks". In Figure 3 that indicates genes that are simulated to be more highly or lowly expressed for the specific donor-cell-gene. For example, G4 looks to be simulated as a highly expressed gene in all celltypes, while G6 was simulated to have higher expression in celltype C. If you do not want these type of cell-type specific effects you can remove them from your simulation or you can reduce the DE effect sizes by adjusting the de parameters. I would also think the red boxes in Fig 4 are desired, you would expect gene A expression levels in cell-type X to be more correlated with gene A expression levels in cell-type Y than with gene B expression levels. Maybe I am missing something here?

The approach used to simulate data for individuals in the population is described in detail here and the details for how the values are simulated for each cell are described in detail here. Note that while multiplicative factors are being used to do things like simulate DE and eQTL, they are not the only difference between samples.

Fatima-Zare commented 2 years ago

thank you for your reply. In figure 3, For some donors, gene 9 and gene 7 are both off in all cell types. How can I get rid of these big blue boxes? Is that normal that a gene expression level is zero for all Cell types? Is that related to DE effect or multiplicative factor?

azodichr commented 2 years ago

With single cell data zeros are quite common for lowly expressed genes (https://www.nature.com/articles/s41588-021-00873-4.pdf?proof=t) - and if some gene for some donor is sampled to have a mean exp at or close to zero then the cell group specific DE/eQTL effects will have no/little impact.

lazappi commented 2 years ago

Hi @Fatima-Zare were we able to answer your question? Please comment if anything is still unclear.

lazappi commented 2 years ago

I'm going to close this now but please reopen if you need extra information