Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0
217 stars 57 forks source link

function in splatter for comparing simulations #166

Closed emine2441 closed 9 months ago

emine2441 commented 1 year ago

I am using your splatter tool. In your article "Splatter: simulation of single-cell RNA sequencing data", I want to make a comparison like in Fig.2. I couldn't find how to do it. Can you help me?

lazappi commented 1 year ago

There are functions called compareSCEs() and diffSCEs() in {splatter} which you can use to make these kinds of plots. You can also find the code for making the figures in the paper here (as mentioned in the paper) https://github.com/Oshlack/splatter-paper. The code might not be compatible with newer package versions but it should still be a good guide.

emine2441 commented 1 year ago

thanks. I also came across something like this. While the number of genes in real dataset is 23,000, the number of genes drops drastically when simulating in the lun2 simulation and even when estimating parameters. in other words, the number of genes as a result of simulation can be about 12.000. is this normal? and does it cause problems when comparing?

lazappi commented 1 year ago

The estimation functions should adjust the number of genes to match the real dataset you are estimating parameters from. Please share your code if that's not happening.

I think the compareSCEs() function should handle objects with different dimensions but diffSCEs() may not.

emine2441 commented 1 year ago
data2 <- read.csv("E:/dataset_2.csv", header = T, row.names = 1) 
annotation2 <- read.csv("E:/annotation_2.csv", row.names = 1)
sce2 <- SingleCellExperiment(
  assays = list(counts = as.matrix(data2)),
  colData = annotation2
)

plates2 <- as.numeric(factor(colData(sce2)$plate.barcode))
params2_lun2 <- lun2Estimate(sce2, plates2, min.size = 20)
  sim2_lun2 <- lun2Simulate(params2_lun2)

here data2 is : 23,433 entries, 3370 total columns

emine2441 commented 1 year ago

Moreover

sim1 <- splatSimulate(nGenes = 1000, batchCells = 20)
comparison <- compareSCEs(list(Splat = sim1, lun2 =sim2_lun2))

When I run the code, I get the error "Error in .local(x, ...) : size factors should be positive". The reason for this is the sce object named sim2_lun2, which is created as a result of lun2 simulation. No matter what I did I couldn't fix it

lazappi commented 1 year ago
data2 <- read.csv("E:/dataset_2.csv", header = T, row.names = 1) 
annotation2 <- read.csv("E:/annotation_2.csv", row.names = 1)
sce2 <- SingleCellExperiment(
  assays = list(counts = as.matrix(data2)),
  colData = annotation2
)

plates2 <- as.numeric(factor(colData(sce2)$plate.barcode))
params2_lun2 <- lun2Estimate(sce2, plates2, min.size = 20)
  sim2_lun2 <- lun2Simulate(params2_lun2)

here data2 is : 23,433 entries, 3370 total columns

I'm not sure what the issue is here. Estimating the parameters gives a simulation with the same dimensions as the original data.

library(splatter)
sce <- scuttle::mockSCE()
params <- lun2Estimate(sce, plates, min.size = 20, verbose = FALSE)
sim <- lun2Simulate(params, verbose = FALSE)
identical(dim(sce), dim(sim))
#> [1] TRUE

When I run the code, I get the error "Error in .local(x, ...) : size factors should be positive". The reason for this is the sce object named sim2_lun2, which is created as a result of lun2 simulation. No matter what I did I couldn't fix it

I think what is happening here is that the simulation has generated one or more cells with zero counts. Please check if this is the case and try removing them before running the comparison function.

emine2441 commented 1 year ago

Could these genes be intersecting genes of all batches? So the number of genes common to all of them could be 13520? (13520 is number of genes after simulation) So maybe that's why the gene count is going down? how can I check this?

lazappi commented 1 year ago

I'm not sure what you mean. I still haven't seen where the number of genes might be changing. Maybe you could include some code and output showing the number of genes at different stages.

emine2441 commented 1 year ago

This is what happens when you set the gene count to 10,000

sce <- scuttle::mockSCE(ngenes=10000, ncells=10)
plates <- as.numeric(factor(colData(sce)$Mutation_Status))
params <- lun2Estimate(sce, plates, min.size = 5, verbose = FALSE)
sim <- lun2Simulate(params, verbose = FALSE)
identical(dim(sce), dim(sim))
#> [1] FALSE
lazappi commented 1 year ago

This is because the estimation procedure doesn't work correctly when there are so few cells so you get a slightly lower number genes estimated (I'm not sure exactly why, possibly because there are some genes with zero counts). If you had a larger number of cells this would be less likely to happen but pre-filtering the data that is being estimated from would be the safer approach if having exactly the same number of genes is important.

emine2441 commented 1 year ago

Actually, here I have limited the number of cells to 10 so that the estimation process will be short. So my own dataset has about 23,000 genes and 4000 cells. And the number of genes is dropping drastically beyond estimation. For example, the number of genes from 23,000 drops to around 13,000. In fact, as a result of the estimation process, the number of genes changes according to the plates variable I choose. For example, when I give colData(sce2)$plate.barcode as plates, I get different gene counts and when I give colData(sce2)$cluster.ids. Is this because the number of genes is reduced to the common number of genes in all batches as a result of the estimation process? Meanwhile, I checked the data. In addition, I leave the drive link of the data as an idea.

https://drive.google.com/file/d/1EsY6980jcnruC7zr61vRAm98HDqn16dI/view?usp=drive_link https://drive.google.com/file/d/1DsoC6yHMaUbXzDYw5ol5Z-qkOMKlLRKA/view?usp=drive_link

lazappi commented 1 year ago
emine2441 commented 1 year ago

how can i check lowly expressed genes?

lazappi commented 1 year ago

Just as you would for standard quality control by looking at the mean expression or total counts. You can calculate that using many standard analysis packages ({scater}/{scuttle}, {Seurat} etc.)

emine2441 commented 1 year ago

Why are the cell and gene names gene_001 gene_002 in these simulations? So, for example, don't the 1st gene in the real data set and the 1st gene in the simulation data represent the same genes?

lazappi commented 1 year ago

We can't use the original gene names as a different number of genes might be simulated than what is in the original dataset. If the numbers are the same I think they should match up though (but I wouldn't rely on this without checking).

lazappi commented 9 months ago

Closing this issue but please reopen if you have further questions.