Closed emine2441 closed 9 months ago
There are functions called compareSCEs()
and diffSCEs()
in {splatter} which you can use to make these kinds of plots. You can also find the code for making the figures in the paper here (as mentioned in the paper) https://github.com/Oshlack/splatter-paper. The code might not be compatible with newer package versions but it should still be a good guide.
thanks. I also came across something like this. While the number of genes in real dataset is 23,000, the number of genes drops drastically when simulating in the lun2 simulation and even when estimating parameters. in other words, the number of genes as a result of simulation can be about 12.000. is this normal? and does it cause problems when comparing?
The estimation functions should adjust the number of genes to match the real dataset you are estimating parameters from. Please share your code if that's not happening.
I think the compareSCEs()
function should handle objects with different dimensions but diffSCEs()
may not.
data2 <- read.csv("E:/dataset_2.csv", header = T, row.names = 1)
annotation2 <- read.csv("E:/annotation_2.csv", row.names = 1)
sce2 <- SingleCellExperiment(
assays = list(counts = as.matrix(data2)),
colData = annotation2
)
plates2 <- as.numeric(factor(colData(sce2)$plate.barcode))
params2_lun2 <- lun2Estimate(sce2, plates2, min.size = 20)
sim2_lun2 <- lun2Simulate(params2_lun2)
here data2 is : 23,433 entries, 3370 total columns
Moreover
sim1 <- splatSimulate(nGenes = 1000, batchCells = 20)
comparison <- compareSCEs(list(Splat = sim1, lun2 =sim2_lun2))
When I run the code, I get the error "Error in .local(x, ...) : size factors should be positive". The reason for this is the sce object named sim2_lun2, which is created as a result of lun2 simulation. No matter what I did I couldn't fix it
data2 <- read.csv("E:/dataset_2.csv", header = T, row.names = 1) annotation2 <- read.csv("E:/annotation_2.csv", row.names = 1) sce2 <- SingleCellExperiment( assays = list(counts = as.matrix(data2)), colData = annotation2 ) plates2 <- as.numeric(factor(colData(sce2)$plate.barcode)) params2_lun2 <- lun2Estimate(sce2, plates2, min.size = 20) sim2_lun2 <- lun2Simulate(params2_lun2)
here data2 is : 23,433 entries, 3370 total columns
I'm not sure what the issue is here. Estimating the parameters gives a simulation with the same dimensions as the original data.
library(splatter)
sce <- scuttle::mockSCE()
params <- lun2Estimate(sce, plates, min.size = 20, verbose = FALSE)
sim <- lun2Simulate(params, verbose = FALSE)
identical(dim(sce), dim(sim))
#> [1] TRUE
When I run the code, I get the error "Error in .local(x, ...) : size factors should be positive". The reason for this is the sce object named sim2_lun2, which is created as a result of lun2 simulation. No matter what I did I couldn't fix it
I think what is happening here is that the simulation has generated one or more cells with zero counts. Please check if this is the case and try removing them before running the comparison function.
Could these genes be intersecting genes of all batches? So the number of genes common to all of them could be 13520? (13520 is number of genes after simulation) So maybe that's why the gene count is going down? how can I check this?
I'm not sure what you mean. I still haven't seen where the number of genes might be changing. Maybe you could include some code and output showing the number of genes at different stages.
This is what happens when you set the gene count to 10,000
sce <- scuttle::mockSCE(ngenes=10000, ncells=10)
plates <- as.numeric(factor(colData(sce)$Mutation_Status))
params <- lun2Estimate(sce, plates, min.size = 5, verbose = FALSE)
sim <- lun2Simulate(params, verbose = FALSE)
identical(dim(sce), dim(sim))
#> [1] FALSE
This is because the estimation procedure doesn't work correctly when there are so few cells so you get a slightly lower number genes estimated (I'm not sure exactly why, possibly because there are some genes with zero counts). If you had a larger number of cells this would be less likely to happen but pre-filtering the data that is being estimated from would be the safer approach if having exactly the same number of genes is important.
Actually, here I have limited the number of cells to 10 so that the estimation process will be short. So my own dataset has about 23,000 genes and 4000 cells. And the number of genes is dropping drastically beyond estimation. For example, the number of genes from 23,000 drops to around 13,000. In fact, as a result of the estimation process, the number of genes changes according to the plates variable I choose. For example, when I give colData(sce2)$plate.barcode as plates, I get different gene counts and when I give colData(sce2)$cluster.ids. Is this because the number of genes is reduced to the common number of genes in all batches as a result of the estimation process? Meanwhile, I checked the data. In addition, I leave the drive link of the data as an idea.
https://drive.google.com/file/d/1EsY6980jcnruC7zr61vRAm98HDqn16dI/view?usp=drive_link https://drive.google.com/file/d/1DsoC6yHMaUbXzDYw5ol5Z-qkOMKlLRKA/view?usp=drive_link
BPPARAM
argumentplates
is set differently. It should correspond to samples.lun2Estimate()
function is working correctly but unlike other simulations is does not always return the same number of genes as the input data.how can i check lowly expressed genes?
Just as you would for standard quality control by looking at the mean expression or total counts. You can calculate that using many standard analysis packages ({scater}/{scuttle}, {Seurat} etc.)
Why are the cell and gene names gene_001 gene_002 in these simulations? So, for example, don't the 1st gene in the real data set and the 1st gene in the simulation data represent the same genes?
We can't use the original gene names as a different number of genes might be simulated than what is in the original dataset. If the numbers are the same I think they should match up though (but I wouldn't rely on this without checking).
Closing this issue but please reopen if you have further questions.
I am using your splatter tool. In your article "Splatter: simulation of single-cell RNA sequencing data", I want to make a comparison like in Fig.2. I couldn't find how to do it. Can you help me?