Closed obotman closed 1 year ago
Please provide a reproducible example, otherwise this is hard to understand and I can not help… also I’m struggling with understanding a couple sentences. E.g., what do you mean by “for celltype NA gene names got lost”? Have you tried setting these to something like “untitled”?
I just replace the NA gene names by "unknown" and now I've this error in:
rem: i'm very new here, is it possible to put big file as example in GitHub?
NA gene name? I thought you said celltype name? …please be specific. - no, I didn’t mean uploading the data. But putting line by line the code leading up the error as well as relevant intermediate outputs that might help understand what’s causing the issue.
Here is an abstract of the code I use:
sce <- as.SingleCellExperiment(seuratObject, assay="SCT")
refDICE <- DatabaseImmuneCellExpressionData()
SubDICE <- SingleR(test=sce, ref=refDICE, labels=refDICE$label.fine)
sce@colData$annotationSubDICE<-SubDICE$pruned.labels
sce$annotationSubDICE[is.na(sce$annotationSubDICE)]<-"unknown"
sce$treatment <- dplyr::recode(sce$SampleID,
ABE157 = "Im1wt", ABE158 = "Im1wt", ABE159 = "Im1no", ABE160 = "Im1no", ABE161 = "Im1Im1", ABE162 = "Im1Im1",
ABE163 = "wtwt", ABE164 = "wtwt", ABE165 = "wtno", ABE166 = "wtno", ABE167 = "wtIm1", ABE168 = "wtIm1",
ABE169 = "Im2wt", ABE170 = "Im2wt", ABE171 = "Im2no", ABE172 = "Im2no", ABE173 = "Im2Im2", ABE174 = "Im2Im2"
)
contrasts_oi <- c("'Im1Im1-(Im1wt+Im1no)/2','Im2Im2-(Im2no+Im2wt)/2','wtwt-(wtIm1+wtno)/2'")
contrast_tbl <- data.frame(contrast = c("Im1Im1-(Im1wt+Im1no)/2","Im2Im2-(Im2no+Im2wt)/2","wtwt-(wtIm1+wtno)/2"), group = `c("Im1Im1","Im2Im2","wtwt"))`
cluster_id="annotationSubDICE"
sample_id="SampleID"
group_id="treatment"
muscat <- muscat_analysis(
sce = sce,
celltype_id = celltype_id,
sample_id = sample_id,
group_id = group_id,
covariates = NA,
contrasts_oi = contrasts_oi,
contrast_tbl = contrast_tbl,
assay_oi_pb = "counts",
fun_oi_pb = "sum",
de_method_oi = "edgeR",
min_cells = 10,
verbose=TRUE)
When I don't replace the NA celltype name with "unknown", I got this message:
"[1] "Calculate expression information"
Error in rownames<-
(*tmp*
, value = all_genes) :
attempt to set 'rownames' on an object with no dimensions
In addition: Warning message:
In FUN(X[[i]], ...) :
For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment."
After replacement of the NA celltype with "unknown", I got this message:
When I do exactly the same with the BlueprintDB database, it works perfectly well.
muscat_analysis is not part of the muscat package, so I have no idea what it is doing. If you wrote it as a wrapper, please provide the code that is actually being run.
I'm sorry for this misunderstanding. Anyway, thank you for your help. I will ask to the muscatWrapper developer. Best regards, Olivier
Yes, agreed. I wasn’t aware there was sich a package… I’m curious though: Why use a muscat wrapper when it’s a couple lines to run the analysis yourself? Should be easier to also understand any issues that might occur… I don’t see the benefit here, because muscat is already a wrapper around pseudobulking+edgeR (for example).
I'm sorry to insist but I would like to understand what happen so I redo the analysis with your pipeline. I obtain exactly the same "pb" objects with two differents celltype annotations (group_id). When I run pbMDS() with one celltype annotation I got this error message:
Error in if (median(f75) < 1e-20) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected
Why I've a library with zero size from on celltype annotation and not with the others?
sce <- as.SingleCellExperiment(seuObjFull, assay="SCT")
sceDICE <- prepSCE(sce1, kid = "annotationSubDICE",gid = "treatment",sid = "SampleID",drop = TRUE)
nkDICE <- length(kidsDICE <- levels(sceDICE$cluster_id))
nsDICE <- length(sidsDICE <- levels(sceDICE$sample_id))
names(kidsDICE) <- kidsDICE; names(sidsDICE) <- sidsDICE
sceBlue <- prepSCE(sce1, kid = "annotation",gid = "treatment",sid = "SampleID",drop = TRUE)
nkBlue <- length(kidsBlue <- levels(sceBlue$cluster_id))
nsBlue <- length(sidsBlue <- levels(sceBlue$sample_id))
names(kidsBlue) <- kidsBlue; names(sidsBlue) <- sidsBlue
# # Data overview
t(table(sceDICE$cluster_id, sce$sample_id))
t(table(sceBlue$cluster_id, sce$sample_id))
pbDICE <- aggregateData(sceDICE,
assay = "counts", fun = "sum",
by = c("cluster_id", "sample_id"))
> pbDICE
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(13): B cells, naive NK cells ... T cells, CD8+, naive,
stimulated unknown
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
> pbMDS(pbDICE)
Error in if (median(f75) < 1e-20) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected
pbBlue <- aggregateData(sceBlue,
assay = "counts", fun = "sum",
by = c("cluster_id", "sample_id"))
> pbBlue
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(15): CD4+ T-cells CD4+ Tcm ... Plasma cells Tregs
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
pb
), so it is not impossible this might happen.table(sce$cluster_id, sce$sample_id)
for the different annotations - maybe there are missing combinations? Also worth checking the library sizes to see where the evil lies, e.g., via sapply(assays(pb), \(.) range(colSums(.)))
for both pbBlue
and pbDICE
.(*To avoid confusion: you said "two different celltype annotations group_id
". But in the context of this analysis, group_id
= experimental condition and cluster_id
= cell type / subpopulation annotation.)
*to avoid confusion: group_id = experimental conditions and cluster_id = cellType annotation (from "Blue" and "DICE" database).
>t(table(sceBlue$cluster_id, sceBlue$sample_id))
Blue CD4+ T-cells CD4+ Tcm CD4+ Tem CD8+ T-cells CD8+ Tcm CD8+ Tem CLP CMP DC Erythrocytes GMP MEP NK cells Plasma cells Tregs
ABE157 2 25 4755 0 119 6 0 0 17 0 0 0 3 0 61
ABE158 4 36 5503 0 149 10 0 0 13 0 0 0 2 0 75
ABE159 3 26 2957 0 101 13 0 0 3 0 0 0 2 0 30
ABE160 0 11 1728 0 57 2 0 0 1 0 0 0 1 0 9
ABE161 0 29 3977 0 172 3 0 0 26 3 0 1 2 1 1020
ABE162 0 22 6838 0 292 2 0 0 31 2 0 1 0 0 1537
ABE163 1 50 4273 0 152 5 0 0 14 11 1 6 1 4 883
ABE164 0 49 4658 0 180 4 0 0 29 10 1 4 0 0 885
ABE165 1 15 6878 0 205 10 0 0 5 0 0 0 0 0 87
ABE166 0 22 6554 0 149 4 0 0 4 0 0 0 0 0 63
ABE167 0 34 7230 0 139 6 0 0 26 7 1 2 0 0 446
ABE168 1 27 7148 0 124 7 0 0 13 6 0 4 0 0 380
ABE169 2 65 3190 1 126 3 0 1 48 22 1 2 0 0 440
ABE170 3 65 4192 0 199 1 0 2 41 18 0 6 0 2 548
ABE171 1 54 6428 0 110 1 0 0 40 5 0 6 0 1 150
ABE172 3 46 4788 0 55 2 1 0 26 5 0 1 0 0 108
ABE173 4 101 4553 1 248 5 1 0 22 19 2 6 3 0 622
ABE174 7 101 4354 0 224 4 0 0 15 18 1 4 0 1 536
>t(table(sceDICE$cluster_id, sceDICE$sample_id))
DICE B cells, naïve NK cells T cells, CD4+, memory TREG T cells, CD4+, naïve stimulated T cells CD4+, naïve T cells, CD4+, TFH T cells, CD4+, Th1 T cells, CD4+, Th1_17 T cells, CD4+, Th17 T cells, CD4+, Th2 T cells, CD8+, naïve T cells, CD8+, naïve stimulated
ABE157 0 42 2041 0 106 8 1415 26 210 1040 0 89
ABE158 0 52 2378 1 112 7 1850 27 212 1038 0 99
ABE159 0 42 1139 0 52 9 984 18 164 691 0 29
ABE160 0 16 643 0 36 0 581 17 122 372 0 21
ABE161 0 45 4177 0 91 6 725 3 14 46 0 122
ABE162 0 77 6808 0 189 5 1297 13 13 61 0 240
ABE163 0 0 3329 0 635 1 1143 15 30 23 1 219
ABE164 0 2 3437 0 1009 1 1067 15 35 24 0 226
ABE165 0 1 2327 0 189 4 4305 81 72 78 0 141
ABE166 1 1 2256 0 166 10 3950 89 105 84 0 131
ABE167 0 5 3816 0 1361 1 2298 33 20 28 0 325
ABE168 0 4 3537 0 1301 0 2416 31 28 27 0 360
ABE169 0 4 2117 0 583 2 964 11 1 15 0 203
ABE170 0 0 2902 0 637 9 1275 16 5 12 0 218
ABE171 0 5 2972 0 710 7 2782 59 16 42 0 187
ABE172 0 9 2024 0 621 4 2091 60 22 49 0 137
ABE173 0 1 2418 0 1840 9 936 23 7 7 0 336
ABE174 0 1 2198 0 1919 8 807 20 7 11 0 279
> str(lapply(assays(pbBlue), \(.) range(colSums(.))))
List of 15
$ CD4+ T-cells: num [1:2] 0 89319
$ CD4+ Tcm : num [1:2] 136239 1234767
$ CD4+ Tem : num [1:2] 21289702 84754447
$ CD8+ T-cells: num [1:2] 0 12653
$ CD8+ Tcm : num [1:2] 635327 3367515
$ CD8+ Tem : num [1:2] 11002 144722
$ CLP : num [1:2] 0 12936
$ CMP : num [1:2] 0 24806
$ DC : num [1:2] 12880 608048
$ Erythrocytes: num [1:2] 0 277991
$ GMP : num [1:2] 0 25527
$ MEP : num [1:2] 0 77227
$ NK cells : num [1:2] 0 35863
$ Plasma cells: num [1:2] 0 50804
$ Tregs : num [1:2] 114381 18990948
> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 12
$ B cells, naive : num [1:2] 0 0
$ NK cells : num [1:2] 0 801166
$ T cells, CD4+, memory TREG : num [1:2] 7993607 81680996
$ T cells, CD4+, naive : num [1:2] 0 0
$ T cells, CD4+, naive, stimulated: num [1:2] 454821 23211410
$ T cells, CD4+, TFH : num [1:2] 0 111842
$ T cells, CD4+, Th1 : num [1:2] 7064002 48864037
$ T cells, CD4+, Th1_17 : num [1:2] 31132 968705
$ T cells, CD4+, Th17 : num [1:2] 11144 2486473
$ T cells, CD4+, Th2 : num [1:2] 77373 12475597
$ T cells, CD8+, naive : num [1:2] 0 0
$ T cells, CD8+, naive, stimulated: num [1:2] 261069 4276299
So it's a little more clear. If I understand, I need to remove the cellType 1, 4, & 11 because there is only 1 cells in these cellTypes. But, just for my understanding, how is it possible to have no gene expressed while I have at least 1 cells?
Thanks again for your help. Olivier
Hello, Finally, I choose 4 cellType with enough cells (the "keep" ones) to run pbDS analysis. But pbDS() excluded 2 of these cellTypes (1&4 in the list above are rejected) while I have enough cells to run DE analysis. Could you tell me why?
Thanks a lot
> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 4
$ T_cells_CD4+_memory_TREG : num [1:2] 7993607 81680996
$ T_cells_CD4+_naive_stimulated: num [1:2] 454821 23211410
$ T_cells_CD4+_Th1 : num [1:2] 7064002 48864037
$ T_cells_CD8+_naive_stimulated: num [1:2] 261069 4276299
ABE157 ABE158 ABE159 ABE160 ABE161 ABE162 ABE163 ABE164 ABE165 ABE166 ABE167 ABE168 ABE169 ABE170 ABE171 ABE172 ABE173 ABE174
NK_cells 42 52 42 16 45 77 0 2 1 1 5 4 4 0 5 9 1 1
Keep T_cells_CD4+_memory_TREG 2041 2378 1139 643 4177 6808 3329 3437 2327 2256 3816 3537 2117 2902 2972 2024 2418 2198
Keep T_cells_CD4+_naive_stimulated 106 112 52 36 91 189 635 1009 189 166 1361 1301 583 637 710 621 1840 1919
T_cells_CD4+_TFH 8 7 9 0 6 5 1 1 4 10 1 0 2 9 7 4 9 8
Keep T_cells_CD4+_Th1 1415 1850 984 581 725 1297 1143 1067 4305 3950 2298 2416 964 1275 2782 2091 936 807
T_cells_CD4+_Th1_17 26 27 18 17 3 13 15 15 81 89 33 31 11 16 59 60 23 20
T_cells_CD4+_Th17 210 212 164 122 14 13 30 35 72 105 20 28 1 5 16 22 7 7
T_cells_CD4+_Th2 1040 1038 691 372 46 61 23 24 78 84 28 27 15 12 42 49 7 11
Keep T_cells_CD8+_naive_stimulated 89 99 29 21 122 240 219 226 141 131 3
contrastDICE1 <- makeContrasts(c("Im1Im1-(Im1wt+Im1no)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE2 <- makeContrasts(c("Im2Im2-(Im2no+Im2wt)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE3 <- makeContrasts(c("wtwt-(wtIm1+wtno)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE<-cbind(contrastDICE1,contrastDICE2,contrastDICE3)
DEDICE<-pbDS(pbDICE, design = mmDICE, contrast = contrastDICE)
It's really hard to grasp the data from what you're posting... But I'd say that from the cell counts by cluster-sample you cannot say directly whether or not you have enough cells to run differential analyses. This depends on how many cells there are per group or, otherwise put, how many samples have sufficiently many cells in each group. At the very least, two samples per group (depending on min_samples
) need to have enough cells (depending on min_cells
). Then, depending on filter
ing, these also have non-zero counts for some genes. Overall, it's not as simple as saying "I have enough cells". In addition, I would advice to caution: Even if there are just enough cells, I wonder whether you should "trust" results from testing a hand full of cells across many samples... Instead, a better approach might be do lower the resolution, i.e., merge some biologically similar subpopulation into broader groups such that they have a decent number of cells across all/most samples. Hope that all makes sense?
So even if I have enough cells/samples in a cell cluster, I should consider that no pbDS output for that cluster could be due to lack of differential gene expression rather than lack of cells .
No, you will get results per gene (p-values, FDR, logFCs) independent of whether or not there's anything differential. But a cluster not being tested (though it has many cells in total) could indicate that there are not enough cells per sample per group... As I said: There need to be at least 2 samples in each group with at least N cells that express a gene.
I've read some documentation on prepSIM and edgeR In the muscatWrapper the settings for pb::edegeR are: filterByExpr.min.count = 7, filterByExpr.min.total.count = 15, filterByExpr.large.n = 4, filterByExpr.min.prop = 0.7
And in prepSIM: min_count = 1, min_cells = 10, min_genes = 100, min_size = 100,
So now it make senses why I'm losing cluster (cell subpopulation).
Thanks again
Hello, I've annotated my sce with BlueprintDB database in SingleR and it's working. When I try to run muscat_analysis on the same sce annotated with DICE database, muscat stop the analysis during calcExprFreqs. But I'm able to run "manually" get_DE-info(). I've fully compare the two databases but I don't find the bug.
Error Message : Error in
rownames<-
(*tmp*
, value = all_genes) : attempt to set 'rownames' on an object with no dimensions In addition: Warning message: In FUN(X[[i]], ...) : For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment.Any idea?
Best regards, Olivier