muscat_analysis don't accept DICE annotation

obotman commented 1 year ago

Hello, I've annotated my sce with BlueprintDB database in SingleR and it's working. When I try to run muscat_analysis on the same sce annotated with DICE database, muscat stop the analysis during calcExprFreqs. But I'm able to run "manually" get_DE-info(). I've fully compare the two databases but I don't find the bug.

Error Message : Error in rownames<-(*tmp*, value = all_genes) : attempt to set 'rownames' on an object with no dimensions In addition: Warning message: In FUN(X[[i]], ...) : For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment.

Any idea?

Best regards, Olivier

HelenaLC commented 1 year ago

Please provide a reproducible example, otherwise this is hard to understand and I can not help… also I’m struggling with understanding a couple sentences. E.g., what do you mean by “for celltype NA gene names got lost”? Have you tried setting these to something like “untitled”?

obotman commented 1 year ago

I just replace the NA gene names by "unknown" and now I've this error in:

Capture

rem: i'm very new here, is it possible to put big file as example in GitHub?

HelenaLC commented 1 year ago

NA gene name? I thought you said celltype name? …please be specific. - no, I didn’t mean uploading the data. But putting line by line the code leading up the error as well as relevant intermediate outputs that might help understand what’s causing the issue.

obotman commented 1 year ago

Here is an abstract of the code I use:

sce <- as.SingleCellExperiment(seuratObject, assay="SCT")
refDICE <- DatabaseImmuneCellExpressionData()
SubDICE <- SingleR(test=sce, ref=refDICE, labels=refDICE$label.fine)
sce@colData$annotationSubDICE<-SubDICE$pruned.labels
sce$annotationSubDICE[is.na(sce$annotationSubDICE)]<-"unknown"
sce$treatment <- dplyr::recode(sce$SampleID,
  ABE157 = "Im1wt", ABE158 = "Im1wt", ABE159 = "Im1no", ABE160 = "Im1no", ABE161 = "Im1Im1", ABE162 = "Im1Im1",
  ABE163 = "wtwt", ABE164 = "wtwt", ABE165 = "wtno", ABE166 = "wtno", ABE167 = "wtIm1", ABE168 = "wtIm1",
  ABE169 = "Im2wt", ABE170 = "Im2wt", ABE171 = "Im2no", ABE172 = "Im2no", ABE173 = "Im2Im2", ABE174 = "Im2Im2"
)
contrasts_oi <- c("'Im1Im1-(Im1wt+Im1no)/2','Im2Im2-(Im2no+Im2wt)/2','wtwt-(wtIm1+wtno)/2'")
contrast_tbl <- data.frame(contrast = c("Im1Im1-(Im1wt+Im1no)/2","Im2Im2-(Im2no+Im2wt)/2","wtwt-(wtIm1+wtno)/2"), group = `c("Im1Im1","Im2Im2","wtwt"))`
cluster_id="annotationSubDICE"
sample_id="SampleID"
group_id="treatment"
muscat <- muscat_analysis(
     sce = sce,
     celltype_id = celltype_id,
     sample_id = sample_id,
     group_id = group_id,
     covariates = NA,
     contrasts_oi = contrasts_oi,
     contrast_tbl = contrast_tbl,
     assay_oi_pb = "counts",
     fun_oi_pb = "sum",
     de_method_oi = "edgeR",
     min_cells = 10,
     verbose=TRUE)

When I don't replace the NA celltype name with "unknown", I got this message:

"[1] "Calculate expression information" Error in rownames<-(*tmp*, value = all_genes) : attempt to set 'rownames' on an object with no dimensions In addition: Warning message: In FUN(X[[i]], ...) : For Celltype NA gene names got lost while calculating the fraction of expression of each gene with muscat::calcExprFreqs (due to a bug in this function). We temporality fixed this ourselves for the moment."

After replacement of the NA celltype with "unknown", I got this message:

Capture

When I do exactly the same with the BlueprintDB database, it works perfectly well.

HelenaLC commented 1 year ago

muscat_analysis is not part of the muscat package, so I have no idea what it is doing. If you wrote it as a wrapper, please provide the code that is actually being run.

obotman commented 1 year ago

I'm sorry for this misunderstanding. Anyway, thank you for your help. I will ask to the muscatWrapper developer. Best regards, Olivier

HelenaLC commented 1 year ago

Yes, agreed. I wasn’t aware there was sich a package… I’m curious though: Why use a muscat wrapper when it’s a couple lines to run the analysis yourself? Should be easier to also understand any issues that might occur… I don’t see the benefit here, because muscat is already a wrapper around pseudobulking+edgeR (for example).

obotman commented 1 year ago

I'm sorry to insist but I would like to understand what happen so I redo the analysis with your pipeline. I obtain exactly the same "pb" objects with two differents celltype annotations (group_id). When I run pbMDS() with one celltype annotation I got this error message:

Error in if (median(f75) < 1e-20) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected

Why I've a library with zero size from on celltype annotation and not with the others?

sce <- as.SingleCellExperiment(seuObjFull, assay="SCT")
sceDICE <- prepSCE(sce1, kid = "annotationSubDICE",gid = "treatment",sid = "SampleID",drop = TRUE)
nkDICE <- length(kidsDICE <- levels(sceDICE$cluster_id))
nsDICE <- length(sidsDICE <- levels(sceDICE$sample_id))
names(kidsDICE) <- kidsDICE; names(sidsDICE) <- sidsDICE
sceBlue <- prepSCE(sce1, kid = "annotation",gid = "treatment",sid = "SampleID",drop = TRUE)
nkBlue <- length(kidsBlue <- levels(sceBlue$cluster_id))
nsBlue <- length(sidsBlue <- levels(sceBlue$sample_id))
names(kidsBlue) <- kidsBlue; names(sidsBlue) <- sidsBlue

# # Data overview
t(table(sceDICE$cluster_id, sce$sample_id))
t(table(sceBlue$cluster_id, sce$sample_id))

pbDICE <- aggregateData(sceDICE,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
> pbDICE
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(13): B cells, naive NK cells ... T cells, CD8+, naive,
  stimulated unknown
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
> pbMDS(pbDICE)
Error in if (median(f75) < 1e-20) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In DGEList(unname(y), remove.zeros = TRUE) : library size of zero detected

pbBlue <- aggregateData(sceBlue,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
> pbBlue
class: SingleCellExperiment
dim: 26345 18
metadata(2): experiment_info agg_pars
assays(15): CD4+ T-cells CD4+ Tcm ... Plasma cells Tregs
rownames(26345): AL627309.1 AL627309.5 ... AC007325.4 AC007325.2
rowData names(0):
colnames(18): ABE157 ABE158 ... ABE173 ABE174
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

HelenaLC commented 1 year ago

Zero library sizes mean that some cluster-sample combinations do not express any features. Different annotations will not give the exact same pseudobulks (pb), so it is not impossible this might happen.
Could you perhaps print the output of table(sce$cluster_id, sce$sample_id) for the different annotations - maybe there are missing combinations? Also worth checking the library sizes to see where the evil lies, e.g., via sapply(assays(pb), \(.) range(colSums(.))) for both pbBlue and pbDICE.

(*To avoid confusion: you said "two different celltype annotations group_id". But in the context of this analysis, group_id = experimental condition and cluster_id = cell type / subpopulation annotation.)

obotman commented 1 year ago

*to avoid confusion: group_id = experimental conditions and cluster_id = cellType annotation (from "Blue" and "DICE" database).

>t(table(sceBlue$cluster_id, sceBlue$sample_id))
Blue    CD4+ T-cells    CD4+ Tcm    CD4+ Tem    CD8+ T-cells    CD8+ Tcm    CD8+ Tem    CLP CMP DC  Erythrocytes    GMP MEP NK cells    Plasma cells    Tregs
ABE157  2   25  4755    0   119 6   0   0   17  0   0   0   3   0   61
ABE158  4   36  5503    0   149 10  0   0   13  0   0   0   2   0   75
ABE159  3   26  2957    0   101 13  0   0   3   0   0   0   2   0   30
ABE160  0   11  1728    0   57  2   0   0   1   0   0   0   1   0   9
ABE161  0   29  3977    0   172 3   0   0   26  3   0   1   2   1   1020
ABE162  0   22  6838    0   292 2   0   0   31  2   0   1   0   0   1537
ABE163  1   50  4273    0   152 5   0   0   14  11  1   6   1   4   883
ABE164  0   49  4658    0   180 4   0   0   29  10  1   4   0   0   885
ABE165  1   15  6878    0   205 10  0   0   5   0   0   0   0   0   87
ABE166  0   22  6554    0   149 4   0   0   4   0   0   0   0   0   63
ABE167  0   34  7230    0   139 6   0   0   26  7   1   2   0   0   446
ABE168  1   27  7148    0   124 7   0   0   13  6   0   4   0   0   380
ABE169  2   65  3190    1   126 3   0   1   48  22  1   2   0   0   440
ABE170  3   65  4192    0   199 1   0   2   41  18  0   6   0   2   548
ABE171  1   54  6428    0   110 1   0   0   40  5   0   6   0   1   150
ABE172  3   46  4788    0   55  2   1   0   26  5   0   1   0   0   108
ABE173  4   101 4553    1   248 5   1   0   22  19  2   6   3   0   622
ABE174  7   101 4354    0   224 4   0   0   15  18  1   4   0   1   536

>t(table(sceDICE$cluster_id, sceDICE$sample_id))
DICE    B cells, naïve  NK cells    T cells, CD4+, memory TREG  T cells, CD4+, naïve stimulated T cells CD4+, naïve T cells, CD4+, TFH  T cells, CD4+, Th1  T cells, CD4+, Th1_17   T cells, CD4+, Th17 T cells, CD4+, Th2  T cells, CD8+, naïve    T cells, CD8+, naïve stimulated
ABE157  0   42  2041    0   106 8   1415    26  210 1040    0   89
ABE158  0   52  2378    1   112 7   1850    27  212 1038    0   99
ABE159  0   42  1139    0   52  9   984 18  164 691 0   29
ABE160  0   16  643 0   36  0   581 17  122 372 0   21
ABE161  0   45  4177    0   91  6   725 3   14  46  0   122
ABE162  0   77  6808    0   189 5   1297    13  13  61  0   240
ABE163  0   0   3329    0   635 1   1143    15  30  23  1   219
ABE164  0   2   3437    0   1009    1   1067    15  35  24  0   226
ABE165  0   1   2327    0   189 4   4305    81  72  78  0   141
ABE166  1   1   2256    0   166 10  3950    89  105 84  0   131
ABE167  0   5   3816    0   1361    1   2298    33  20  28  0   325
ABE168  0   4   3537    0   1301    0   2416    31  28  27  0   360
ABE169  0   4   2117    0   583 2   964 11  1   15  0   203
ABE170  0   0   2902    0   637 9   1275    16  5   12  0   218
ABE171  0   5   2972    0   710 7   2782    59  16  42  0   187
ABE172  0   9   2024    0   621 4   2091    60  22  49  0   137
ABE173  0   1   2418    0   1840    9   936 23  7   7   0   336
ABE174  0   1   2198    0   1919    8   807 20  7   11  0   279

> str(lapply(assays(pbBlue), \(.) range(colSums(.))))
List of 15
 $ CD4+ T-cells: num [1:2] 0 89319
 $ CD4+ Tcm    : num [1:2] 136239 1234767
 $ CD4+ Tem    : num [1:2] 21289702 84754447
 $ CD8+ T-cells: num [1:2] 0 12653
 $ CD8+ Tcm    : num [1:2] 635327 3367515
 $ CD8+ Tem    : num [1:2] 11002 144722
 $ CLP         : num [1:2] 0 12936
 $ CMP         : num [1:2] 0 24806
 $ DC          : num [1:2] 12880 608048
 $ Erythrocytes: num [1:2] 0 277991
 $ GMP         : num [1:2] 0 25527
 $ MEP         : num [1:2] 0 77227
 $ NK cells    : num [1:2] 0 35863
 $ Plasma cells: num [1:2] 0 50804
 $ Tregs       : num [1:2] 114381 18990948

> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 12
 $ B cells, naive                  : num [1:2] 0 0
 $ NK cells                        : num [1:2] 0 801166
 $ T cells, CD4+, memory TREG      : num [1:2] 7993607 81680996
 $ T cells, CD4+, naive            : num [1:2] 0 0
 $ T cells, CD4+, naive, stimulated: num [1:2] 454821 23211410
 $ T cells, CD4+, TFH              : num [1:2] 0 111842
 $ T cells, CD4+, Th1              : num [1:2] 7064002 48864037
 $ T cells, CD4+, Th1_17           : num [1:2] 31132 968705
 $ T cells, CD4+, Th17             : num [1:2] 11144 2486473
 $ T cells, CD4+, Th2              : num [1:2] 77373 12475597
 $ T cells, CD8+, naive            : num [1:2] 0 0
 $ T cells, CD8+, naive, stimulated: num [1:2] 261069 4276299

So it's a little more clear. If I understand, I need to remove the cellType 1, 4, & 11 because there is only 1 cells in these cellTypes. But, just for my understanding, how is it possible to have no gene expressed while I have at least 1 cells?

Thanks again for your help. Olivier

obotman commented 1 year ago

Hello, Finally, I choose 4 cellType with enough cells (the "keep" ones) to run pbDS analysis. But pbDS() excluded 2 of these cellTypes (1&4 in the list above are rejected) while I have enough cells to run DE analysis. Could you tell me why?

Thanks a lot

> str(lapply(assays(pbDICE), \(.) range(colSums(.))))
List of 4
 $ T_cells_CD4+_memory_TREG     : num [1:2] 7993607 81680996
 $ T_cells_CD4+_naive_stimulated: num [1:2] 454821 23211410
 $ T_cells_CD4+_Th1             : num [1:2] 7064002 48864037
 $ T_cells_CD8+_naive_stimulated: num [1:2] 261069 4276299

    ABE157  ABE158  ABE159  ABE160  ABE161  ABE162  ABE163  ABE164  ABE165  ABE166  ABE167  ABE168  ABE169  ABE170  ABE171  ABE172  ABE173  ABE174
    NK_cells    42  52  42  16  45  77  0   2   1   1   5   4   4   0   5   9   1   1
Keep    T_cells_CD4+_memory_TREG    2041    2378    1139    643 4177    6808    3329    3437    2327    2256    3816    3537    2117    2902    2972    2024    2418    2198
Keep    T_cells_CD4+_naive_stimulated   106 112 52  36  91  189 635 1009    189 166 1361    1301    583 637 710 621 1840    1919
    T_cells_CD4+_TFH    8   7   9   0   6   5   1   1   4   10  1   0   2   9   7   4   9   8
Keep    T_cells_CD4+_Th1    1415    1850    984 581 725 1297    1143    1067    4305    3950    2298    2416    964 1275    2782    2091    936 807
    T_cells_CD4+_Th1_17 26  27  18  17  3   13  15  15  81  89  33  31  11  16  59  60  23  20
    T_cells_CD4+_Th17   210 212 164 122 14  13  30  35  72  105 20  28  1   5   16  22  7   7
    T_cells_CD4+_Th2    1040    1038    691 372 46  61  23  24  78  84  28  27  15  12  42  49  7   11
Keep    T_cells_CD8+_naive_stimulated   89  99  29  21  122 240 219 226 141 131 3

contrastDICE1 <- makeContrasts(c("Im1Im1-(Im1wt+Im1no)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE2 <- makeContrasts(c("Im2Im2-(Im2no+Im2wt)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE3 <- makeContrasts(c("wtwt-(wtIm1+wtno)/2"), levels = c("Im1Im1","Im1wt","Im1no","Im2Im2","Im2no","Im2wt","wtwt","wtIm1","wtno"))
contrastDICE<-cbind(contrastDICE1,contrastDICE2,contrastDICE3)

DEDICE<-pbDS(pbDICE, design = mmDICE, contrast = contrastDICE)

HelenaLC commented 1 year ago

It's really hard to grasp the data from what you're posting... But I'd say that from the cell counts by cluster-sample you cannot say directly whether or not you have enough cells to run differential analyses. This depends on how many cells there are per group or, otherwise put, how many samples have sufficiently many cells in each group. At the very least, two samples per group (depending on min_samples) need to have enough cells (depending on min_cells). Then, depending on filtering, these also have non-zero counts for some genes. Overall, it's not as simple as saying "I have enough cells". In addition, I would advice to caution: Even if there are just enough cells, I wonder whether you should "trust" results from testing a hand full of cells across many samples... Instead, a better approach might be do lower the resolution, i.e., merge some biologically similar subpopulation into broader groups such that they have a decent number of cells across all/most samples. Hope that all makes sense?

obotman commented 1 year ago

So even if I have enough cells/samples in a cell cluster, I should consider that no pbDS output for that cluster could be due to lack of differential gene expression rather than lack of cells .

HelenaLC commented 1 year ago

No, you will get results per gene (p-values, FDR, logFCs) independent of whether or not there's anything differential. But a cluster not being tested (though it has many cells in total) could indicate that there are not enough cells per sample per group... As I said: There need to be at least 2 samples in each group with at least N cells that express a gene.

obotman commented 1 year ago

I've read some documentation on prepSIM and edgeR In the muscatWrapper the settings for pb::edegeR are: filterByExpr.min.count = 7, filterByExpr.min.total.count = 15, filterByExpr.large.n = 4, filterByExpr.min.prop = 0.7

And in prepSIM: min_count = 1, min_cells = 10, min_genes = 100, min_size = 100,

So now it make senses why I'm losing cluster (cell subpopulation).

Thanks again

HelenaLC / muscat

muscat_analysis don't accept DICE annotation #120