cole-trapnell-lab / monocle-release

272 stars 114 forks source link

Bug in differentialGeneTest with relative_expr = FALSE and running estimateSizeFactors #52

Open mebbert opened 6 years ago

mebbert commented 6 years ago

Hi,

I've been getting familiar with Monocle, and I believe I found a bug when running estimateSizeFactors before running differentialGeneTest with relative_expr = FALSE. If I understand the documentation and differentialGeneTest code properly, size factors should be ignored when relative_expr = FALSE, but I don't think that's happening.

Basically, I am getting 30 differentially expressed genes if I run

expr.data<-read.delim(<file>, sep="")
cell_sample_sheet<-read.delim("sample_sheet.txt", row.names=1)
pd<-new("AnnotatedDataFrame", data=cell_sample_sheet)

gene_ann<-data.frame(gene_short_name = row.names(expr.data), row.names = row.names(expr.data))
fd <- new("AnnotatedDataFrame",data=gene_ann)

cells<-newCellDataSet(as.matrix(expr.data),
                      phenoData = pd,
                      featureData = fd,
                      expressionFamily=negbinomial.size())

cells <- estimateSizeFactors(cells)
cells <- estimateDispersions(cells)

cells<-detectGenes(cells, min_expr = 0.1)

expressed_genes <- row.names(subset(fData(cells), num_cells_expressed >= 10))

cells.diff_test_res <- differentialGeneTest(cells,
                                      fullModelFormulaStr = "~Genotype",
                                      cores=8,
                                      relative_expr = FALSE)
cells.sig_genes <- subset(cells.diff_test_res, qval < 0.1)

I get 126 diff. expressed genes if I omit cells <- estimateSizeFactors(cells), however. I assume the result with 126 genes is the correct result, since I'm using 10x Genomics single cell (UMI) data. I didn't realize I don't need to run cells <- estimateSizeFactors(cells).

Also, a less important (and unrelated) "bug" (if it even qualifies as a bug), it looks like plot_genes_jitter calculates cds_exprs$adjusted_expression <- log10(cds_exprs$expression), but never uses it. The data is transformed by ggplot, instead.

Really appreciate your help.

Correction

Apparently you cannot run cells <- estimateDispersions(cells) without first running cells <- estimateSizeFactors(cells) in a clean environment. I deleted all objects (rm(list=ls())) and re-ran the code, getting an error stating that I had to run cells <- estimateSizeFactors(cells), first.

I do think something funky is going on though, because if I re-run the code (omitting cells <- estimateSizeFactors(cells)) without deleting all objects, I am able to run cells <- estimateDispersions(cells), and I get the different results mentioned above.

So, I guess there are potentially four different issues here:

  1. Should I run cells <- estimateSizeFactors(cells) and cells <- estimateDispersions(cells) with 10x Genomics UMI data? This is directly related to #50.
  2. Is there a bug where erroneous size factors persist even if I re-run newCellDataSet?
  3. Should differentialGeneTest with relative_expr = FALSE ignore whether cells <- estimateSizeFactors(cells) and cells <- estimateDispersions(cells) are run/set?
  4. Oh, and the small "bug" in plot_genes_jitter.
Xiaojieqiu commented 6 years ago

hi @mebbert I really appreciate your detailed description and sorry for the delay in response.

First of all, I am confused on why you can get different results (30 genes and 126 genes) when you use the same differentialGeneTest call. The size factor won't matter and I believe that if we set relative_expr = FALSE, the data won't be normalized by the library size.