drisso / zinbwave

Clone of the Bioconductor repository for the zinbwave package, see https://bioconductor.org/packages/zinbwave

Strange BCV plot for 10x data #41

Open · lazappi opened this issue 5 years ago

lazappi commented 5 years ago

Hi

I have been trying ZINB-WaVE to generate weights for edgeR on a 10x dataset, and some of the results look a bit strange. Here is what I have been doing so far:

zinb_model <- zinbFit(counts(sce),
                      X = model.matrix(~ Cluster, data = colData(sce)),
                      epsilon = 1e12,
                      verbose = TRUE)

zinb_weights <- computeObservationalWeights(zinb_model,
                                            as.matrix(counts(sce)))

n_clusts <- length(unique(colData(sce)$Cluster))
design <- model.matrix(~ 0 + colData(sce)$Cluster)
colnames(design) <- paste0("C", seq_len(n_clusts) - 1)

dge <- DGEList(counts(sce))
dge <- calcNormFactors(dge)
dge$weights <- zinb_weights
dge <- estimateDisp(dge, design)
de_fit <- glmFit(dge, design)

plotBCV(dge)

This gives the following, strange-looking BCV plot:

[image: BCV plot with ZINB-WaVE weights]

The regular BCV plot (without the ZINB-WaVE weights) looks OK:

[image: BCV plot without weights]

Just wondering if this is normal, whether I've done something wrong, or whether the dataset simply isn't a good fit. Also, what effect is this likely to have on downstream DE results?

Thanks

drisso commented 5 years ago

Your code seems reasonable to me. The only thing I would note is that, by default, zinbwave uses a common dispersion parameter. We have found that in most datasets a common dispersion is sufficient, but it may be that a gene-wise dispersion is preferable here.
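If it helps, a gene-wise dispersion can be requested through zinbFit's commondispersion argument. A minimal sketch, reusing the model call from the original post (sce and the cluster design are assumed from there):

```r
# Refit the ZINB model with a gene-wise, rather than common, dispersion
zinb_model_gw <- zinbFit(counts(sce),
                         X = model.matrix(~ Cluster, data = colData(sce)),
                         epsilon = 1e12,
                         commondispersion = FALSE,  # one dispersion per gene
                         verbose = TRUE)
```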

The other question is whether you need to account for zero inflation at all for this particular dataset. I would look at some goodness-of-fit plots to see whether the negative binomial or the zero-inflated negative binomial provides the better fit. An example is Supp. Figure S18 in the zinbwave paper; the code to generate it is in the paper's GitHub repo. Looking at a histogram of the weights is also usually informative.
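A quick way to do that last check, assuming the zinb_weights matrix computed in the original post:

```r
# Observational weights lie between 0 and 1: values near 0 mark observations
# attributed to the zero-inflation component, while values near 1 are treated
# as ordinary negative-binomial counts
hist(zinb_weights, breaks = 50,
     xlab = "ZINB-WaVE observational weight",
     main = "Distribution of weights")
```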

drisso commented 5 years ago

Also, I just noticed that in both of your BCV plots the logCPM values start at about 9... don't you have any lower-expressed genes? Are you filtering out those below that threshold?

lazappi commented 5 years ago

Hmm...that logCPM is strange. I haven't done any filtering like that, maybe it is just a result of low total counts per cell? Thanks for the suggestions, I'll have a look and see if anything jumps out.

lazappi commented 5 years ago

Ok, I've had a bit of a look at some of the things you suggested. Thanks for the ideas and for providing the code. The GoF plots for the gene means look OK for both ZINB-WaVE and standard edgeR:

[image: GoF plots for the gene means]

(although there is a bit of a trend for edgeR)

But this isn't the case when we look at the GoF for the proportion of zeros:

[image: GoF plots for the proportion of zeros]

It looks like something has gone wrong with ZINB-WaVE here. I'm not sure what the cause might be, other than a lack of zero-inflation? But I'm guessing this is why the resulting weights I get are either zero or one:

[image: histogram of the weights]

I think for the moment I am just going to fall back to regular edgeR but I am keen to give ZINB-WaVE a go if you have any other suggestions.

(P.S. I think the high CPM values are a result of how edgeR calculates logCPMs, but I couldn't quite track down exactly what was happening.)

drisso commented 5 years ago

Wow, that GoF for the proportion of zeros is bizarre! @koenvandenberge have you ever seen anything like this?

koenvandenberge commented 5 years ago

The high logCPMs are indeed normal with 10x-like datasets, because of the shallow sequencing depth per cell.

@lazappi Could you try plotting the zinbwave weights only for the zeros in the dataset, i.e. hist(weights[counts == 0])? I expect that you will get a peak at weights of zero (and that the weights of 1 you have now correspond to the ~10% positive counts in your 10x dataset). In that case, this might be an example of an overly aggressive push to assign zeros to the zero-inflation component of the ZINB, which is something we sometimes see with droplet-based datasets and which would explain the GoF for the zeros. We have seen this in some, but not all, datasets, and should look into what is causing it.
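Spelling that check out against the objects from the original post (zinb_weights and the sce count matrix are assumed from there):

```r
counts_mat <- as.matrix(counts(sce))

# Histogram of weights restricted to the zero counts: a single spike at 0
# would mean essentially every zero is pushed to the zero-inflation component
hist(zinb_weights[counts_mat == 0], breaks = 50,
     xlab = "Weight at zero counts", main = "Weights for zero counts")

# Fraction of positive counts, to compare against the peak of weights at 1
mean(counts_mat > 0)
```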

Thanks for sharing and checking this!

koenvandenberge commented 5 years ago

In the first set of plots, are those MA-plots? Or does "difference" mean something other than log FC in your plots? They are interesting...

lazappi commented 5 years ago

@drisso The histogram for the zero counts is just a single bar near zero. I think all the weights for zero counts have been set to 1e-15, which appears to be the default minimum. I don't know all the details of how the model works, but it does seem plausible that all the zeros are being assigned to zero-inflation.

@koenvandenberge In those plots the y-axis is the difference in the estimated and observed gene means (or proportion of zeros for the second set), similar to the Supplementary Figure 18 @drisso mentioned. It is possible that I didn't quite do these right or looked at the wrong bit of code from the paper but I'm fairly sure I have it right.

drisso commented 5 years ago

Hey @lazappi

I just fixed a pretty serious bug in computeObservationalWeights(). Can you please try your example again with version 1.4.2 or 1.5.2 (either from the master branch on GitHub or after waiting ~24hrs in Bioc) and see if your BCV plot gets better?
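Until the Bioconductor build propagates, the patched version can presumably be installed straight from GitHub (a sketch; assumes BiocManager is installed):

```r
# Install the development version from the master branch on GitHub
BiocManager::install("drisso/zinbwave")

# Confirm the version before re-running the analysis
packageVersion("zinbwave")  # expect 1.4.2 / 1.5.2 or later
```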

lazappi commented 5 years ago

[image: BCV plot with updated weights]

I've had a quick go and it looks better. There's still something weird with the trend line, though.

koenvandenberge commented 5 years ago

@lazappi Thanks! That looks much better indeed. Would you be willing to share the DGEList object that you used to create the BCV plot? The trend line issue is likely just a plotting problem, and the trend itself should be fine, but I can take a look at the plotBCV function to see if it can be fixed. Feel free to remove the counts slot etc. from the DGEList object if this concerns unpublished data.

Thanks, Koen

lazappi commented 5 years ago

Here is the object https://we.tl/t-SDXwy0OQ4S. The link should expire in a week, so make sure you save it somewhere.

vincr04 commented 5 years ago

Hi all,

> Hmm...that logCPM is strange. I haven't done any filtering like that, maybe it is just a result of low total counts per cell? Thanks for the suggestions, I'll have a look and see if anything jumps out.

I've made the same observation and realised that, depending on how I calculate the average logCPM for my own 10x data, I get very different results:

[image: Rplot48]

The plot on the right has logCPM values that actually reflect my data, but somehow it doesn't look like the others at all. Does anyone have an explanation? Am I doing something wrong here?

Also, I have the same issue as @lazappi when I calculate the GoF for the zero probability. Any idea where this might come from?

[image: Rplot56]

EDIT: Actually, I just found out what the issue is: in the computeP0 function from the code related to Supp. Figure S18, when phi is very small (and it seems to be consistently small in 10x data), R evaluates 1 + phi * mu as exactly 1 in double precision, which causes all values in zinbPY0 to become 1. I solved this by editing the last line of computeP0 with mpfr to force R to keep more decimals (although it is a bit slower):

library(Rmpfr)
pi + ((1 - pi) * ((1 + mpfr(phi, precBits = 128) * mu) ^ (-1/phi)))

and then

zinbPY0 = as.numeric(rowMeans(computeP0(zz)))
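For what it's worth, the underflow is easy to reproduce outside the package, and since (1 + x)^a = exp(a * log1p(x)), base R's log1p() gives a dependency-free alternative to Rmpfr. A sketch with made-up values for phi, mu, and the mixing probability (called pi0 here to avoid shadowing R's built-in pi):

```r
phi <- 1e-17  # hypothetical, very small dispersion of the kind seen in 10x fits
mu  <- 5      # hypothetical gene mean
pi0 <- 0.3    # hypothetical zero-inflation probability

# Naive evaluation: 1 + phi * mu rounds to exactly 1 in double precision,
# so the NB term becomes 1^(-1/phi) = 1 and P0 collapses to 1
p0_naive <- pi0 + (1 - pi0) * (1 + phi * mu)^(-1 / phi)

# Stable evaluation via log1p(): keeps the tiny phi * mu contribution
p0_stable <- pi0 + (1 - pi0) * exp(-log1p(phi * mu) / phi)

p0_naive   # 1, the collapsed value
p0_stable  # roughly 0.305, i.e. pi0 + (1 - pi0) * exp(-5)
```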

Vinnie