MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

add weights in pseudoBulkDGE for edgeR analysis #107

Open koenvandenberge opened 1 year ago

koenvandenberge commented 1 year ago

Hi,

In some cases, one might want to add weights to the dispersion estimation and fitting of the edgeR model when using the pseudoBulkDGE function. In my case these would be sample-level weights (much like voom with sample quality weights), but I imagine that observation-level weights may also be of interest, so I have allowed a matrix of weights with the same dimensions as the expression matrix. This seems to be possible with only a minor adjustment.
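For concreteness, the intended call would look something like this. This is only a sketch: summed stands for a pseudo-bulked SingleCellExperiment (e.g. from aggregateAcrossCells), the column names are made up, and the weights= argument is the one proposed here, not part of released scran.

```r
library(scran)
# 'summed': pseudo-bulk SingleCellExperiment, one column per sample/label combination.
# 'w': user-supplied numeric matrix with the same dimensions as counts(summed).
# All argument values below are illustrative; 'weights=' is the proposed addition.
w <- matrix(1, nrow = nrow(summed), ncol = ncol(summed))
out <- pseudoBulkDGE(summed,
    label = summed$celltype,
    design = ~condition,
    coef = "conditiontreated",
    weights = w)
```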

Let me know what you think.

LTLA commented 1 year ago

In principle, the ability to include weights would be useful. However, there are a minor problem and a major one.

The minor problem is that weights needs to be subsetted by chosen to match the subsetting of x and col.data.

The major problem is that I don't see how these weights can be easily computed by your average user without asking them to go through the edgeR pipeline for each comparison. I assume that you're computing weights using knowledge of the design matrix, plus some normalization and filtering information, etc. If that's the case, you probably were already halfway through some kind of edgeR analysis, so why bother using pseudoBulkDGE at all?

How are you deriving your weights? Unlike limma, we don't have many weighting functions in edgeR, and the interpretation of the weights isn't so clear-cut either... something to do with scaling the QL dispersion IIRC. But if you do have a function, a more ergonomic approach would be to allow users to pass a weighting function that runs on y (possibly also passing in curdesign) and is expected to emit a vector/matrix of weights for each label.
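That function-based interface might look something like the sketch below. Every name here is hypothetical (weight.fun is not an existing pseudoBulkDGE argument); it is just meant to illustrate the shape of the contract being suggested.

```r
# Hypothetical interface: the user passes a function that is called once per
# label with that label's count matrix and design matrix, and returns a
# vector (one weight per sample) or a matrix of weights for that label.
my_weight_fun <- function(y, curdesign) {
    # Trivial placeholder: unit weights for every sample.
    rep(1, ncol(y))
}

out <- pseudoBulkDGE(summed,
    label = summed$celltype,
    design = ~condition,
    coef = "conditiontreated",
    weight.fun = my_weight_fun)  # hypothetical argument name
```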

koenvandenberge commented 1 year ago

Thank you for looking into this; I missed the subsetting and have adapted the pull request.

On the major problem, I think it depends. Personally, I am only interested in sample-level weights that do not depend on the gene. These can be estimated for all cell types in one go on the aggregated SCE, and subsequently provided to pseudoBulkDGE, without first having to write a custom loop across cell types. Even if the weights were gene-specific, they could be calculated on the full pseudo-bulk data first and then subsetted according to the filterByExpr filtering used inside pseudoBulkDGE. Sometimes, sample-level weights can even be calculated from the colData alone, e.g., for sample purity/contamination effects or inverse probability weighting approaches. These require working on the full pseudo-bulk data first, to estimate the (inverse) probabilities/weights, so it seems suboptimal to do this for every cell type separately via a function.
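As one example of the "estimate once, then pass in" workflow, sample-level weights could be computed on the full pseudo-bulk matrix with limma's arrayWeights on log-CPM values and then expanded to a gene-by-sample matrix. This is a sketch under the assumption that limma-style quality weights are an acceptable choice here, which is precisely what is being debated in this thread; summed and the condition column are illustrative.

```r
library(edgeR)
library(limma)
# Estimate one weight per pseudo-bulk sample on the full matrix, then
# expand to a gene-by-sample matrix to pass to the proposed weights= argument.
# arrayWeights() is a limma tool for log-scale data; whether such weights are
# meaningful inside edgeR's NB/QL machinery is the open question above.
dge <- DGEList(counts(summed))
dge <- calcNormFactors(dge)
logcpm <- cpm(dge, log = TRUE)
design <- model.matrix(~condition, data = as.data.frame(colData(summed)))
aw <- arrayWeights(logcpm, design = design)  # one weight per sample
w <- matrix(aw, nrow = nrow(summed), ncol = ncol(summed), byrow = TRUE)
```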

Regarding the interpretation, how much of an issue do you think it is? You'll know better than me, but it seems like it may be more clear-cut when not using the QL framework; so, if weights are used, the estimateDisp and glmFit workflow could be used instead, although that might be taking it too far for scran.

I understand that the suggestion may be niche, and if you think it does not fit within scran, feel free to decline the pull request. I was just hoping to make my future life a bit easier.

LTLA commented 1 year ago

The proposed changes to the code seem fine enough... provided the weights are sensible.

And I do worry about whether you have sensible weights to pass in. I was staring at edgeR's C++ code to jog my memory, and I think that the weights are being applied to the mean-variance function for each observation. So: if you were to say that an observation has a user-supplied weight of 2, you're claiming it's half as variable as it should be for its mean. This has the (superficially desired) effect of increasing its influence on the coefficient estimates, but it's also a very specific meaning, and I don't think you can just throw in any value and expect type I error control to be maintained.
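In symbols (my notation, not anything from edgeR's documentation): if the fitted mean-variance relationship for observation $i$ of gene $g$ is scaled by a user-supplied weight $w_{gi}$, something like

$$\mathrm{var}(y_{gi}) = \frac{\sigma^2_g \left( \mu_{gi} + \phi\,\mu_{gi}^2 \right)}{w_{gi}}$$

where $\phi$ is the NB dispersion and $\sigma^2_g$ a gene-wise quasi-dispersion, then $w_{gi} = 2$ asserts that the observation's variance is half what the model would otherwise expect at that mean, hence its larger influence on the coefficient estimates.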

On that note: IIRC, as soon as you have a non-unity weight, you're already halfway into QL territory, as the QL methods also involve scaling the variance function. Specifically, you can't just divide the variance by some arbitrary user-supplied weight and still consider the observations to follow a negative binomial distribution - these non-normal models don't scale up and down like that, which is why we have QL. So the choice between QL/non-QL shouldn't be the issue here.

Having said all of that, I barely remember any of this stuff, and my grasp of it was fairly tenuous to begin with, so it might just all be fine. Perhaps @gksmyth or @yunshun could provide their thoughts. I remember some discussion about weighting a few years ago, but I can't remember what the conclusion was.

gksmyth commented 1 year ago

Please be careful. Weights used for edgeR glm fitting and weights input to NB dispersion estimation are not the same thing. For glms and the QL pipeline, weights have a clear meaning in terms of the quasi-dispersions. For non-QL functions like estimateDisp(), weights only have a meaning in terms of interpreting observations as averages, which I suspect will not have the flexibility that @koenvandenberge is after.

Inverse probability weights are designed to correct biases in the linear model estimates and are not related to variances or dispersions. As far as I can see, inverse probability weights would not work correctly with dispersion estimation functions or with differential expression tests.