Support for existing size factors

ATpoint commented 4 years ago

Hi Constantin,

first of all congrats and thanks for this nice and very useful package. I would like to ask two questions:

1) I was wondering whether there is (or you plan to add) an option to use existing size factors rather than estimating them when running glm_gp(). In my case I already have size factors from scran::calculateSumFactors(). The logcounts I use throughout the analysis are based on them. It would probably not make much of a difference, but for consistency it would be desirable to use exactly these factors rather than re-estimating them. Depending on the size of the dataset, reusing existing ones could also save a bit of time. Would that be possible?

2) This one is a general question about interpreting the QL ratio test results rather than a technical request. Is there a consensus in your group on how to deal with differential expression results that are driven by a small number of cells? For illustration, imagine you had two clusters of cells to compare, say 500 cells each, but in each cluster only 5-10% of cells actually have counts > 0 for a given gene (see plot with dummy data below). This would typically be a highly significant result using either Wilcoxon-like tests or something like the QL ratio test, due to the large number of cells per cluster. Still, it is questionable whether 5% of cells per cluster with counts > 0 are representative. This is probably a bit too "philosophical" for a GitHub issue, so please don't feel obligated to reply if you feel it is inappropriate here. I wanted to take the chance to ask in this context, though, as your package would allow for routine QL-like tests even on large datasets.

Link to Violin dummy plots

Thank you and best wishes! -Alex

const-ae commented 4 years ago

Hi Alex,

thank you so much for your kind words :)

On your first point, you can just call glm_gp() and provide your precalculated vector as the size_factors argument. The length of the vector should of course match the number of columns of the data argument :)
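A minimal sketch of what that call could look like. The toy count matrix and the library-size-based `sf` vector are made up for illustration and only stand in for real data and for the factors returned by scran::calculateSumFactors():

```r
library(glmGamPoi)

set.seed(1)
# Toy count matrix: 50 genes x 20 cells (illustrative data only)
counts <- matrix(rpois(50 * 20, lambda = 3), nrow = 50, ncol = 20)

# Stand-in for precomputed size factors (assumption: in a real analysis
# these would come from scran::calculateSumFactors())
sf <- colSums(counts) / mean(colSums(counts))

# One size factor per cell, i.e. per column of the data
stopifnot(length(sf) == ncol(counts))

# Pass the existing factors instead of letting glm_gp() estimate them
fit <- glm_gp(counts, design = ~ 1, size_factors = sf)

# The fit keeps one size factor per column
length(fit$size_factors)
```

The explicit length check just makes a mismatch between the vector and the columns obvious before fitting.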

I think that the second point you raise is indeed a very interesting and important question, even though it is a bit philosophical (or maybe even because of that ;) ). Personally, I would always advise being very careful with the interpretation of the results of a DE test between two clusters in a single-cell experiment, because cells from the same individual are not independent replicates. This of course applies equally to the QL and Wilcoxon tests. However, this is usually not a big problem, because you want to find "good" marker genes that characterize the clusters. This means you want genes whose expression distributions overlap as little as possible. The QL and Wilcoxon tests are tools to find such genes, but the p-values shouldn't be interpreted in the same way as they are for bulk experiments.
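To make the scenario from the question concrete, here is a small base R sketch for a single hypothetical gene. The numbers (500 cells per cluster, 5% versus 10% of cells with a nonzero count, every nonzero count set to 3) are assumptions chosen for illustration, not real data:

```r
# Deterministic toy data for one gene: 500 cells per cluster, mostly zeros
n_cells <- 500
cluster_a <- c(rep(3, 25), rep(0, n_cells - 25))  # 5% of cells express the gene
cluster_b <- c(rep(3, 50), rep(0, n_cells - 50))  # 10% of cells express the gene

# Fraction of cells with counts > 0 in each cluster
frac_a <- mean(cluster_a > 0)
frac_b <- mean(cluster_b > 0)

# Wilcoxon rank-sum test; exact = FALSE because the data are heavily tied
res <- wilcox.test(cluster_a, cluster_b, exact = FALSE)
p_value <- res$p.value

cat("fraction expressing:", frac_a, "vs", frac_b, "\n")
cat("p-value:", p_value, "\n")
```

Even though at least 90% of the cells in both clusters have zero counts, the test returns a small p-value because it treats all 1000 cells as independent observations. The two distributions still overlap almost completely, so this gene would make a poor marker.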

const-ae commented 4 years ago

Hey, I will close this issue at this point, because I interpret your thumbs up to mean that my comment solved your issue. If not, feel free to reopen :)

const-ae / glmGamPoi

Support for existing size factors #5