Inspired by a question that I was asked a while ago, but forgot to write down somewhere.
Basically: should we continue to remove outlier cells with relatively low coverage, even in the "asymptotic" case of a deeply sequenced experiment where even the low-coverage cells are still sequenced deeply enough for downstream work?
My answer was: yes, we should, because the phenomenon that caused them to be low outliers may not be a perfect scaling process. Rather, it may skew the transcriptomic profile, e.g., favoring shorter transcripts when amplification is suboptimal. This kind of non-scaling behavior doesn't seem to be unusual in bulk experiments, where scaling normalization is not sufficient to rescue libraries with ~10-fold decreases in size relative to their replicates. Of course, it might ultimately turn out to be perfectly scaling, but we'd only know that after doing the downstream steps and checking the clustering, so we take the more conservative route here.
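For the record, the kind of filter I have in mind is relative rather than absolute, which is why the question is even interesting for deeply sequenced data. A minimal numpy sketch, assuming a cells-by-genes count matrix; the function name, the 3-MAD threshold and the simulated `counts` are purely illustrative, not the actual code in the text:

```python
import numpy as np

def low_coverage_outliers(counts, nmads=3.0):
    """Flag cells whose log-library size is more than `nmads` MADs below the median."""
    libsize = counts.sum(axis=1)
    log_lib = np.log10(libsize + 1)  # log-transform to tame the right skew
    med = np.median(log_lib)
    mad = np.median(np.abs(log_lib - med)) * 1.4826  # scale MAD to match a normal SD
    return log_lib < med - nmads * mad  # only the low tail is flagged

# Usage: keep everything that is not a low outlier, regardless of absolute depth.
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(200, 1000))
counts[:5] //= 10  # crudely mimic a few cells with ~10-fold lower coverage
keep = ~low_coverage_outliers(counts)
print(keep.sum(), "of", counts.shape[0], "cells retained")
```

The point is that the threshold moves with the depth of the experiment, so the flagged cells are always the ones that are unusually shallow relative to their neighbors, i.e., the ones most likely to carry the non-scaling biases above.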
As an aside: an implication of the above reasoning is that none of the differences in library size may be perfectly scaling, and that we should be doing some non-linear normalization instead. I'm not down for that. Can't remember whether I mention this in the text, but: it's technically difficult, computationally expensive, and comes with its own dirty laundry, e.g., the assumption of majority non-DE-ness across the covariate range. Probably fine to just remove the worst offenders and live with the rest.
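To make concrete what that would involve: something like fitting and removing an abundance-dependent trend for each cell against a reference profile. A rough numpy sketch, with binned medians standing in for a proper loess; `nonlinear_normalize`, `cell`, `ref` and `nbins` are all hypothetical names for illustration, not anything actually implemented in the text:

```python
import numpy as np

def nonlinear_normalize(cell, ref, nbins=20, pseudo=1.0):
    """Remove an abundance-dependent bias of one cell relative to a reference profile."""
    a = 0.5 * (np.log2(cell + pseudo) + np.log2(ref + pseudo))  # average abundance per gene
    m = np.log2(cell + pseudo) - np.log2(ref + pseudo)          # log-fold difference per gene
    # Equal-count bins along the abundance covariate; a real implementation
    # would fit a loess/spline trend rather than binned medians.
    order = np.argsort(a)
    idx = np.empty(a.size, dtype=int)
    idx[order] = np.arange(a.size) * nbins // a.size
    trend = np.array([np.median(m[idx == b]) for b in range(nbins)])
    # Subtracting the trend treats it as purely technical, which only holds if
    # most genes are non-DE at every abundance ("majority non-DE-ness").
    return np.log2(cell + pseudo) - trend[idx]
```

And that's one cell against one reference; doing it properly across thousands of cells is where the computational expense and the extra assumptions come in.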
Anyway, we could probably inject a sentence into the QC section about the potential inability of normalization to rescue low-coverage cells.