davismcc / archive-scater

An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.

Improved documentation to warn against comparing normalized values #91

Closed: LTLA closed this issue 7 years ago

LTLA commented 7 years ago

I almost made this mistake myself and only caught it at the last minute, so most users probably wouldn't even notice it. The problem is as follows:

  1. normalize automatically recenters the size factors to have a mean of unity before calculating the values in exprs. This is undoubtedly the right thing to do, for several reasons: it allows the exprs values to be interpreted as normalized log-counts, and it allows sensible comparison of abundances between features normalized with different sets of size factors (e.g., endogenous genes and spike-in transcripts).
  2. If you subset a SCESet object, it's a good idea to rerun normalize, as it ensures that abundances are comparable within the subset. For example, if the cells in the selected subset had consistently small size factors for the endogenous genes and we failed to rerun normalize, the abundance of a gene would be higher than that of a spike-in transcript with identical counts. This is problematic when trying to model the technical component of the variance, where abundances need to be comparable.
  3. However, if you follow the (good) advice in the first two points, exprs will no longer be comparable between different subsets of the original SCESet object, because each subset has been normalized separately on its own scale. This is the gotcha that we should warn against: after running normalize on an object, its expression values (and recentered size factors) are not comparable to those in another object.
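
To make the gotcha concrete, here is a minimal Python sketch. It is not scater's implementation: the helper names `recenter` and `log_norm`, the toy counts, and the assumed transform log2(count / size factor + 1) are all illustrative assumptions. It shows a gene whose normalized value is identical in every cell when all cells are normalized together, yet appears to differ once two subsets are normalized separately.

```python
import math

def recenter(size_factors):
    """Scale size factors so that their mean is 1 (hypothetical helper)."""
    mean_sf = sum(size_factors) / len(size_factors)
    return [sf / mean_sf for sf in size_factors]

def log_norm(counts, size_factors, pseudo=1.0):
    """Assumed transform: log2(count / recentered size factor + pseudo-count)."""
    centered = recenter(size_factors)
    return [math.log2(c / sf + pseudo) for c, sf in zip(counts, centered)]

# Hypothetical toy data: one gene measured in four cells.
counts = [10, 10, 40, 40]
size_factors = [0.5, 0.5, 2.0, 2.0]

# Normalizing all cells together: every cell gets the same value.
full = log_norm(counts, size_factors)

# Normalizing each subset on its own: the size factors are recentered
# within each subset, so the two subsets end up on different scales.
subset_a = log_norm(counts[:2], size_factors[:2])
subset_b = log_norm(counts[2:], size_factors[2:])

print(full)      # all four values are (approximately) equal
print(subset_a)  # lower than the values in `full`
print(subset_b)  # higher than the values in `full`
```

Within each subset the values are still internally consistent; it is only the comparison between `subset_a` and `subset_b` that is misleading.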

Arguably, though, the same general point applies to all types of data. If you use effective library sizes for CPMs or FPKMs, then you need to recenter the size factors on the mean library size, and the latter value will change across subsets/objects. It also goes without saying that you shouldn't be comparing raw counts between objects - not without some normalization, and to do that you'll have to normalize across all cells at once, rather than normalizing within each object and comparing the values.
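
The same effect can be sketched for CPMs. The following Python snippet is only an illustration under stated assumptions: the helper names `effective_library_sizes` and `cpm`, and all of the numbers, are hypothetical. Effective library sizes are taken to be the size factors rescaled so that their mean equals the mean library size. Computing them across all cells at once and then splitting gives comparable values, whereas computing them within each subset does not.

```python
def effective_library_sizes(size_factors, library_sizes):
    """Rescale size factors so their mean equals the mean library size."""
    mean_sf = sum(size_factors) / len(size_factors)
    mean_lib = sum(library_sizes) / len(library_sizes)
    return [sf / mean_sf * mean_lib for sf in size_factors]

def cpm(counts, eff_libs):
    """Counts per million, using effective library sizes."""
    return [c / e * 1e6 for c, e in zip(counts, eff_libs)]

# Hypothetical toy data: one gene measured in four cells.
counts = [10, 10, 40, 40]
size_factors = [0.5, 0.5, 2.0, 2.0]
library_sizes = [1000, 2000, 2000, 3000]

# Normalize across all cells at once, then split into subsets.
pooled = cpm(counts, effective_library_sizes(size_factors, library_sizes))
pooled_a, pooled_b = pooled[:2], pooled[2:]

# Normalize within each subset: the mean library size (and mean size
# factor) differs between subsets, so the CPMs are on different scales.
separate_a = cpm(counts[:2],
                 effective_library_sizes(size_factors[:2], library_sizes[:2]))
separate_b = cpm(counts[2:],
                 effective_library_sizes(size_factors[2:], library_sizes[2:]))

print(pooled_a, pooled_b)      # comparable between subsets
print(separate_a, separate_b)  # not comparable between subsets
```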

LTLA commented 7 years ago

Okay, added in #92. Happy for this to be closed upon merge.