I almost did this myself and only caught it at the last minute, so most users probably wouldn't even notice. The problem is as follows:
`normalize` automatically recenters the size factors to have a mean of unity before calculating values in `exprs`. This is undoubtedly the right thing to do, for several reasons: it allows the `exprs` values to be interpreted as normalized log-counts, and it allows sensible comparison of abundances between features normalized with different sets of size factors (e.g., endogenous genes and spike-in transcripts).
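To make the recentring concrete, here's a small numpy sketch of the arithmetic. The numbers are invented, and the expression values are assumed to follow the usual log2(scaled count + 1) convention; this is an illustration of the idea, not the package's actual implementation.

```python
import numpy as np

# Hypothetical counts and size factors for four cells (made-up numbers).
counts = np.array([10.0, 20.0, 30.0, 40.0])
size_factors = np.array([0.5, 1.0, 1.5, 2.0])

# Recenter so the size factors have a mean of unity.
centered = size_factors / size_factors.mean()
assert np.isclose(centered.mean(), 1.0)

# Normalized log-counts, assuming a pseudo-count of 1.
exprs = np.log2(counts / centered + 1)
```

Because the centered factors average to one, dividing by them rescales the counts without shifting the overall level, which is what lets the results be read as normalized log-counts.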
If you subset an `SCESet` object, it's a good idea to rerun `normalize`, as this ensures that abundances are comparable within the subset. For example, if the cells in the selected subset had consistently small size factors for the endogenous genes and we failed to rerun `normalize`, the abundance of a gene would be higher than that of a spike-in transcript with identical counts. This is problematic when trying to model the technical component of the variance, where abundances need to be comparable.
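A toy numpy illustration of this problem (all numbers invented): both sets of size factors average to unity across the full object, but not within the subset, so a gene and a spike-in with identical counts get different abundances until the factors are recentered within the subset.

```python
import numpy as np

# Hypothetical size factors, each averaging to unity across ALL four cells.
gene_sf = np.array([0.4, 0.6, 1.4, 1.6])   # endogenous genes
spike_sf = np.array([1.0, 1.0, 1.0, 1.0])  # spike-in transcripts

subset = [0, 1]   # cells with consistently small gene size factors
count = 100.0     # identical count for a gene and a spike-in

# Without rerunning normalize, the gene looks more abundant than the
# spike-in despite having the same count (~[250, 167] vs [100, 100]).
gene_abund = count / gene_sf[subset]
spike_abund = count / spike_sf[subset]

# Rerunning normalize recenters each set of factors within the subset,
# so both again have a mean of unity and abundances become comparable.
gene_recentered = gene_sf[subset] / gene_sf[subset].mean()
spike_recentered = spike_sf[subset] / spike_sf[subset].mean()
```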
However, if you follow the (good) advice in the two points above, `exprs` values will no longer be comparable between different subsets of the original `SCESet` object, because they've been normalized separately on different scales. This is the gotcha that we should warn against - after running `normalize` on an object, expression values (and recentered size factors) are not comparable to those in another object.
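The gotcha in numbers, continuing the same made-up setup: the same count from the same cell maps to a different expression value depending on which set of cells it was normalized with.

```python
import numpy as np

sf = np.array([0.5, 1.0, 1.5, 2.0])  # hypothetical size factors
count = 100.0                        # a count for the first cell

# Expression value after normalizing within the full object.
full = np.log2(count / (sf / sf.mean())[0] + 1)

# Expression value after normalizing within a subset of the first two cells.
sub_sf = sf[:2]
sub = np.log2(count / (sub_sf / sub_sf.mean())[0] + 1)

# Same cell, same count, different values across the two objects.
assert not np.isclose(full, sub)
```

The difference arises purely from the recentring: the subset's mean size factor differs from the full object's, so the same cell gets a different centered factor in each.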
Arguably, though, the same general point applies to all types of data. If you use effective library sizes for CPMs or FPKMs, then you need to recenter the size factors on the mean library size, and the latter value will change across subsets/objects. It also goes without saying that you shouldn't be comparing raw counts between objects - not without some normalization, and to do that you'll have to normalize across all cells at once, rather than normalizing within each object and comparing the values.
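The same point for CPM-style normalization, sketched with made-up library sizes: a cell's effective size factor depends on the mean library size of whichever set of cells it was computed over, so it shifts when you subset.

```python
import numpy as np

lib_sizes = np.array([1e5, 2e5, 3e5, 4e5])  # hypothetical library sizes

# Library-size-based factors, recentered on the mean library size.
sf_full = lib_sizes / lib_sizes.mean()            # mean over all four cells
sf_subset = lib_sizes[:2] / lib_sizes[:2].mean()  # mean over a subset

# The first cell's factor changes with the subset (0.4 vs ~0.67),
# so its normalized values change too.
assert not np.isclose(sf_full[0], sf_subset[0])
```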