Page 5: Data normalisation and batch correction: Paragraph 1

Normalisation using a size factor for each sample to scale RNA-seq library sizes to make samples more comparable has proven useful in the analysis of bulk RNA-seq data. The methods TMM (Robinson and Oshlack, 2010), relative log-expression (Anders and Huber, 2010) and upper-quartile (Bullard et al., 2010) are frequently used. Size-factor normalisation is supported in scater, with these three methods available, as well as tight integration with the scran package that implements a method utilising cell pooling and deconvolution to compute size factors better suited to scRNA-seq data (Lun et al., 2016).

A smoother wording, with less breathing required:

Scaling normalization is typically required in RNA-seq data analysis to remove biases caused by differences in sequencing depth or capture efficiency, or by composition effects, between samples. Frequently used methods for scaling normalization include the trimmed mean of M-values (Robinson and Oshlack, 2010), relative log-expression (Anders and Huber, 2010) and upper-quartile methods (Bullard et al., 2010), all of which are available for use in scater. In addition, scater is tightly integrated with the scran package, which implements a method utilising cell pooling and deconvolution to compute size factors better suited to scRNA-seq data (Lun et al., 2016).

I also suggest replacing "size factor normalization" with "scaling normalization", as the latter is a more standard term. The introduction above makes it clear that scaling refers to normalization with size factors.
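To make concrete what "normalization with size factors" means here, a minimal sketch of the relative log-expression (median-of-ratios) idea follows. It is written in plain Python purely for illustration and is not the scater, scran or DESeq implementation; the function name `rle_size_factors` and the toy counts are assumptions made up for this example.

```python
import math
from statistics import median


def rle_size_factors(counts):
    """Median-of-ratios size factors (hypothetical helper, illustration only).

    counts: list of genes, each a list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with a positive count in every sample, so that the
    # geometric mean across samples is well defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in all samples")
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # The median ratio across genes is the sample's size factor.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced at twice the depth of sample 1.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],   # dropped from the size factor calculation (contains a zero)
]
size_factors = rle_size_factors(counts)
# Scaling normalisation: divide each count by its sample's size factor.
normalised = [[c / s for c, s in zip(row, size_factors)] for row in counts]
```

Note that genes containing a zero in any sample are discarded in this sketch, so it needs enough genes with positive counts in every sample to work at all.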

Such normalisation is necessary, but further correction is typically required to ameliorate or remove batch effects. Here we present three possibilities, all easily implemented in a scater workflow. We emphasise that it is generally preferable to incorporate batch effects into statistical models used for inference. Where this is not possible, and for visualisations, approaches such as the following may be used.

Suggest you break this into a separate paragraph as we're talking about batch correction now. Also merge with the next paragraph for tighter reading:

After scaling normalisation, further correction is typically required to ameliorate or remove batch effects. For example, in the case study dataset, cells from two patients were each processed on two C1 machines. Although the C1 machine is not one of the most important explanatory variables on a per-gene level (Figure 2e), this factor is correlated with the first principal component of the log-expression data (Figure 2f). This effect cannot be removed by scaling normalisation methods, which target cell-specific biases and are not sufficient for removing large-scale batch effects that vary on a gene-by-gene basis (Figure 4a). Here we present two possibilities, both easily implemented in a scater workflow.

Also, there seem to be only two possibilities, rather than three.

We emphasise that it is generally preferable to incorporate batch effects into statistical models used for inference. Where this is not possible, and for visualisations, approaches such as the following may be used.

Just chuck this to the end of the section, after you describe the different methods. It's not necessary to mention this at the start; it's too subtle a point. Also, visualization is a subset of "when it's not possible", so we should instead do:

We emphasise that it is generally preferable to incorporate batch effects or latent variables into statistical models used for inference. Where this is not possible (e.g., for visualisation), directly regressing out these uninteresting factors is required to obtain "corrected" expression values for further analysis.
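As an illustration of what "directly regressing out" a factor amounts to in the simplest case, here is a minimal sketch in plain Python. It assumes batch is the only term in the model, in which case the least-squares fit for each gene is just the per-batch mean of its log-expression values. The function name `regress_out_batch` and the toy values are hypothetical, and this is not the method used in scater or in the paper.

```python
from statistics import mean


def regress_out_batch(log_expr, batch):
    """Remove a one-factor batch effect gene by gene (illustration only).

    log_expr: list of genes, each a list of log-expression values per cell.
    batch:    list of batch labels, one per cell.
    Returns "corrected" values: residuals from the per-batch mean, shifted
    back to the gene's overall mean so the scale stays interpretable.
    """
    corrected = []
    for row in log_expr:
        grand_mean = mean(row)
        # With batch as the only covariate, the fitted value for a cell
        # is the mean of its batch.
        batch_means = {
            b: mean([v for v, lab in zip(row, batch) if lab == b])
            for b in set(batch)
        }
        corrected.append(
            [v - batch_means[lab] + grand_mean for v, lab in zip(row, batch)]
        )
    return corrected


# Toy data: one gene, four cells, batch B shifted upwards by 3 units.
log_expr = [[1.0, 2.0, 4.0, 5.0]]
batch = ["A", "A", "B", "B"]
corrected = regress_out_batch(log_expr, batch)
```

After correction both batches have the same mean for each gene, while within-batch differences between cells are preserved. Any biological signal that is confounded with batch is removed as well, which is one reason to prefer including batch in the statistical model where that is possible.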
