Rethink how coefficients of variation are computed

(This is more food for thought than an actual issue. Still, I think our package could clarify the concept of the coefficient of variation (CV) by improving the implementation of medianCVperCell() and its documentation.)

I find there is a lot of confusion about how to compute and interpret coefficients of variation in SCP. Initially, I thought that CVs were computed from the standard deviation and mean of the absolute intensities among features belonging to the same group (e.g. peptides belonging to the same protein) within each cell. Computing the CV this way is wrong and leads to a useless metric, because peptide intensity is influenced by properties other than the peptide amount (e.g. ionization efficiency). We cannot compare intensities from different peptides, so we should not expect a small variance across peptides. This is why MS-based proteomics analyses rely on relative quantification rather than absolute quantification.
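To make this concrete, here is a toy sketch in plain Python (not the package's R code). The peptide names, response factors and protein amounts are all made up, and the data contain no noise at all. The CV on raw intensities is still large, because it only reflects the differences in peptide response. Dividing each peptide by its mean across cells, which is just one possible way to obtain relative intensities, brings the CV down to zero.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical protein with three peptides measured in two cells.
# Each peptide has its own response factor (e.g. ionization efficiency),
# so the same amount gives very different raw intensities.
response = {"pepA": 1.0, "pepB": 10.0, "pepC": 100.0}
protein_amount = {"cell1": 5.0, "cell2": 8.0}

raw = {
    cell: {pep: amount * factor for pep, factor in response.items()}
    for cell, amount in protein_amount.items()
}

# CV across peptides on raw intensities, within each cell.
raw_cv = {cell: cv(list(peps.values())) for cell, peps in raw.items()}

# Relative intensities: divide each peptide by its mean across cells.
pep_mean = {pep: mean(raw[cell][pep] for cell in raw) for pep in response}
relative = {
    cell: {pep: raw[cell][pep] / pep_mean[pep] for pep in response}
    for cell in raw
}

# CV across peptides on relative intensities, within each cell.
rel_cv = {cell: cv(list(peps.values())) for cell, peps in relative.items()}

print(raw_cv)  # well above 1 in both cells
print(rel_cv)  # essentially 0 in both cells
```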

Action 1: the package should clearly state that feature CVs should not be computed without some form of feature normalization. No normalization is currently the default in medianCVperCell().

Another way to compute CV is by computing the CV across cells. This is briefly discussed in the initial SCP recommendations [1]:

"When comparing CVs across different analytical or experimental conditions, it is imperative to account for varying dataset sizes; that is, a rigorous comparison between experimental methods would rely on peptides and proteins identified and quantified across all samples, rather than also including peptides and proteins identified uniquely in individual experiments."

However, I am personally not convinced by the CV across cells. On top of the limitation addressed above, the CV across cells is only valid if you expect homogeneous cells within each condition. Cell homogeneity is already questionable for cell lines (e.g. the effect of the cell cycle), and it would clearly not hold for tissue samples. So the CV across cells is not a good QC metric: it assesses biological variability as well as technical variability, and you clearly don't want to remove a feature during QC because it has higher biological variability.
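A small made-up example of this confounding, in plain Python: the six intensities below are hypothetical values for one feature, with two subpopulations of three cells that differ two-fold in abundance. Within each subpopulation the CV is 0.1, but the CV across all cells is several times larger, although the technical variability is the same.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical intensities of one feature in two subpopulations of cells.
# The second subpopulation has twice the abundance and the same relative spread.
subpop_a = [9.0, 10.0, 11.0]
subpop_b = [18.0, 20.0, 22.0]

cv_a = cv(subpop_a)               # variability within subpopulation A
cv_b = cv(subpop_b)               # variability within subpopulation B
cv_all = cv(subpop_a + subpop_b)  # variability across all cells

print(cv_a, cv_b, cv_all)
```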

Therefore, the recommendations also state:

"The CV estimated from the relative levels of different peptides originating from the same protein may provide a useful measure of reliability."

But this raises an open question: how should these relative levels be computed? The Slavov lab's SCoPE2 pipeline first normalizes samples (divide by the median) and then normalizes features (divide by the mean). I'm not convinced this is the best way to derive relative quantification values. I think it would be beneficial to provide more flexibility in how features are normalized before computing CVs.
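For reference, the two normalization steps described above, followed by a median CV per cell, could be sketched as follows. This is plain Python with made-up data, not the actual scp or SCoPE2 code: the function names, the `min_peptides` parameter and the nested-dict data layout are assumptions made for illustration only.

```python
from statistics import mean, median, stdev


def scope2_like_normalize(x):
    """x maps peptide -> {cell: intensity}; returns relative intensities."""
    cells = list(next(iter(x.values())))
    # Step 1: divide each cell (sample) by its median intensity.
    cell_median = {c: median(x[p][c] for p in x) for c in cells}
    by_sample = {p: {c: x[p][c] / cell_median[c] for c in cells} for p in x}
    # Step 2: divide each peptide (feature) by its mean across cells.
    pep_mean = {p: mean(row.values()) for p, row in by_sample.items()}
    return {
        p: {c: v / pep_mean[p] for c, v in row.items()}
        for p, row in by_sample.items()
    }


def median_cv_per_cell(rel, protein_of, min_peptides=2):
    """Median, over proteins, of the CV across peptides within each cell."""
    cells = list(next(iter(rel.values())))
    # Group peptides by the protein they belong to.
    groups = {}
    for pep, prot in protein_of.items():
        groups.setdefault(prot, []).append(pep)
    out = {}
    for c in cells:
        cvs = []
        for peps in groups.values():
            # A CV needs at least two peptides per protein.
            if len(peps) >= max(min_peptides, 2):
                vals = [rel[p][c] for p in peps]
                cvs.append(stdev(vals) / mean(vals))
        out[c] = median(cvs) if cvs else float("nan")
    return out


# Hypothetical data: four peptides from two proteins, measured in two cells.
x = {
    "pep1": {"c1": 10.0, "c2": 20.0},
    "pep2": {"c1": 100.0, "c2": 200.0},
    "pep3": {"c1": 30.0, "c2": 30.0},
    "pep4": {"c1": 300.0, "c2": 330.0},
}
protein_of = {"pep1": "P1", "pep2": "P1", "pep3": "P2", "pep4": "P2"}

rel = scope2_like_normalize(x)
print(median_cv_per_cell(rel, protein_of))
```

Keeping the normalization as a separate step, as in this sketch, is what would make it easy to swap in another way of deriving relative intensities before computing the CVs.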

Action 2: following up on Action 1, we should remove the norm argument and instead clearly document that medianCVperCell() expects relative intensities. We could document the SCoPE2 normalization as part of the QFeatures processing workflow, using normalize(), instead of hiding the normalization within medianCVperCell().

The recommendations also warn that even if CV is estimated on relative quantification, this evaluation "is limited by the existence of proteoforms".

Action 3: maybe this issue could be solved by assessing proteoform quantification [2]. The HIquant method is implemented as Python code. This is not directly linked to CVs and could be time-consuming (so a topic for another issue?).

[1] Gatto, Laurent, Ruedi Aebersold, Juergen Cox, Vadim Demichev, Jason Derks, Edward Emmott, Alexander M. Franks, et al. 2023. “Initial Recommendations for Performing, Benchmarking and Reporting Single-Cell Proteomics Experiments.” Nature Methods, March. https://doi.org/10.1038/s41592-023-01785-3.

[2] Malioutov, Dmitry, Tianchi Chen, Edoardo Airoldi, Jacob Jaffe, Bogdan Budnik, and Nikolai Slavov. 2019. “Quantifying Homologous Proteins and Proteoforms.” Molecular & Cellular Proteomics: MCP 18 (1): 162–68.
