Closed: michaelpierrelee closed this issue 7 years ago
Hi Michael,
To answer, I would need to know in a bit more detail what you are asking for. What do you mean exactly by taking the uncertainties (of the SVCD factors) into account? Do you mean understanding, or having an intuition about, how these uncertainties would affect downstream analyses?
For instance, after running SVCD 50 times, I got the mean and the standard deviation of each size factor:
sample | mean | std |
---|---|---|
A1 | 0.72 | 0.11 |
A2 | 0.63 | 0.1 |
A3 | 0.64 | 0.1 |
B1 | 0.73 | 0.08 |
B2 | 0.69 | 0.08 |
B3 | 0.81 | 0.09 |
C1 | 1.18 | 0.06 |
C2 | 1.14 | 0.06 |
C3 | 1.09 | 0.06 |
D1 | 1.85 | 0.25 |
D2 | 1.85 | 0.25 |
So I am wondering whether I should take the means as size factors, rather than the values from a single run, especially for the last two samples. Since there is an uncertainty on the size factors, I guess we could take it into account in the calculation of p-values.
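The aggregation I did looks roughly like this (a hypothetical sketch: `runs` stands in for the 50 SVCD result vectors, and the values are made up, with only two samples shown):

```python
import statistics

# Made-up stand-in for the 50 SVCD runs; each dict maps sample -> size factor.
runs = [
    {"A1": 0.70, "D1": 1.80},
    {"A1": 0.74, "D1": 1.90},
    {"A1": 0.72, "D1": 1.85},
]

# Per-sample mean and standard deviation across runs.
summary = {
    s: (statistics.mean(r[s] for r in runs),
        statistics.stdev([r[s] for r in runs]))
    for s in runs[0]
}
for s, (m, sd) in summary.items():
    print(f"{s}: mean={m:.2f} std={sd:.2f}")
```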
Hi Michael,
There are several details that it would be interesting to clarify.
The uncertainty in the column std does not account for the full uncertainty, or estimation error, in the normalization factors. The reason for this is that those numbers only represent the estimation error for the factors obtained in the between-condition normalization. You are "seeing" this error as a side effect of the equalization of the sample numbers per condition.
You can convince yourself of this by running the normalization several times without including the samples of condition D. You will obtain the same normalization factors in all runs. This is what happens for any normalization method without randomized steps. And crucially, this does not imply that there is no estimation error or uncertainty in the resulting normalization factors.
Back to your dataset (condition D included), if you obtain for each run the within-condition and between-condition normalization factors, you will see that the within-condition factors are equal in all runs, while the between-condition factors are the ones that vary. This explains why the numbers in the column std are almost equal for each condition.
Your idea of running the normalization several times is good. And by using the average of the obtained factors, you are improving their estimation. Also, the variability or uncertainty that you are seeing can be used to estimate the overall error in the normalization. But the estimation is a bit more complicated than getting the variance or standard deviation from the factors.
I would suggest the following:
1. Do all the calculations considering the factors as additive, that is, in log2 scale. This means that the numbers in the column mean should vary around 0, not around 1, and the numbers in the column std should be exactly equal for all the samples in the same condition.
2. Average the four different numbers in column std (squared, as variances) to obtain a final estimate of the error of the normalization between conditions.
3. Calculate the total estimation error with the following expression (again, errors squared, as variances):

   `error_total^2 = error_btw^2 * ( 2 * n_btw / n_wtn + 1 )`
The factor 2 comes from the between-condition factors being themselves an average of two normalization factors (in your dataset). Each n represents the number of features used in each kind of normalization, whose error is proportional to the inverse of n.
This total error corresponds to the factors obtained from a single normalization. If you average the factors from N normalizations, you should substitute the 1 in the sum of the last expression by the inverse of N, which reflects the reduction of error resulting from the averaging, which only affects the between-condition "part" of the error.
Consider this total error as a measure of the uncertainty in the estimation of the normalization factors. As a standard deviation and in relative terms, I would expect something around 1-5%.
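A minimal numeric sketch of these three steps (the feature counts, per-condition stds, and number of runs below are all placeholders, not values from this dataset):

```python
import math

# Illustrative placeholders, not real data.
n_wtn = 12000   # usable features (no missing values), within-condition
n_btw = 800     # features found non-varying, between-condition
n_runs = 50     # number of normalizations whose factors were averaged
k = 2           # minimum number of samples per condition

# Step 2: pool the per-condition stds (log2 scale) as variances.
stds_log2 = [0.15, 0.12, 0.08, 0.19]          # one value per condition
var_btw = sum(s ** 2 for s in stds_log2) / len(stds_log2)

# Step 3: total estimation error; the between part (the "1" in the sum)
# shrinks to 1/n_runs because the factors of n_runs runs were averaged.
var_total = var_btw * (k * n_btw / n_wtn + 1 / n_runs)
print(f"total normalization error (log2 sd): {math.sqrt(var_total):.4f}")
```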
It would be possible to double-check all this if one had a good statistical model for the data, without introducing normalization factors. In that case, by repeating the normalizations as you are doing, the obtained normalization factors would just be the estimation errors, and their magnitude should match the estimate above. You could try the negative binomial model implemented by `makeExampleDESeqDataSet()` in DESeq2, as in one of the examples here, adjusting the means and variances to your data.
Finally, how can this estimation error be used? It provides you a measure of the uncertainty in the overall level for each sample, which can obviously affect downstream analyses. For example, any differential expression of similar order of magnitude to this estimation error should not be trusted.
With regard to p-values, it depends on which hypothesis test those p-values refer to. If you are asking about the normalization factors themselves, that is, whether they are significant or not (different from 0 in log2 scale, I assume), then with the values that you are showing in the table they really do look significant. To say something more precise, a statistical model would be needed for the factors and their variation; this depends on the design of the experiment, the biological variation in the samples, and the normalization procedure. A simpler approach could be tried by assuming some theoretical distribution for the normalization factors under the null hypothesis. But again, with the numbers that you are obtaining, you could compare the standard deviation of the normalization factors themselves with the estimation error as a standard deviation (both in log2 scale), and use Chebyshev's inequality to obtain a rough, but I would expect quite small, p-value.
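As a hedged sketch of that last suggestion (the factor value and error below are illustrative, not taken from the table):

```python
import math

def chebyshev_p(observed_log2, sd_log2):
    # Chebyshev bound under a null of no shift (log2 factor = 0):
    # P(|factor| >= |observed|) <= (sd / |observed|)^2, capped at 1.
    if observed_log2 == 0:
        return 1.0
    return min(1.0, (sd_log2 / abs(observed_log2)) ** 2)

# E.g. a factor of 0.85 in linear scale, with a 3% estimation error (log2 sd).
p = chebyshev_p(math.log2(0.85), 0.03)
print(f"p <= {p:.4f}")
```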
I hope this helps. Please ask if something remains unclear.
Dear Dr. Roca, Thanks for your answer.
The equation is a bit unclear to me. Does the "n" come from `normalize.result$h0.features` returned by SVCD (there is only one type of feature), or from this explanation below from your paper? Moreover, is the factor "2" for the 4 conditions, rather than 3 for A, B and C?
As a quick approach, if I had the error for each normalization factor, I could calculate the error on the log fold change by propagation of uncertainty, and just exclude the DEGs whose interval overlaps 0. Because DESeq2 also provides an error, I guess I would take the highest one.
> In the within-condition normalizations, the features are gene expression levels, with one observation per sample of the corresponding experimental condition. In the between-condition normalization, the features are means of gene expression levels, with one observation per condition.
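The propagation idea could be sketched like this (the normalization error, the z cutoff, and the fold changes are all made-up numbers):

```python
import math

sd_norm = 0.05                     # assumed per-sample normalization error, log2 sd
sd_lfc = math.sqrt(2) * sd_norm    # two independent contributions to a fold change
z = 1.96                           # ~95% interval

# Made-up log2 fold changes; keep only those whose interval excludes 0.
genes = {"g1": 0.9, "g2": 0.1, "g3": -1.4}
kept = {g: lfc for g, lfc in genes.items() if abs(lfc) > z * sd_lfc}
print(kept)
```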
Hi Michael,
The two n's in the equation above correspond to each kind of normalization, as indicated. For the normalizations within condition, n is simply the total number of usable features in the dataset; usable features here means features that have no missing values for any sample. For the normalization between conditions, n is the number of features identified as non-varying across conditions, which can be obtained with `length(normalize.result$h0.feature)`.
The number 2 in the equation comes from the minimum number of samples per condition in the dataset. From the table above, I understand that your dataset has 3 samples for conditions A, B, and C, and 2 samples for condition D. Loosely speaking, the reason for a factor k is that the "between-condition factors" are averages of k "within-condition factors", and therefore the variance of the latter is k times the variance of the former.
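This variance relation is easy to check with a quick simulation (hypothetical standard-normal factors, not SVCD output):

```python
import random
import statistics

random.seed(0)
k = 2
# The average of k independent factors has 1/k times the single-factor variance.
singles = [random.gauss(0, 1) for _ in range(100_000)]
averages = [sum(random.gauss(0, 1) for _ in range(k)) / k
            for _ in range(100_000)]
print(statistics.variance(singles))    # close to 1
print(statistics.variance(averages))   # close to 1/k = 0.5
```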
Concerning the use of the estimated normalization error(s) in downstream analyses, I would stick with a single measure of error for the case of SVCD normalization, because the distribution of error is the same for all samples, in the statistical model underlying this algorithm. I think that the best way to integrate this normalization error in the analysis of differential expression implemented by DESeq2 would be to use it to define a threshold for minimum differential expression. See for example the subsection "Hypothesis tests with thresholds on effect size" in the paper by Love et al. Note that this would make adequate use of the shrunken fold changes and dispersion factors (not the size factors) estimated by DESeq2.
Dear Dr. Roca,
Following your answers in issues #3 and #4: indeed, I have 3 conditions with 3 replicates and 1 condition with 2 replicates (one replicate was removed because of poor reproducibility). So the size factors generated by SVCD have random variations. In order to have more accurate data, I ran the normalization 50 times and put the means into DESeq2, but it is not clear to me how to take the uncertainties into account. What would be the best way forward?
Thanks a lot for your answers.