carlosproca / cdnormbio

R package: Condition-Decomposition Normalization for Biological Applications
MIT License
4 stars 2 forks source link

SVCD in the case of an inequal number of replicates per condition #5

Closed michaelpierrelee closed 7 years ago

michaelpierrelee commented 7 years ago

Dear Dr. Roca,

Following your answers in the issues #3 and #4, indeed, I have 3 conditions with 3 replicates and 1 condition with 2 replicates (one was removed because of a bad reproducibility). So the size factors generated by SVCD have random variations. In order to have more accurate data, I ran 50 times the normalization and I put the means into DESeq2, but it is not clear for me how to take the uncertainties into account. What could be the best way forward?

Thanks a lot for your answers.

carlosproca commented 7 years ago

Hi Michael,

To answer, I would need to know with a bit more detail what you are asking for. What do mean exactly by taking the uncertainties (of the SVCD factors) into account? Do you mean understanding or having an intuition about how these uncertainties would affect downstream analyses?

michaelpierrelee commented 7 years ago

For instance, after run 50 times SVCD, I got the mean and the standard deviation for each size factor:

sample mean std
A1 0.72 0.11
A2 0.63 0.1
A3 0.64 0.1
B1 0.73 0.08
B2 0.69 0.08
B3 0.81 0.09
C1 1.18 0.06
C2 1.14 0.06
C3 1.09 0.06
D1 1.85 0.25
D2 1.85 0.25

So, I am wondering if I have to take the means as size factors and not just the values from one run, especially for the last two. Because there is an uncertainty on the size factors, I guess we could take that into account in the calculation of p-values.

carlosproca commented 7 years ago

Hi Michael,

There are several details that would be interesting to clarity.

The uncertainty in the column std does not account for the full uncertainty, or estimation error, in the normalization factors. The reason for this is that those numbers only represent the estimation error for the factors obtained in the between-condition normalization. You are "seeing" this error as a side effect of the equalization of the sample numbers per condition.

You can convince yourself of this by running the normalization several times without including the samples of condition D. You will obtain the same normalization factors in all runs. This is what happens for any normalization method without randomized steps. And crucially, this does not imply that there is no estimation error or uncertainty in the resulting normalization factors.

Back to your dataset (condition D included), if you obtain for each run the within-condition and between-condition normalization factors, you will see that the within-condition factors are equal in all runs, while the between-condition factors are the ones that vary. This explains why the numbers in the column std are almost equal for each condition.

Your idea of running the normalization several times is good. And by using the average of the obtained factors, you are improving their estimation. Also, the variability or uncertainty that you are seeing can be used to estimate the overall error in the normalization. But the estimation is a bit more complicated than getting the variance or standard deviation from the factors.

I would suggest the following:

The factor 2 comes from the between-condition factors being themselves an average of two normalization factors (in your dataset). Each n represents the number of features used in each kind of normalization, whose error is proportional to the inverse of n.

This total error corresponds to the factors obtained from a single normalization. If you average the factors from N normalizations, you should substitute the 1 in the sum of the last expression by the inverse of N, which reflects the reduction of error resulting from the averaging, which only affects the between-condition "part" of the error.

I hope this helps. Please ask if something remains unclear.

michaelpierrelee commented 7 years ago

Dear Dr. Roca, Thanks for your answer.

The equation is a bit unclear for me. Does the "n" come from normalize.result$h0.features returned by SVCD, there is only one type of features, or this explanation below from your paper? Moreover, the factor "2" is for the 4 conditions, not 3 for A, B and C?

For a quick way, if I got the error for each normalization factor, I could calculate the error on the log fold-change, by propagation of uncertainty, and just exclude the DEGs overlapping 0. Because DESeq2 also provides one, I guess I would take the highest error.

In the within-condition normalizations, the features are gene expression levels, with one observation per sample of the corresponding experimental condition. In the between-condition normalization, the features are means of gene expression levels, with one observation per condition.

carlosproca commented 7 years ago

Hi Michael,

The two n 's in the equation above correspond to each kind of normalization, as indicated. For the normalizations within condition, n is simply the total number of usable features in the dataset. Usable features means here features that have no missing values for any sample. For the normalization between conditions, n is the number of features identified as non-varying across conditions, which can be obtained with length( normalize.result$h0.feature ).

The number 2 in the equation comes from the minimum number of samples per condition in the dataset. From the table above, I understand that your dataset has 3 samples for conditions A, B, and C, and 2 samples for condition D. Loosely speaking, the reason for a factor k is that the "between-condition factors" are averages of k "within-condition factors", and therefore the variance of the latter is k times the variance of the former.

Concerning the use of the estimated normalization error(s) in downstream analyses, I would stick with a single measure of error for the case of SVCD normalization, because the distribution of error is the same for all samples, in the statistical model underlying this algorithm. I think that the best way to integrate this normalization error in the analysis of differential expression implemented by DESeq2 would be to use it to define a threshold for minimum differential expression. See for example the subsection "Hypothesis tests with thresholds on effect size" in the paper by Love et al. Note that this would make adequate use of the shrunken fold changes and dispersion factors (not the size factors) estimated by DESeq2.