Closed · karl616 closed this issue 7 years ago
Hi karl616,
In general, one would need to know which step is failing to converge: that is, which condition among the within-condition normalizations, or which of the two between-condition normalizations (the one looking for no-variation genes, or the actual one). With that information, one would try to increase the corresponding value in the parameter convergence.threshold.
If you wish, I can guide you in doing so. For that, I would need to see the convergence log, which is printed to the console with the parameter verbose = TRUE. Note that the log doesn't show your data, and you can also change the condition names to something like a, b, c, d. It would also be interesting to see what happens with normalize.mediancd.
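A minimal sketch of such a run (the functions normalize.svcd and normalize.mediancd and the parameter verbose are named in this thread; the positional arguments expression.data and condition, and the renaming step, are assumptions):

```r
# Sketch only: run both normalizations with the convergence log enabled.
# Renaming conditions to neutral labels keeps a shared log anonymous.
levels( condition ) <- letters[ seq_along( levels( condition ) ) ]

svcd.results <- normalize.svcd( expression.data, condition, verbose = TRUE )
mediancd.results <- normalize.mediancd( expression.data, condition, verbose = TRUE )
```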
Thanks! It took some time to generate the mediancd normalization, but it came through without stopping. I have attached the logs from both methods.
Do I understand the normalize.svcd log correctly, in that it breaks down during the normalization of the second group? And does this correspond to the single-step parameters?
Hi karl616,
Thanks to you for using these normalization methods!
And yes, the log shows that the problem of normalize.svcd was with the second group. I see that the run had converged from about the 15th step, but the error was not small enough for the default convergence parameters.
To be on the safe side, I assigned quite strict default convergence parameters. The third column of the log gives the estimate of the (relative) statistical error, while the fourth gives the actual relative difference between steps. This numerical error needs to fall below 0.01 times the estimated statistical error in any single step, or below 0.1 times the estimated statistical error in 10 consecutive steps. Note that normalize.svcd had a relative error, for the second group and from the 15th step onward, of about 0.02%. You could try changing the convergence parameters from the default value of convergence.threshold = c(0.01, 0.1, 0.01, 1) to convergence.threshold = c(0.01, 0.2, 0.01, 1), and it should converge easily.
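The suggested change, as a sketch (only the parameter convergence.threshold and its values are confirmed here; the rest of the call is an assumption):

```r
# Relax the 10-consecutive-steps criterion from 0.1 to 0.2;
# the default is convergence.threshold = c( 0.01, 0.1, 0.01, 1 ).
norm.results <- normalize.svcd( expression.data, condition,
    convergence.threshold = c( 0.01, 0.2, 0.01, 1 ),
    verbose = TRUE )
```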
For the normalize.mediancd log, the convergence looks fine. It has identified around 43,500 features for normalizing between conditions, which is a very good number because it is quite large.
However, there is something in both logs that strikes me a bit: the values in the second column, which are the "amplitudes" of the detected normalization factors (calculated as the standard deviation of these numbers, considered as samples themselves). I would guess that you are not transforming your data to log2 scale or similar. If that is the case, I would suggest that you do so, at least for normalizing. Internally, both svcd and mediancd normalization assume this kind of scale, because they look for the normalization factors as additive factors. In any case, it is always possible to use these normalization methods just to extract the normalization factors, and then apply the factors to the data in whatever scale you prefer for your problem. Surely you know that if you have count data, like RNA-Seq data, a simple option is using pseudo-counts log2( count + 1 ).
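The pseudo-count transform mentioned above, as a one-line sketch (count.data is a placeholder name for your raw count matrix):

```r
# Pseudo-counts: move count data onto the additive (log2) scale
# assumed by the normalization methods.
expression.data <- log2( count.data + 1 )
```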
Other values that are somewhat out of range in the normalize.svcd log are those between [], which give the Watson U2 statistic for each iteration. These numbers indicate the degree of permutation symmetry achieved during the normalization. Ideally, the values in the last steps should be below 1, but real data usually doesn't go that far. If your data is not in log scale, then I would expect much lower values with data transformed in that way.
I will add a bit more information to the help pages, to indicate what the columns in the logs mean and to clarify the expected log scale of expression.data. The paper explains this, but the help pages don't mention it explicitly.
Thank you very much for the help, and I apologize for not doing my research correctly. This is a problematic data set, and when I found your article I thought I'd give it a chance. I missed the log-transform requirement. It solved the issue: at least it converges now, and without relaxing the thresholds.
Regarding the high numbers, could they be influenced by the size of the data set? My data set consists of ~160k loci. I'm not sure how that impacts the scores.
Hi karl616,
You are welcome. And no problem, no need to apologize. The help pages could have been more explicit about the log scale of the input data.
Also, the Watson U2 statistic is virtually independent of the number of samples when n > 100, for a fixed alpha. However, when there are distribution differences between two collections of samples (for the use of this statistic here, when the sample data is not permutationally symmetric), it grows linearly with the number of samples n. So yes, having on the order of 10^5 features can cause quite large values of U2. Currently, I do not have a satisfactory explanation for this issue in general. My current take is that it is caused by some feature of the data itself, not by the normalization or any previous preprocessing step. This is one of the questions I am working on now.
If the processing of your data is challenging, I would suggest the following: the normalization factors are available in dataNorm.results$offset, and the factors from the within- and between-condition normalizations are also similarly available. By comparing these factors you will gain an intuition about how much the normalization step would affect downstream analyses. The parameter restrict.feature may also be useful here.

I am happy to help if you have more questions about these normalization methods. And again, many thanks for using them.
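A sketch of how the extracted factors could be inspected and applied (dataNorm.results$offset is named above; treating it as a per-sample additive factor on the log2 scale is an assumption, as is the rest of the call):

```r
# Sketch: extract the normalization factors and apply them manually.
dataNorm.results <- normalize.svcd( expression.data, condition )

# Compare the magnitude of the factors with the signal of interest.
summary( dataNorm.results$offset )

# Apply the (additive, log2-scale) factors in whatever scale you prefer,
# here by subtracting each sample's factor from its column.
normalized.data <- sweep( expression.data, 2, dataNorm.results$offset, "-" )
```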
Dear Dr. Roca, while working with your normalization methods, I noticed an interesting behavior related to this topic. Especially with MedianCD, I also have some convergence problems, even with pseudo-counts. But when I run the script again, the problem is gone. Analyzing the log, I see this kind of pattern (not always the same):
104 0.363576 0.268636 00 00 44
105 0.329357 0.622981 00 00 64
106 0.348646 0.354618 00 00 55
107 0.272929 0.380057 00 00 98
108 0.288937 0.190161 00 00 64
109 0.28848 0.1818 00 00 64
110 0.363576 0.268636 00 00 44
111 0.329357 0.622981 00 00 64
112 0.348646 0.354618 00 00 55
113 0.272929 0.380057 00 00 98
114 0.288937 0.190161 00 00 64
115 0.28848 0.1818 00 00 64
116 0.363576 0.268636 00 00 44
117 0.329357 0.622981 00 00 64
118 0.348646 0.354618 00 00 55
119 0.272929 0.380057 00 00 98
120 0.288937 0.190161 00 00 64
121 0.28848 0.1818 00 00 64
It looks like a local minimum. It is not a real problem: when I see that, I just stop the script and run it again. Do you know why this happens?
Thanks for your time.
Hi Michael,
The last number in each row shows the number of no-variation features found so far for normalizing between conditions. These numbers are quite low, which implies that the associated errors in the estimation of the normalization factors are quite high, so the algorithm does not converge. This is in fact the "correct" behavior, because the algorithm requires that a sufficient number of no-variation features be detected, so that the errors in the normalization factors are small enough.
Median normalization is a poorer (noisier) estimator of the normalization factors than Standard-Vector normalization, which would explain why you see this effect especially with MedianCD normalization. Apart from considering carefully the distribution of variance vs. mean of the samples, and ignoring low-count features, for MedianCD I would also consider the distributions across features, sample by sample. It may be possible to identify a quantile whose estimator has lower variance than the median itself, and use it for normalizing via the parameter normalization.probability.
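For instance, a different quantile could be tried like this (only the parameter normalization.probability is confirmed in this thread; the value 0.75 and the rest of the call are purely illustrative assumptions):

```r
# Sketch: normalize on the upper quartile instead of the median,
# in case its estimator has lower variance for this data.
norm.results <- normalize.mediancd( expression.data, condition,
    normalization.probability = 0.75,
    verbose = TRUE )
```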
Obtaining different results from exactly the same data set and call arguments can only happen when, in the actual data used for normalizing, the conditions have different numbers of samples. This is explained in detail in the supplementary material of the paper, but the idea is that all the condition means (under the null hypothesis) must be samples of the same distribution (per feature). This does not hold when the conditions have different numbers of samples, so random sampling is done to "equalize" the number of samples across conditions. This is the only random step in the calculations, both for MedianCD and SVCD normalization.
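Since this subsampling is the only stochastic step, fixing R's random seed before the call should make runs reproducible (set.seed is a base R facility, not a feature of these methods; the call itself is a sketch):

```r
# Fix the seed so the random equalization of sample numbers,
# and hence the whole run, is reproducible.
set.seed( 20230101 )
norm.results <- normalize.mediancd( expression.data, condition, verbose = TRUE )
```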
I hope this answers your question. Many thanks again for using these normalization methods.
Hi,
would you have any recommendations on how to handle non-converging data sets?
I tried to use normalize.svcd on a matrix of 3 data sets x 4 conditions. It didn't converge. My naive wish was to increase the number of iterations, but this doesn't seem to be the way to do it...
Are there any guidelines available?