SystemsGenetics / GEMprep

GNU General Public License v3.0

Verify quantile normalization results between R and python #3

Open bentsherman opened 5 years ago

bentsherman commented 5 years ago

The only reason normalize.R is still around is that we have not been able to match the output of normalize.quantiles() in normalize.py. The scikit-learn implementation of quantile normalization can't handle missing values (nans), so we wrote our own implementation, since it's a simple algorithm. However, it doesn't match the R implementation. I've run some experiments to try to figure out the cause, so I will document my findings here.
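For concreteness, here is a minimal NaN-aware quantile normalization sketch along the lines described above. This is my own illustration, not the actual normalize.py code; in particular, how ranks and the target distribution are computed when columns have different numbers of nans is exactly where implementations can diverge.

```python
import numpy as np

def quantile_normalize(X):
    """Minimal NaN-aware quantile normalization sketch (not GEMprep's code).

    Each column is ranked, the target distribution is the row-wise mean of
    the sorted columns (ignoring NaNs), and every value is replaced by the
    target value at its rank. NaNs stay NaN.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape

    # Sort each column; np.sort places NaNs at the end of the column.
    X_sorted = np.sort(X, axis=0)

    # Target distribution: mean across columns of the sorted values,
    # ignoring NaNs per row.
    target = np.nanmean(X_sorted, axis=1)

    out = np.full_like(X, np.nan)
    for j in range(n_cols):
        col = X[:, j]
        mask = ~np.isnan(col)
        # Ranks of the non-NaN values within this column.
        order = np.argsort(col[mask], kind="stable")
        ranks = np.empty(order.size, dtype=int)
        ranks[order] = np.arange(order.size)
        # A column with N < n_rows non-NaN values only reads target[0..N-1];
        # this naive choice is one plausible source of divergence from R.
        out[mask, j] = target[ranks]
    return out
```

Note the comment on the last step: mapping a short column onto the first N target values is the simplest choice, but not the only one, and the R code handles this case differently (see below).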

I've been working with the old TCGA (5 cancer) matrix. First I applied a log2 transform to the matrix, then I applied quantile normalization using R and python separately, producing two normalized matrices. I wrote a script to quantify the difference in expression values between two matrices, so I used this script to compare the two quantile implementations.

$ python bin/validate.py TCGA.fpkm.0.r.txt TCGA.fpkm.0.py.txt 
Loaded TCGA.fpkm.0.r.txt (73599, 2016)
Loaded TCGA.fpkm.0.py.txt (73599, 2016)
warning: column names do not match
number of mismatched nans: 0
min error:     0.000000
avg error:     0.535444
max error:    18.932696

So at a glance, these two results are slightly different. Expression values range from about 0 to 20, so the average error is small but noticeable. The columns do actually match; it's just that R for some reason replaces '-' with '.' in the column names. Interestingly, the occurrences of nans in the two matrices are identical.
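For reference, a comparison like the one validate.py reports can be sketched as follows. This is a hypothetical re-implementation, assuming the matrices are loaded as pandas DataFrames; the actual script may compute its statistics differently.

```python
import numpy as np
import pandas as pd

def compare_matrices(A, B):
    """Sketch of a validate.py-style comparison (not the actual script):
    check that NaNs occur in the same places, then report error statistics
    over the entries that are non-NaN in both matrices."""
    a, b = A.to_numpy(dtype=float), B.to_numpy(dtype=float)
    assert a.shape == b.shape
    if not (A.columns == B.columns).all():
        print("warning: column names do not match")

    nan_a, nan_b = np.isnan(a), np.isnan(b)
    mismatched = int((nan_a != nan_b).sum())

    # Element-wise absolute error over shared non-NaN positions.
    err = np.abs(a - b)[~nan_a & ~nan_b]
    return {
        "mismatched_nans": mismatched,
        "min": float(err.min()),
        "avg": float(err.mean()),
        "max": float(err.max()),
    }
```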

I thought that maybe the difference arises from how our implementation handles the nans, so I re-ran this test with matrices that don't have nans. To do that, I do a log2(1 + x) transform instead of log2(x). Since raw expression values range from 0 to infinity, log2(1 + x) will never return a nan in this case.
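A quick illustration of why the shifted transform avoids missing values, assuming FPKM values are non-negative. (Strictly, np.log2(0) is -inf rather than nan, but either way the value is non-finite and has to be treated as missing downstream.)

```python
import numpy as np

fpkm = np.array([0.0, 0.5, 1.0, 100.0])  # raw FPKM values are >= 0

with np.errstate(divide="ignore"):
    plain = np.log2(fpkm)        # log2(0) -> -inf (non-finite)
shifted = np.log2(1.0 + fpkm)    # always finite for x >= 0

# The shifted transform never produces non-finite values for x >= 0,
# so no missing-value handling is needed afterward.
```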

$ python bin/validate.py TCGA.fpkm.1.r.txt TCGA.fpkm.1.py.txt 
Loaded TCGA.fpkm.1.r.txt (73599, 2016)
Loaded TCGA.fpkm.1.py.txt (73599, 2016)
warning: column names do not match
number of mismatched nans: 0
number of errors: 56823902
min error:     0.000000
avg error:     0.020964 +/-     0.147583
max error:     5.201935

Much better, but still slightly off. I really need to find the source code for the R implementation in order to know exactly how they do it. I found the R library (normalize.quantiles() comes from the Bioconductor preprocessCore package), but the implementation itself is written in C, and I couldn't find the C source files.

Also need to post density plots here when I get the chance.

bentsherman commented 5 years ago

Okay, time for some plots. Starting with a comparison of log2(x + 0) and log2(x + 1).

TCGA.fpkm.txt (no normalization): [TCGA fpkm density plot]

TCGA.fpkm.0.txt (log2(x) transform): [TCGA fpkm 0 density plot]

TCGA.fpkm.1.txt (log2(x + 1) transform): [TCGA fpkm 1 density plot]
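For anyone reproducing these, a simple histogram-based density estimate can be computed with numpy alone. This is just a sketch; the plots above may have been produced with R's density() or matplotlib, so the exact curves will differ.

```python
import numpy as np

def density(values, bins=100):
    """Histogram-based density estimate of a GEM's expression values,
    a stand-in sketch for the density plots shown above."""
    values = values[np.isfinite(values)]  # drop NaN / -inf entries
    hist, edges = np.histogram(values, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin midpoints for plotting
    return centers, hist
```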

bentsherman commented 5 years ago

Now let's take the log2(x) transformed GEM and compare the R and python implementations of quantile normalization.

TCGA.fpkm.0.r.txt (R implementation): [TCGA fpkm 0 r density plot]

TCGA.fpkm.0.py.txt (Python implementation): [TCGA fpkm 0 py-nanmean density plot]

bentsherman commented 5 years ago

And we'll compare the log2(x+1) versions as well, because they don't contain nans.

TCGA.fpkm.1.r.txt (R implementation): [TCGA fpkm 1 r density plot]

TCGA.fpkm.1.py.txt (Python implementation): [TCGA fpkm 1 py-nanmean density plot]

bentsherman commented 5 years ago

Some observations from these plots:

zhyemqww commented 1 year ago

The algorithm is implemented as follows:

First, we obtain the count of non-NaN values in the column, denoted N, and store the sorted non-NaN values in an array Narray. We denote the number of rows in the matrix as Row and the number of columns as Col. Our goal is to calculate the average at the current position i.

We denote i/Row as P, representing the current position's percentile within the total number of rows. This is the essence of quantile normalization and the reason for its name.

We denote 1 + (N - 1) * P as index. Since N < Row when the column contains NaNs, this gives the corresponding (possibly fractional) position within the non-NaN data. After some precision checks, described next, we compute the mean using this position.

Due to precision issues with the double type, values that are close to an integer are treated as that integer. The tolerance is controlled by a constant named DOUBLE_EPS in the code.

For instance, if the condition 4 * DOUBLE_EPS > |index - 4| holds, index is considered equal to 4. You might wonder why we multiply by 4: that is simply how the source code is written; multiplying by 3 instead would also work, since it is only a precision tolerance. In other words, if index is close enough to an integer, then Narray[index]/Col is taken as that column's contribution to the average at this point.

If index happens to be exactly an integer, even better: no precision concerns arise, and we proceed with the calculation described above.

If index falls strictly between two integers a and b = a + 1, we first calculate La = index - a, the distance between index and a. Then we calculate (Narray[a] * (1 - La) + Narray[b] * La) / Col, i.e., we linearly interpolate between the two endpoints, weighting each by its proximity to index, to obtain the contribution to the average.

With this scheme, we effectively map an array of fewer than Row values onto an array of size Row. This is how the target distribution is constructed in the presence of NaN values.
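The scheme described above can be sketched in Python as follows. Note this uses 0-based indexing, so the position becomes (N - 1) * P rather than 1 + (N - 1) * P; the DOUBLE_EPS snapping and the interpolation are my rendering of the description, not the actual preprocessCore C code.

```python
import numpy as np

DOUBLE_EPS = np.finfo(float).eps  # stand-in for the C code's DOUBLE_EPS

def stretch_column(col, n_rows):
    """Map a column's N sorted non-NaN values onto n_rows positions by
    linear interpolation, snapping indices that are within 4*DOUBLE_EPS
    of an integer (0-based sketch of the scheme described above)."""
    narray = np.sort(col[~np.isnan(col)])   # sorted non-NaN values
    n = narray.size
    out = np.empty(n_rows)
    for i in range(n_rows):
        p = i / (n_rows - 1)                # percentile of row i
        index = (n - 1) * p                 # fractional position in narray
        if 4 * DOUBLE_EPS > abs(index - round(index)):
            # index is (numerically) an integer: take the value directly
            out[i] = narray[int(round(index))]
        else:
            a = int(np.floor(index))
            la = index - a                  # distance from lower endpoint
            out[i] = narray[a] * (1 - la) + narray[a + 1] * la
    return out

def nan_aware_target(X):
    """Build the length-Row target distribution by averaging the
    stretched columns (the division by Col in the description)."""
    n_rows = X.shape[0]
    cols = [stretch_column(X[:, j], n_rows) for j in range(X.shape[1])]
    return np.mean(cols, axis=0)
```

For example, a column [1, NaN, 3] in a 3-row matrix stretches to [1, 2, 3]: the middle position has percentile 0.5, giving index 0.5, which interpolates halfway between 1 and 3.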