bentsherman opened this issue 5 years ago
Okay, time for some plots. Starting with a comparison of log2(x + 0) and log2(x + 1):

- TCGA.fpkm.txt (no normalization)
- TCGA.fpkm.0.txt (log2(x) transform)
- TCGA.fpkm.1.txt (log2(x + 1) transform)
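For reference, the two transforms amount to something like this. This is a minimal sketch, assuming the GEM is a tab-delimited genes-by-samples matrix; the file names just mirror the ones above:

```python
import numpy as np
import pandas as pd

# Load the expression matrix (genes x samples), tab-delimited with a header row.
emx = pd.read_csv("TCGA.fpkm.txt", sep="\t", index_col=0)

# log2(x): zeros come out as -inf (or nan once masked), which is where the
# missing values come from after this transform.
emx_log2 = np.log2(emx)

# log2(x + 1): always defined for FPKM values (x >= 0), so no missing values.
emx_log2p1 = np.log2(emx + 1)

emx_log2.to_csv("TCGA.fpkm.0.txt", sep="\t")
emx_log2p1.to_csv("TCGA.fpkm.1.txt", sep="\t")
```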
Now let's take the log2(x)-transformed GEM and compare the R and Python implementations of quantile normalization:

- TCGA.fpkm.0.r.txt (R implementation)
- TCGA.fpkm.0.py.txt (Python implementation)

And we'll compare the log2(x + 1) versions as well, since they don't contain nans:

- TCGA.fpkm.1.r.txt (R implementation)
- TCGA.fpkm.1.py.txt (Python implementation)
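For context, the core rank-and-average algorithm is simple when there are no missing values. This is just an illustrative pandas sketch of that approach, not the actual `normalize.py` code:

```python
import numpy as np
import pandas as pd

def quantile_normalize(emx: pd.DataFrame) -> pd.DataFrame:
    """Quantile-normalize the columns (samples) of a matrix with no nans."""
    # Reference distribution: mean across samples of each column sorted independently.
    reference = np.sort(emx.to_numpy(), axis=0).mean(axis=1)

    # Replace each value with the reference value at its rank within its column.
    ranks = emx.rank(method="first").astype(int) - 1
    normalized = emx.copy()
    for col in emx.columns:
        normalized[col] = reference[ranks[col].to_numpy()]
    return normalized
```

Tie handling is one obvious place where implementations can differ; this sketch breaks ties by order of appearance.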
Some observations from these plots:

- The log2(x + 1) method seems kind of weird: basically all of the values which would have been nan are now in [0, 1]. I'd be curious to see what the pairwise plots look like, but I suspect we would just end up filtering those values out with something like `--minexpr 1` (see the sketch after this list).
- The Python implementation definitely works when there aren't any nans, judging by the last plot. When there are nans, the distributions don't line up exactly.
- The R implementation seems to behave the opposite of the Python version: in the presence of nans it still transforms the sample distributions to be nearly identical. So why does it break down a little bit when there are no nans?
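For what it's worth, that filtering step would just be a threshold mask. A minimal sketch, where the `minexpr` name is borrowed from the flag above and the real flag's semantics may differ:

```python
import numpy as np
import pandas as pd

def apply_minexpr(emx: pd.DataFrame, minexpr: float = 1.0) -> pd.DataFrame:
    """Mask expression values below a minimum threshold as missing (nan)."""
    return emx.where(emx >= minexpr, np.nan)
```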
In the source code, the handling of NaNs is implemented as follows (a Python sketch of this target construction is given after the description).

First, we count the non-NaN entries in the column, denoted N, and store the non-NaN values, sorted, in an array Narray. Let Row be the number of rows in the matrix and Col the number of columns. The goal is to compute the average (the target value) at the current row position i.

We denote i/Row as P, the percentile of the current position within the total number of rows; this mapping to percentiles is the essence of quantile normalization and the reason behind its name.

We denote 1 + (N - 1) * P as index. Since N < Row (the column contains NaNs), this locates the position within the non-NaN data that corresponds to the same percentile, and the mean is computed from this position after the checks described below.

Due to the limited precision of double arithmetic, values that are very close to an integer are treated as that integer. The tolerance is controlled by a constant named DOUBLE_EPS in the code.

For instance, if 4 * DOUBLE_EPS > |num - 4| holds, num is considered equal to 4. The factor of 4 in the tolerance may look arbitrary; it is simply how the source code is written (a factor of 3 would work just as well, it is purely a precision threshold). In other words, if index is close to an integer, then Narray[index]/Col is taken as that column's contribution to the average at this position.

If index is exactly an integer, even better: there are no precision concerns, and the same calculation applies.

If index falls between two integers a and b = a + 1, we first compute La = index - a, the distance from index to a. We then compute (Narray[a] * (1 - La) + Narray[b] * La) / Col, i.e. the two endpoints are weighted by their proximity to index (linear interpolation) to obtain this column's contribution.

By mapping percentiles this way, an array smaller than Row is effectively stretched onto an array of size Row, which is how the target distribution is constructed when NaN values are present.
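Here is a minimal Python sketch of that target construction as I read the description above. The exact indexing conventions (whether i and index are 0- or 1-based, and whether the percentile uses Row or Row - 1 in the denominator) are my assumptions and may not match the actual C source:

```python
import numpy as np

DOUBLE_EPS = np.finfo(float).eps  # stand-in for the C constant

def qnorm_target_with_nans(matrix: np.ndarray) -> np.ndarray:
    """Build the target distribution for quantile normalization when
    columns may contain nans, following the description above.

    matrix: 2-D array of shape (Row, Col), genes x samples.
    Returns a 1-D array of length Row.
    """
    rows, cols = matrix.shape
    target = np.zeros(rows)

    for j in range(cols):
        col = matrix[:, j]
        narray = np.sort(col[~np.isnan(col)])  # sorted non-nan values
        n = narray.size                        # N: count of non-nan entries
        if n == 0:
            continue  # all-nan column contributes nothing (assumption)

        for i in range(rows):
            p = i / (rows - 1)        # percentile of this row position
            index = 1 + (n - 1) * p   # 1-based position within the non-nan data
            lo = int(np.floor(index))

            if abs(index - lo) <= 4 * DOUBLE_EPS:
                # index is (numerically) an integer: take that value directly
                target[i] += narray[lo - 1] / cols
            else:
                # interpolate between the flanking values, weighting each
                # endpoint by its proximity to index
                frac = index - lo
                target[i] += (narray[lo - 1] * (1 - frac) + narray[lo] * frac) / cols

    return target
```

If normalize.py handles nans differently (for example, by ranking only the non-nan values without this percentile interpolation), that alone could explain small differences in the nan case.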
The only reason why `normalize.R` is still around is because we have not been able to match the output of `normalize.quantiles()` in `normalize.py`. The scikit-learn implementation of quantile normalization can't handle missing values (nans), so we wrote our own implementation, since it's a simple algorithm. However, it doesn't match the R implementation. I've run some experiments to try to figure out the cause, so I will document my findings here.

I've been working with the old TCGA (5 cancer) matrix. First I applied a log2 transform to the matrix, then I applied quantile normalization using R and Python separately, producing two normalized matrices. I wrote a script to quantify the difference in expression values between two matrices, and I used this script to compare the two quantile implementations.
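That comparison script might look roughly like this; it's a sketch rather than the actual script, and it assumes both matrices are tab-delimited with the same shape:

```python
import sys
import numpy as np
import pandas as pd

def compare_matrices(path_a: str, path_b: str) -> None:
    """Report how much two expression matrices differ, element-wise."""
    a = pd.read_csv(path_a, sep="\t", index_col=0)
    b = pd.read_csv(path_b, sep="\t", index_col=0)

    # R replaces '-' with '.' in column names; undo that before aligning.
    # (Blunt fix: it assumes the original names contained no literal '.')
    b.columns = [c.replace(".", "-") for c in b.columns]
    b = b[a.columns].loc[a.index]

    diff = (a - b).abs()
    print("nan count (a, b):", a.isna().sum().sum(), b.isna().sum().sum())
    print("mean abs error:  ", np.nanmean(diff.to_numpy()))
    print("max abs error:   ", np.nanmax(diff.to_numpy()))

if __name__ == "__main__":
    compare_matrices(sys.argv[1], sys.argv[2])
```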
So at a broad glance, these two results are slightly different. Expression values range from about 0 to 20, so the average error is small but noticeable. The columns do actually match; it's just that R for some reason replaces '-' with '.' in the column names. Interestingly, the occurrences of nans in the two matrices are identical.
I thought that maybe the difference arises from how our implementation handles the nans, so I re-ran this test with matrices that don't have nans. To do that, I applied a log2(1 + x) transform instead of log2(x). Since raw expression values range from 0 to infinity, log2(1 + x) never produces a nan in this case.
Much better, but still slightly off. I really need to find the source code for the R implementation in order to know exactly how they do it. I found the R library but the implementation itself is written in C and I couldn't find the C source files.
I also need to post density plots here when I get the chance.
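For reference, per-sample density plots can be generated with something like the following sketch (the plotting choices here are mine, not pipeline code):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def plot_sample_densities(emx: pd.DataFrame, title: str, outfile: str) -> None:
    """Overlay a kernel density estimate of each sample (column) in the GEM."""
    fig, ax = plt.subplots()
    grid = np.linspace(np.nanmin(emx.to_numpy()), np.nanmax(emx.to_numpy()), 200)

    for col in emx.columns:
        values = emx[col].dropna()
        ax.plot(grid, gaussian_kde(values)(grid), linewidth=0.5)

    ax.set_title(title)
    ax.set_xlabel("expression (log2)")
    ax.set_ylabel("density")
    fig.savefig(outfile, dpi=150)
```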