Closed gtrichard closed 6 years ago
Lieberman-Aiden (2009) gives no scale for the eigenvectors, Fortin and Hansen (2015) have always values in the -1 to 1 range. On the other hand, the HOMER documentation uses in their example a dynamic range of -39.92 to 65.59 for the first and -34.21 to 35.93 for the second eigenvector. The range of the eigenvector values depends on the values of the matrix it is based on and second, on the associated eigenvalue. What we do in HiCExplorer is the same as in Lieberman-Aiden. We compute a obs_exp matrix and based on this a Pearson correlation matrix. Important to understand now the range of the eigenvector values is, that the Pearson correlation itself has a range of -1 (negativ correlation) over 0 (no correlation) to 1 (perfect positive correlation). Based on the Pearson correlation matrix we compute the covariance matrix which gives, with a matrix with values in -1, 1 range, in many cases values in this range too. Therefore the eigenvector values are usually in the same range. The reason that HOMER gives you higher eigenvector values is, that they use a different approach to normalize that matrix.
@DonStephano did a correlation of our eigenvectors with the ones from HOMER and achieved a correlation of 0.96. It would be very helpful for us if you can confirm this result with your own data.
Edit: Less important than the range of the values is the correlation with the distribution of genes and with features of open / closed chromatin, see e.g. Lieberman-Aiden 2009, Fig 3G.
Ok, so then I can use these data, that's a very good point, I'll just have to orientate the positive and negative values according to the correlation of positive / negative values depending on chromatin states / gene density / gene expression. It's a possibility that one sample starts as correctly oriented while another one is reversed right?
I'll take a look at HOMER's result on these data. Since I have to reprocess the data from scratch with HOMER it might take sometime.
Thank you.
Yes the values flip sometimes their sign. I think this is caused by the usage of approximate algorithms to calculate the eigenvalues / eigenvectors (please correct me on this if I'm wrong). Additional, you must always take into account the second eigenvector as well. The first one and vice versa the second one do not describe all features of a matrix, that is the nature of eigenvalues / eigenvectors.
Unfortunately, I hadn't time to check more PC1-correlations of HiCExplorer and HOMER. I will do this after Xmas and try both tools with the same data. I keep you updated.
I found that merging bins gives a greater range of values allowing a better scoring of differences between multiple PCA1 eigenvector (different conditions).
Hey I also have a question regarding to HiCPCA. Why the eigen vectors were not sorted by their corresponding eigen values? Don't we need those with the highest value to be returned?
I think that the eigenvectors are sorted by how much of the variability of the data each of them explain.
@joachimwolff Can you confirm?
Valid point @LeilyR.
We don't sort the eigenvectors according to their corresponding eigenvalue and these by the highest magnitude. Lieberman-Aiden (Comprehensive mapping of long-range interactions reveals folding principles of the human genome. 2009) writes they do so, so I think we should do it too. Nevertheless, we checked the correlation of our results with those from HOMER and had a high correlation of > 0.9
.
I have to check if the sorting is done implicitly by the used eigen function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eig.html
Thank you Joachim.
Also as far as I know the scipy eigen function returns a non-sorted one.
Hello,
I ran hicPCA from the develop branch (commit 6c4210a) on multiple Hi-C matrices from the Drosophila genome (dm6), including already published ones, and I got really low PCA values (in the -0.07 to 0.07 range) along the entire genome. Those values should be higher, like in the range of -40 to 40 right? Because difference between samples with such low values are a bit hard to quantify.
Below, you can find a link to the matrix file and the bedgraph file containing the pca1 score produced with the following command line:
https://drive.google.com/open?id=1SN9oiuFRrfjKGJXhHINERv5BvLhoas5G