deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
233 stars 70 forks source link

hicPCA low values? #160

Closed gtrichard closed 6 years ago

gtrichard commented 6 years ago

Hello,

I ran hicPCA from the develop branch (commit 6c4210a) on multiple Hi-C matrices from the Drosophila genome (dm6), including already published ones, and I got really low PCA values (in the -0.07 to 0.07 range) along the entire genome. Those values should be higher, like in the range of -40 to 40 right? Because difference between samples with such low values are a bit hard to quantify.

Below, you can find a link to the matrix file and the bedgraph file containing the pca1 score produced with the following command line:

hicPCA \
-m matrix_nb50_corrected_-2_2.h5 \
-o matrix_nb50_pca1_full.bedgraph\
-noe 1 \
-f bedgraph
INFO:hicexplorer.HiCMatrix:Number of poor regions to remove: 297 {'2R': 57, '3R': 13, '4': 6, '3L': 68, '2L': 51, 'X': 102}
INFO:hicexplorer.HiCMatrix:found existing 297 nan bins that will be included for masking 
/data/akhtar/group/richard/Softwares/anaconda2/lib/python2.7/site-packages/hicexplorer/utilities.py:947: RuntimeWarning: divide by zero encountered in double_scalars
  expected_interactions[i] /= pLength_chromosome - (pChromosome_count * i)

https://drive.google.com/open?id=1SN9oiuFRrfjKGJXhHINERv5BvLhoas5G

joachimwolff commented 6 years ago

Lieberman-Aiden (2009) gives no scale for the eigenvectors, Fortin and Hansen (2015) have always values in the -1 to 1 range. On the other hand, the HOMER documentation uses in their example a dynamic range of -39.92 to 65.59 for the first and -34.21 to 35.93 for the second eigenvector. The range of the eigenvector values depends on the values of the matrix it is based on and second, on the associated eigenvalue. What we do in HiCExplorer is the same as in Lieberman-Aiden. We compute a obs_exp matrix and based on this a Pearson correlation matrix. Important to understand now the range of the eigenvector values is, that the Pearson correlation itself has a range of -1 (negativ correlation) over 0 (no correlation) to 1 (perfect positive correlation). Based on the Pearson correlation matrix we compute the covariance matrix which gives, with a matrix with values in -1, 1 range, in many cases values in this range too. Therefore the eigenvector values are usually in the same range. The reason that HOMER gives you higher eigenvector values is, that they use a different approach to normalize that matrix.

@DonStephano did a correlation of our eigenvectors with the ones from HOMER and achieved a correlation of 0.96. It would be very helpful for us if you can confirm this result with your own data.

Edit: Less important than the range of the values is the correlation with the distribution of genes and with features of open / closed chromatin, see e.g. Lieberman-Aiden 2009, Fig 3G.

gtrichard commented 6 years ago

Ok, so then I can use these data, that's a very good point, I'll just have to orientate the positive and negative values according to the correlation of positive / negative values depending on chromatin states / gene density / gene expression. It's a possibility that one sample starts as correctly oriented while another one is reversed right?

I'll take a look at HOMER's result on these data. Since I have to reprocess the data from scratch with HOMER it might take sometime.

Thank you.

joachimwolff commented 6 years ago

Yes the values flip sometimes their sign. I think this is caused by the usage of approximate algorithms to calculate the eigenvalues / eigenvectors (please correct me on this if I'm wrong). Additional, you must always take into account the second eigenvector as well. The first one and vice versa the second one do not describe all features of a matrix, that is the nature of eigenvalues / eigenvectors.

DonStephano commented 6 years ago

Unfortunately, I hadn't time to check more PC1-correlations of HiCExplorer and HOMER. I will do this after Xmas and try both tools with the same data. I keep you updated.

gtrichard commented 6 years ago

I found that merging bins gives a greater range of values allowing a better scoring of differences between multiple PCA1 eigenvector (different conditions).

LeilyR commented 6 years ago

Hey I also have a question regarding to HiCPCA. Why the eigen vectors were not sorted by their corresponding eigen values? Don't we need those with the highest value to be returned?

gtrichard commented 6 years ago

I think that the eigenvectors are sorted by how much of the variability of the data each of them explain.

@joachimwolff Can you confirm?

joachimwolff commented 6 years ago

Valid point @LeilyR. We don't sort the eigenvectors according to their corresponding eigenvalue and these by the highest magnitude. Lieberman-Aiden (Comprehensive mapping of long-range interactions reveals folding principles of the human genome. 2009) writes they do so, so I think we should do it too. Nevertheless, we checked the correlation of our results with those from HOMER and had a high correlation of > 0.9.

I have to check if the sorting is done implicitly by the used eigen function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eig.html

LeilyR commented 6 years ago

Thank you Joachim.

Also as far as I know the scipy eigen function returns a non-sorted one.