gersteinlab / HiC-spector

Spectral and reproducibility analysis of Hi-C contact maps

Important difference between Julia and Python implementations #8

Open · srcoulombe opened this issue 3 years ago

srcoulombe commented 3 years ago

Hi, I think I've found a pretty important difference between the Julia and Python versions of HiC-Spector.

The Julia implementation uses (a1,b1)=eigs(...,nev=num_evec,which=:LM); the returned subset of num_evec eigenvectors will therefore be the ones associated with the eigenvalues of largest magnitude (see the docs: https://julialinearalgebra.github.io/Arpack.jl/latest/index.html#Arpack.eigs-Tuple%7BAny%7D). However, the Python implementation uses the keyword "SM" when calling eigsh: a1, b1=eigsh(...,k=num_evec,which="SM"); eigsh therefore returns the eigenvectors associated with the smallest eigenvalues (docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.eigsh.html).

The latter method seems to be the one discussed in the Bioinformatics paper (https://academic.oup.com/bioinformatics/article/33/14/2199/3078603). Unless I'm missing something, this is a pretty significant difference and should be fixed quickly.

quantum-man commented 3 years ago

This is because the Julia version is getting the eigenvectors corresponding to the largest eigenvalues of I - L, whereas the Python version is getting the eigenvectors corresponding to the smallest eigenvalues of L. They are actually the same set of eigenvectors. There's nothing wrong.
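
You can check this with a full eigendecomposition, where no subset selection is involved (a toy numpy example, not the HiC-spector code):

import numpy as np

# toy symmetric matrix standing in for a Laplacian
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
L = (M + M.T) / 2

# full (not truncated) eigendecompositions of L and of I - L
mu, V = np.linalg.eigh(L)              # eigenvalues in ascending order
nu, W = np.linalg.eigh(np.eye(5) - L)  # eigenvalues in ascending order

# the spectra are related by mu -> 1 - mu (the ascending sort reverses the order)
assert np.allclose(nu, (1 - mu)[::-1])
# the eigenvectors coincide, up to the usual sign ambiguity
assert np.allclose(np.abs(W), np.abs(V[:, ::-1]))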

srcoulombe commented 3 years ago

My apologies, what I wrote didn't state the issue clearly. I agree with your point on the eigendecomposition of I - L with :LM versus that of L with :SM: if all possible eigenpairs were computed, both approaches would return the same set of eigenpairs. But my point was about how the "SM"/"LM" (:SM/:LM) parameter affects which r eigenpairs are kept.

The article uses the convention of ordering eigenvalues in a non-decreasing order, and states that:

  1. the leading eigenvectors are more important than the lower ranked eigenvectors.

  2. the high-order eigenvectors are essentially noise terms, whereas the signal is stored in the leading vectors.

I honestly don't have a firm grasp on the intuition behind these statements, so I'll take them at face value (I'd be grateful if you could link to a reference to help me understand). But my interpretation of these two points is that the r eigenvectors associated with the r eigenvalues of smallest magnitude should be computed and sorted in non-decreasing order according to the magnitude of their corresponding eigenvalue. The sorted eigenvectors are then used in the computation of the distance metric.
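
To make that interpretation concrete, here's a rough sketch of the selection, sorting, and pairing I have in mind (my own Python, not the repo's code; treating the distance as a sum of Euclidean distances between paired eigenvectors, compared up to sign, is my assumption):

import numpy as np
from scipy.sparse.linalg import eigsh

def leading_eigvecs(L, r):
    # r eigenpairs with the smallest-magnitude eigenvalues,
    # columns sorted by non-decreasing |eigenvalue|
    vals, vecs = eigsh(L, k=r, which="SM")
    order = np.argsort(np.abs(vals))
    return vecs[:, order]

def eigvec_distance(L1, L2, r):
    # pair the i-th leading eigenvector of each matrix and sum the
    # Euclidean distances, comparing each pair up to a sign flip
    # (eigenvectors are only defined up to +/- 1)
    V1, V2 = leading_eigvecs(L1, r), leading_eigvecs(L2, r)
    return sum(min(np.linalg.norm(V1[:, i] - V2[:, i]),
                   np.linalg.norm(V1[:, i] + V2[:, i]))
               for i in range(r))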

However, this does not seem to be the case in the current Julia implementation; using eigs(speye(length(i_nz1))-Ln1_nz1,nev=num_evec,which=:LM) returns the num_evec eigenpairs of (I - L)'s spectrum with the largest magnitudes, in non-increasing order. If my understanding of point 2 is correct, this means the downstream computation of the distance is based on noise rather than signal.

Here's some code to give an example in case my writing was still confusing. I should also emphasize that I'm using a more recent version of Julia than the ones used for testing.

Julia (1.5.2):

using LinearAlgebra; 
using Arpack;

# set random seed for reproducibility
import Random;
Random.seed!(42);

# generate a real symmetric array randomly (as a substitute for the Laplacian matrix)
A = rand(5,5) .- 0.5;
A_symm = A + transpose(A);
# the Python implementation's approach: smallest-magnitude eigenpairs of A_symm
(a,b) = eigs(A_symm, nev=4, which=:SM)
# the Julia implementation's approach: largest-magnitude eigenpairs of I - A_symm
(c,d) = eigs(LinearAlgebra.I(5) - A_symm, nev=4, which=:LM)
julia> A
5×5 Array{Float64,2}:
  0.033183    0.473566    0.437466   -0.116509     0.11205
 -0.0459709  -0.19613    -0.339994    0.0929116   -0.289115
 -0.482313   -0.323091   -0.0770437   0.245181    -0.332831
 -0.327067    0.456916    0.102298   -0.237191    -0.00291872
  0.458926    0.0842841  -0.136542    0.00295211   0.469432

julia> a
4-element Array{Float64,1}:
  0.049761059533001264
 -0.19586381502903097
  0.3921556706756817
  1.4062548361427225

julia> b
5×4 Array{Float64,2}:
  0.66291   0.467715    0.191037  -0.435195
 -0.129192  0.236879    0.743921  -0.0888448
  0.125068  0.654434   -0.504996   0.328193
 -0.613646  0.53921     0.171719   0.137297
 -0.389431  0.0780993  -0.354369  -0.822284

julia> c
4-element Array{Float64,1}:
 2.6678084094664634
 1.1958638150290308
 0.9502389404669987
 0.6078443293243183

julia> d
5×4 Array{Float64,2}:
  0.340446  0.467715   -0.66291   -0.191037
 -0.604885  0.236879    0.129192  -0.743921
 -0.439707  0.654434   -0.125068   0.504996
  0.533247  0.53921     0.613646  -0.171719
 -0.201287  0.0780993   0.389431   0.354369

srcoulombe commented 3 years ago

Here's the same example using Python.

Python (3.8.5):

import numpy as np
from scipy.sparse.linalg import eigsh
# the same symmetric matrix as A_symm in the Julia example above
A = np.array([  [0.066366, 0.427595, -0.0448469, -0.443576, 0.570976],
                [0.427595, -0.392261, -0.663085, 0.549827, -0.204831],
                [-0.0448469, -0.663085, -0.154087, 0.347479, -0.469372],
                [-0.443576, 0.549827, 0.347479, -0.474382, 3.33882e-5],
                [0.570976, -0.204831, -0.469372, 3.33882e-5, 0.938864]])
# the Python implementation's approach: smallest-magnitude eigenpairs of A
a,b = eigsh(A, k=4, which="SM")
# the Julia implementation's approach: largest-magnitude eigenpairs of I - A
c,d = eigsh(np.eye(5)-A, k=4, which="LM")
>>> a
array([-0.19586357,  0.04976102,  0.39215533,  1.4062552 ])

>>> b
array([[-0.46771575,  0.66290901,  0.19103711,  0.43519556],
       [-0.23687922, -0.12919262,  0.74392089,  0.08884466],
       [-0.65443401,  0.1250675 , -0.50499681, -0.32819304],
       [-0.53920954, -0.613647  ,  0.17171874, -0.137297  ],
       [-0.07809869, -0.38943084, -0.35436907,  0.8222836 ]])

>>> c
array([0.60784467, 0.95023898, 1.19586357, 2.66780798])

>>> d
array([[ 0.19103711, -0.66290901,  0.46771575,  0.34044569],
       [ 0.74392089,  0.12919262,  0.23687922, -0.60488498],
       [-0.50499681, -0.1250675 ,  0.65443401, -0.43970649],
       [ 0.17171874,  0.613647  ,  0.53920954,  0.53324725],
       [-0.35436907,  0.38943084,  0.07809869, -0.20128702]])

srcoulombe commented 3 years ago

OK, the Python code also seems to have an issue with the ordering of the r eigenpairs. The example above shows that eigsh(A, k=4, which="SM") does not order the eigenpairs according to the definition in the paper. This could be problematic, since it could lead to a mismatch in the eigenvector pairing used when computing the distance metric.

This comment made me double-check the documentation for scipy.sparse.linalg.eigsh, and it seems that using the keyword argument return_eigenvectors=False is needed to guarantee the eigenvalues are returned in the expected order. But since both eigenvalues and eigenvectors are needed, something like the following could be used to satisfy the ordering requirement:

a,b = eigsh(A, k=4, which="SM")
# sort the eigenpairs by non-decreasing eigenvalue magnitude
sorted_indices = np.argsort(abs(a))
a = a[sorted_indices]
b = b[:,sorted_indices]

Keeping with the same example, this returns

>>> a
array([ 0.04976102, -0.19586357,  0.39215533,  1.4062552 ])

>>> b
array([[-0.66290901,  0.46771575, -0.19103711, -0.43519556],
       [ 0.12919262,  0.23687922, -0.74392089, -0.08884466],
       [-0.1250675 ,  0.65443401,  0.50499681,  0.32819304],
       [ 0.613647  ,  0.53920954, -0.17171874,  0.137297  ],
       [ 0.38943084,  0.07809869,  0.35436907, -0.8222836 ]])

quantum-man commented 3 years ago

I think I see your point. The matrix we use is, more precisely, the normalized Laplacian, which has the nice property that its eigenvalues satisfy 0 = μ_0 ≤ μ_1 ≤ … ≤ μ_{n−1} ≤ 2. Without this property (as with the matrix A above), there is a high chance of mixing things up, because of the negative eigenvalues and because LM and SM refer to the absolute magnitude of the eigenvalues.

But even for the normalized Laplacian, you are right: there are still cases where the two sets are not entirely consistent. Suppose μ_0 = 0, μ_1 = 0.1, …, μ_{n−1} = 1.95. While the leading eigenvalues (μ_0, μ_1, …) will be found by looking at the largest eigenvalues of the matrix I - L, the eigenvalue μ_{n−1} will also be caught, because 1 − μ_{n−1} = −0.95, which is pretty large in absolute value. I believe this doesn't happen often in real data, but I agree it's an issue. Since the Python code is more consistent with what's written in the paper, let's stick with the Python code for now, and I will go over the Julia code, which is in any case quite outdated. I don't quite recall why the Julia code employed this trick; perhaps an early version of the eigs function had an issue with "SM". (The Julia code actually came before the Python code.)

As for why the leading eigenvalues are used: this is described in the Supplementary Materials of the paper. It is related to random walks. Briefly, the 0 eigenvalue corresponds to the steady state, and the small eigenvalues correspond to slow decay modes that involve a mesoscopic number of nodes. The large eigenvalues correspond to rapid decay involving very few nodes, rather like high-frequency noise terms. All of this is captured by the localization of the corresponding eigenvectors. Thanks for pointing out the issue.
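
To illustrate that corner case concretely, here is a quick sketch (my own Python/scipy toy example, not the HiC-spector code). An even cycle graph is used because it is bipartite, so its normalized Laplacian attains the eigenvalue 2:

import numpy as np
from scipy.sparse.linalg import eigsh

# adjacency matrix of an even cycle; the normalized Laplacian's
# spectrum then spans the full range [0, 2]
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

# normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}; here D = 2I
L = np.eye(n) - W / 2.0

# what the paper describes: the 4 smallest eigenvalues of L
mu_sm = np.sort(eigsh(L, k=4, which="SM", return_eigenvectors=False))

# what the Julia code computes: the 4 largest-magnitude eigenvalues of I - L;
# mu = 0 and mu = 2 both give |1 - mu| = 1, so the top of the spectrum is caught too
mu_lm = np.sort(1 - eigsh(np.eye(n) - L, k=4, which="LM", return_eigenvectors=False))

print(mu_sm)   # all from the bottom of the spectrum
print(mu_lm)   # a mix of both ends, including mu = 2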

srcoulombe commented 3 years ago

Aaah yes, my random array example was poorly chosen. In that case, the ordering issue in the Python code is moot.

srcoulombe commented 3 years ago

I have another question that's unrelated to the difference between the Julia and Python implementations, so I'll put it in a separate issue.