deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Why does summing cis interaction using cooler give a different result from hicInfo (after hicAdjustMatrix-3.7 -iih) #744

Closed: cgirardot closed this issue 3 years ago

cgirardot commented 3 years ago

Hi @joachimwolff

I am sorry for the naive question but I can't see what I am missing here.

I used to count the number of cis interactions using simple commands like (here the in.h5 is a plain raw matrix):

hicConvertFormat --matrices in.h5 -o in.cool --inputFormat h5 --outputFormat cool
cooler dump --join in.cool | awk '$1==$4 {print $0}' > in.cisOnly.bedpe
cut -f 7 in.cisOnly.bedpe | awk 'END { print s } { s += $1 }'

For a particular example matrix, this gives me 26,297,808.

If I now use the new hicAdjustMatrix-3.7 --interIntraHandling inter option on the same input matrix and do:

hicAdjustMatrix-3.7 -a keep --interIntraHandling inter -m in.h5 -o in.cisOnly.h5 --chromosomes chr2L chr2R chr3L chr3R chr4 chrX
hicInfo -m in.cisOnly.h5

I get: Sum of matrix: 20753151.0

I have checked that the same list of chromosomes is kept by both methods. I also converted my cis-only bedpe file back to h5 with:

gzip in.cisOnly.bedpe
cooler load --assembly dm6 -f bg2 dm6.chrom.sizes.short.txt:50000 in.cisOnly.bedpe.gz TEST.cisOnly.cool
hicConvertFormat --matrices TEST.cisOnly.cool -o TEST.cisOnly.h5 --inputFormat cool --outputFormat h5
hicInfo -m TEST.cisOnly.h5

I get: Sum of matrix: 20753151.0, i.e. the same as with the new hicAdjustMatrix-3.7 --interIntraHandling

Conclusion: there is obviously something wrong with summing up the 7th column of the bedpe, but why is it wrong?

Thank you for your help, and sorry again if this is obvious.

joachimwolff commented 3 years ago

Hi,

That one took me a while to figure out. You need to pass the --balanced parameter to cooler dump, as described in its documentation, to apply the weights; otherwise you get the raw data. Please note that column 7 contains the raw data and column 8 contains the corrected data. HiCExplorer always works with the correction factors applied, hence the different numbers. (https://cooler.readthedocs.io/en/latest/cli.html#cmdoption-cooler-dump-b)
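To make the column 7 versus column 8 distinction concrete, here is a minimal sketch that sums cis rows from `cooler dump --join --balanced` style output. The sample rows are hypothetical, and the handling of missing balanced values (empty field or `nan`) is an assumption about how such rows appear in the dump.

```python
import math

# Hypothetical rows in the joined layout:
# chrom1 start1 end1 chrom2 start2 end2 count balanced
SAMPLE = (
    "chr2L\t0\t50000\tchr2L\t0\t50000\t12\t0.5\n"
    "chr2L\t0\t50000\tchr2L\t50000\t100000\t5\t0.25\n"
    "chr2L\t0\t50000\tchr2R\t0\t50000\t3\t0.125\n"
    "chrX\t0\t50000\tchrX\t0\t50000\t7\t\n"
)


def sum_cis(lines, column):
    """Sum a 1-based column over rows where chrom1 == chrom2."""
    total = 0.0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[0] != fields[3]:
            continue  # skip malformed rows and trans interactions
        value = fields[column - 1] if len(fields) >= column else ""
        if value == "":
            continue  # assumed: missing balanced value is an empty field
        number = float(value)
        if math.isnan(number):
            continue  # assumed: missing balanced value may also be 'nan'
        total += number
    return total


raw_cis = sum_cis(SAMPLE.splitlines(), 7)       # raw counts
balanced_cis = sum_cis(SAMPLE.splitlines(), 8)  # corrected values
print(raw_cis, balanced_cis)
```

On a matrix that has never been corrected there are no weights to apply, so only the column 7 sum is meaningful.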

I will fix this in a future release.

Best,

Joachim

cgirardot commented 3 years ago

Hi @joachimwolff ,

thank you for looking into this.

The example I gave you actually uses an unmodified raw matrix (as produced by hicBuildMatrix), i.e. I really am summing up the raw counts. The observed difference is therefore matrix.diagonal().sum() / 2, and it is most likely due to the bug you reported. I will keep using my approach to compute the scaling factor between my matrices (derived from the sum of raw contacts).
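A toy illustration of how a gap of exactly half the diagonal sum can arise. This is only a sketch: it assumes the smaller number comes from halving the sum of the full symmetric matrix, which is one plausible mechanism and not a confirmed description of the pre-3.7.1 code. The matrix values are made up.

```python
# Hypothetical symmetric contact matrix (not data from this issue).
matrix = [
    [10, 2, 1],
    [2, 8, 3],
    [1, 3, 6],
]
n = len(matrix)


def upper_triangle_sum(m):
    """Sum each pixel once, diagonal included (what a pixel table stores)."""
    return sum(m[i][j] for i in range(n) for j in range(i, n))


def halved_full_sum(m):
    """Assumed mechanism: sum the full symmetric matrix, then divide by 2.

    Off-diagonal pixels are counted correctly, but each diagonal entry
    ends up counted at half weight.
    """
    return sum(sum(row) for row in m) / 2


diagonal_sum = sum(matrix[i][i] for i in range(n))
difference = upper_triangle_sum(matrix) - halved_full_sum(matrix)
print(difference, diagonal_sum / 2)
```

Under that assumption the two totals always differ by half the diagonal sum, whatever the off-diagonal values are.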

Best

cgirardot commented 3 years ago

Hi @joachimwolff, I am actually wondering whether this bug would also affect hicNormalize --normalize smallest? Thanks.

joachimwolff commented 3 years ago

No, in hicNormalize smallest we use the sum of the full matrix to compute the ratios. https://github.com/deeptools/HiCExplorer/blob/master/hicexplorer/hicNormalize.py#L71
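As a rough sketch of what "use the sum of the full matrix to compute the ratios" means, the snippet below derives one scaling factor per matrix from a set of full-matrix sums. The function name, file names and totals are hypothetical, and this is not the hicNormalize implementation itself.

```python
def smallest_scaling_factors(matrix_sums):
    """Return, per matrix, the factor that scales its total down to the smallest total."""
    smallest = min(matrix_sums.values())
    return {name: smallest / total for name, total in matrix_sums.items()}


# Hypothetical full-matrix sums for three samples.
sums = {
    "sample_a.h5": 40_000_000.0,
    "sample_b.h5": 20_000_000.0,
    "sample_c.h5": 25_000_000.0,
}

factors = smallest_scaling_factors(sums)
for name, factor in factors.items():
    print(name, factor)
```

The matrix with the smallest sum gets a factor of 1 and every other matrix is scaled down to match it.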

joachimwolff commented 3 years ago

The bug is fixed in 3.7.1.