deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

hicMergeMatrixBins changes the chr size #787

Open cgirardot opened 2 years ago

cgirardot commented 2 years ago

I just noticed that (raw) matrices merged with hicMergeMatrixBins (v3.7.2) have altered chromosome sizes. For example:

The initial 1K bin matrix has:

 hicInfo -m HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.2
File:   HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5
Size:   144,916
Bin_length: 1000
Sum of matrix:  211759338.0
Chromosomes:length: chr2L: 23513712 bp; chr2R: 25286936 bp; chr3L: 28110227 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp; <CUT for clarity>

After hicMergeMatrixBins-3.7.2 -o HiC_2-4h_R1_10K_3.7.2.h5 -nb 10 -m HiC_2-4h_R1_all-rep-merged.1Kb-bin-matrix.h5 :

hicInfo -m HiC_2-4h_R1_10K_3.7.2.h5 
# Matrix information file. Created with HiCExplorer's hicInfo version 3.7.2
File:   HiC_2-4h_R1_10K_3.7.2.h5
Size:   14,143
Bin_length: 10000
Sum of matrix:  210935523.0
Chromosomes:length: chr2L: 23510000 bp; chr2R: 25286936 bp; chr3L: 28110000 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp; <CUT for clarity>

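Notice chr2L (23513712 bp became 23510000 bp) and chr3L (28110227 bp became 28110000 bp), while the other chromosomes are unchanged. I fear this can lead to issues later.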
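To spot such changes without reading the two outputs by eye, the "Chromosomes:length" lines can be compared programmatically. The following is a minimal sketch, assuming the line format shown in the hicInfo output above; the helper names `parse_chrom_lengths` and `changed_chromosomes` are made up for illustration and are not part of HiCExplorer.

```python
import re


def parse_chrom_lengths(line):
    """Parse a hicInfo 'Chromosomes:length:' line into {chromosome: length in bp}.

    Assumes entries look like 'chr2L: 23513712 bp;' as in the output above.
    """
    return {name: int(size) for name, size in re.findall(r"(\S+): (\d+) bp", line)}


def changed_chromosomes(before, after):
    """Return {chromosome: (old_length, new_length)} for every length that differs."""
    return {
        chrom: (before[chrom], after[chrom])
        for chrom in before
        if chrom in after and before[chrom] != after[chrom]
    }


# Lines copied from the two hicInfo outputs in this post.
line_1kb = ("Chromosomes:length: chr2L: 23513712 bp; chr2R: 25286936 bp; "
            "chr3L: 28110227 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp;")
line_10kb = ("Chromosomes:length: chr2L: 23510000 bp; chr2R: 25286936 bp; "
             "chr3L: 28110000 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrM: 19524 bp;")

diff = changed_chromosomes(parse_chrom_lengths(line_1kb), parse_chrom_lengths(line_10kb))
print(diff)
# {'chr2L': (23513712, 23510000), 'chr3L': (28110227, 28110000)}
```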

sebastian-gregoricchio commented 2 years ago

I confirm that this issue gives rise to problems when comparing matrices (hicCompareMatrices) that have been generated by bin merging.

(version 3.7.2)

cgirardot commented 2 years ago

@sebastian-gregoricchio I suspected as much (I am pretty sure I had this problem before). How did you solve this?

sebastian-gregoricchio commented 2 years ago

@cgirardot Actually, I had to regenerate the matrix from the beginning, directly at the desired resolution. Before, I was generating a 5kb matrix and then merging the bins to get 20kb, 40kb and 100kb matrices. I wanted to subtract matrices from 2 different conditions at 40kb and it was not working. So I generated the 40kb matrices directly, and then it worked.

cgirardot commented 2 years ago

@sebastian-gregoricchio I see, thanks. I would probably have tried to dump it and recreate it with cooler.

LeilyR commented 2 years ago

I believe that comes from some rounding in the last bin. I do not see how it causes issues downstream, though. Can you elaborate a bit on that?

cgirardot commented 2 years ago

Sorry, I don't have a concrete example to provide. It is weird anyway that some chromosomes are rounded and not others (see initial post). I think this should not happen and should be fixed if possible.

LeilyR commented 2 years ago

I labeled it, so we will have a deeper look at it.

sebastian-gregoricchio commented 2 years ago

> I believe that comes from some rounding in the last bin. I do not see how it causes issues downstream, though. Can you elaborate a bit on that?

Actually I do not have an exact message right now, but I did the following steps:

When doing the last step for the 40kb resolution matrices (obtained by bin merging), hicCompareMatrices returned an error like: The size of the chromosomes in file A differs from chr sizes in file B.

When I instead build the matrix directly at 40kb resolution from step 1, everything works fine.

cgirardot commented 2 years ago

This sounds very familiar. I might even have mentioned this in a previous issue.

KaurKaram commented 1 year ago

I am also getting the same error:

The two matrices have different chromosome order. Use the tool hicAdjustMatrix to change the order.
Merge1.cool: odict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'])
Merge4.cool: odict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'])

What is the solution for this?

lldelisle commented 1 year ago

Hi, if you are working with cool files, my solution would be to recreate the cool file. Here are the lines:

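# Rebuild the .cool file so that its bins are derived from the chromosome sizes file.
bin=40  # target resolution in kb
chromSizeFile="yourgenome.fa.fai"
inputMatrix="whatever.cool"
outputMatrix="whatever_fixed.cool"
# Dump the pixels in bg2 format and reload them onto a fresh ${bin} kb binning.
cooler dump --join "${inputMatrix}" | cooler load --format bg2 "${chromSizeFile}:${bin}000" - "${outputMatrix}"

u-n-i-v-e-r-z commented 1 year ago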

Hi everyone,

It seems that the change in chromosome length after lowering the resolution (merging bins) comes from these lines (248-249) of hicMergeMatrixBins.py:

if count < num_bins / 2:
    log.debug("{} has few bins ({}). Skipping it\n".format(prev_ref, count))

This check applies when reaching the last bins of a given chromosome. If the number of remaining bins is lower than half the desired number of bins to merge, those bins are simply discarded and the end of the chromosome becomes the end of the last merged bin.

For example, let's say you have a 1kb matrix and you want to obtain a 25kb resolution one.

The iterations will merge the first 600 x 25kb bins (= 15,000,000 bp). Once this point is reached, the tool tries to merge the remaining 8 kb (8 bins, since the initial resolution is 1kb), but since 8 < num_bins/2 (= 25/2 = 12.5), it discards them. In the end, the chromosome length becomes the end of the last merged bin, i.e. 15,000,000 bp.
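To check this explanation against the numbers in the initial post, here is a minimal sketch of the rule in Python. It is a simplified model of the behaviour described above, not the HiCExplorer code: the function name `merged_chrom_length` is made up for illustration, and it assumes fixed-size input bins where only the last bin of a chromosome may be shorter.

```python
import math


def merged_chrom_length(chrom_length, bin_size, num_bins):
    """Predict the chromosome length reported after merging `num_bins` bins.

    Simplified model (an assumption, not the actual implementation):
    leftover bins at the chromosome end are dropped when there are fewer
    than num_bins / 2 of them, otherwise they form a last, shorter bin.
    """
    n_bins = math.ceil(chrom_length / bin_size)       # bins in the input matrix
    full_groups, leftover = divmod(n_bins, num_bins)  # complete merged bins + tail
    if leftover == 0 or leftover >= num_bins / 2:
        return chrom_length                           # tail kept, length unchanged
    return full_groups * num_bins * bin_size          # tail dropped, length truncated


# Chromosome sizes of the 1 kb matrix from the initial post, merged with -nb 10.
sizes_1kb = {"chr2L": 23513712, "chr2R": 25286936, "chr3L": 28110227,
             "chr3R": 32079331, "chr4": 1348131, "chrM": 19524}
for chrom, length in sizes_1kb.items():
    print(chrom, length, merged_chrom_length(length, 1000, 10))
```

Under these assumptions, only chr2L (4 leftover bins) and chr3L (1 leftover bin) are truncated, to 23510000 and 28110000 bp, which matches the hicInfo output of the merged matrix. The 25kb example above (a chromosome of 15,008,000 bp) gives 15,000,000 bp.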

@LeilyR @lldelisle If one wants to keep all the bins, do you think it is safe to bypass this filter and use all the remaining bins even if there are fewer than num_bins/2? If so, could it be made available as an option, e.g. --keepLastBin?

Best,

A