deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
233 stars 70 forks source link

hicConvertFormat 2D-text to cool generates spurious pairs #808

Open leomorelli opened 2 years ago

leomorelli commented 2 years ago

Hi there, I'm observing a weird behavior of the HicConvertFormat function, when I'm using a 2d-text bed file as input. Here it is the command: hicConvertFormat -m input.bed -o output.cool --resolutions 10000 --chromosomeSizes hg19_chr_sizes.txt --inputFormat 2D-text --outputFormat cool

1. input .bed file characteristics

The input bed file contains 36526 number of connected pairs and it is formatted as follows : chr_a | start_a | end_a | chr_b | start_b | end_b | contacts This is the output of head input.bed:

chr1 710000 720000 chr1 750000 760000 17 chr1 940000 950000 chr1 990000 1000000 15 chr1 940000 950000 chr1 1180000 1190000 10 chr1 940000 950000 chr1 1250000 1260000 11

2. output .cool file characteristics

The conversion proceeds without errors, but when I inspect the data, I obtain a dataframe of 68640 connected pairs, in which some spurious pairs are incorporated: c = cooler.Cooler(filepath) c.pixels(join=True)[:]

chrom1 | start1 | end1 | chrom2 | start2 | end2 | count chr1 710000 720000 chr1 750000 760000 17.0 chr1 720000 730000 chr1 760000 770000 17.0 ← spurious pair chr1 940000 950000 chr1 990000 1000000 15.0 chr1 940000 950000 chr1 1180000 1190000 10.0 chr1 940000 950000 chr1 1250000 1260000 11.0

3. set of bins

In order to understand if some pairs were also lost during the conversion I generated two set of bins: bed_bins ad cool_bins and I compared them:

3.1 bed_bins

df_bed['start']=df_bed['chr1']+'-'+df_bed['s1'].astype(str)+'-'+df_bed['e1'].astype(str) df_bed['end']=df_bed['chr2']+'-'+df_bed['s2'].astype(str)+'-'+df_bed['e2'].astype(str) bed_bins=set(df_bed['start'])|set(df_bed['end']) len(bed_bins)

26760

3.2 cool_bins

df_cool=c.pixels(join=True)[:] df_cool['start']=df_cool['chrom1'].astype(str)+'-'+df_cool['start1'].astype(str)+'-'+df_cool['end1'].astype(str) df_cool['end']=df_cool['chrom2'].astype(str)+'-'+df_cool['start2'].astype(str)+'-'+df_cool['end2'].astype(str) cool_bins=set(df_cool['start'])|set(df_cool['end']) len(cool_bins)

41159

3.3 comparison between sets

len(bed_bins-cool_bins)

0 ← 0 bin lost

len(cool_bins-bed_bins)

14399 ← newly generated bins

Any help you can provide would be greatly appreciated! Thanks a lot.

lldelisle commented 2 years ago

Hi, I think the issue is that the software consider that the interval: chr1 710000 720000 both match to the bin chr1 710000 720000 and the bin chr1 720000 730000, therefore it duplicates it. Indeed trying with you first contacts, using hicConvertFormat I got:

$ cooler dump --join output.cool 
chr1    710000  720000  chr1    750000  760000  17
chr1    720000  730000  chr1    760000  770000  17
chr1    940000  950000  chr1    990000  1000000 15
chr1    940000  950000  chr1    1180000 1190000 10
chr1    940000  950000  chr1    1250000 1260000 11
chr1    950000  960000  chr1    1000000 1010000 15
chr1    950000  960000  chr1    1190000 1200000 10
chr1    950000  960000  chr1    1260000 1270000 11

A solution waiting for this bug to be fixed is to use cooler:

$ cooler load -f bg2 hg19_chr_sizes.txt:10000 input.bed test.cool
$ cooler dump --join test.cool 
chr1    710000  720000  chr1    750000  760000  17
chr1    940000  950000  chr1    990000  1000000 15
chr1    940000  950000  chr1    1180000 1190000 10
chr1    940000  950000  chr1    1250000 1260000 11
leomorelli commented 2 years ago

Hi, thank you so much, I would definitely give it a try!

lldelisle commented 2 years ago

@joachimwolff The issue is here: https://github.com/deeptools/HiCExplorer/blob/20be92139dcfc6b4b215926b7fd8f4c7f6f48165/hicexplorer/hicConvertFormat.py#L210-L213 I think you were not expecting to find 2 bins for a region and you may consider using end-1.