Open leomorelli opened 2 years ago
Hi,
I think the issue is that the software consider that the interval:
chr1 710000 720000
both match to the bin chr1 710000 720000
and the bin chr1 720000 730000
, therefore it duplicates it.
Indeed trying with you first contacts, using hicConvertFormat I got:
$ cooler dump --join output.cool
chr1 710000 720000 chr1 750000 760000 17
chr1 720000 730000 chr1 760000 770000 17
chr1 940000 950000 chr1 990000 1000000 15
chr1 940000 950000 chr1 1180000 1190000 10
chr1 940000 950000 chr1 1250000 1260000 11
chr1 950000 960000 chr1 1000000 1010000 15
chr1 950000 960000 chr1 1190000 1200000 10
chr1 950000 960000 chr1 1260000 1270000 11
A solution waiting for this bug to be fixed is to use cooler:
$ cooler load -f bg2 hg19_chr_sizes.txt:10000 input.bed test.cool
$ cooler dump --join test.cool
chr1 710000 720000 chr1 750000 760000 17
chr1 940000 950000 chr1 990000 1000000 15
chr1 940000 950000 chr1 1180000 1190000 10
chr1 940000 950000 chr1 1250000 1260000 11
Hi, thank you so much, I would definitely give it a try!
@joachimwolff The issue is here: https://github.com/deeptools/HiCExplorer/blob/20be92139dcfc6b4b215926b7fd8f4c7f6f48165/hicexplorer/hicConvertFormat.py#L210-L213 I think you were not expecting to find 2 bins for a region and you may consider using end-1.
Hi there, I'm observing a weird behavior of the HicConvertFormat function, when I'm using a 2d-text bed file as input. Here it is the command:
hicConvertFormat -m input.bed -o output.cool --resolutions 10000 --chromosomeSizes hg19_chr_sizes.txt --inputFormat 2D-text --outputFormat cool
1. input .bed file characteristics
The input bed file contains
36526
number of connected pairs and it is formatted as follows : chr_a | start_a | end_a | chr_b | start_b | end_b | contacts This is the output ofhead input.bed
:2. output .cool file characteristics
The conversion proceeds without errors, but when I inspect the data, I obtain a dataframe of
68640
connected pairs, in which some spurious pairs are incorporated:c = cooler.Cooler(filepath)
c.pixels(join=True)[:]
3. set of bins
In order to understand if some pairs were also lost during the conversion I generated two set of bins:
bed_bins
adcool_bins
and I compared them:3.1 bed_bins
df_bed['start']=df_bed['chr1']+'-'+df_bed['s1'].astype(str)+'-'+df_bed['e1'].astype(str)
df_bed['end']=df_bed['chr2']+'-'+df_bed['s2'].astype(str)+'-'+df_bed['e2'].astype(str)
bed_bins=set(df_bed['start'])|set(df_bed['end'])
len(bed_bins)
3.2 cool_bins
df_cool=c.pixels(join=True)[:]
df_cool['start']=df_cool['chrom1'].astype(str)+'-'+df_cool['start1'].astype(str)+'-'+df_cool['end1'].astype(str)
df_cool['end']=df_cool['chrom2'].astype(str)+'-'+df_cool['start2'].astype(str)+'-'+df_cool['end2'].astype(str)
cool_bins=set(df_cool['start'])|set(df_cool['end'])
len(cool_bins)
3.3 comparison between sets
len(bed_bins-cool_bins)
len(cool_bins-bed_bins)
Any help you can provide would be greatly appreciated! Thanks a lot.