deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

chicViewpoint hangs indefinitely when reference file is a BED file with (non-unique) gene names in 4th column #880

Open mtekman opened 7 months ago

mtekman commented 7 months ago

Main command

version: hicexplorer==3.7.3 (micromamba install)

chicViewpoint -m 3_hicmatrices/*_matrix.h5 \
                --averageContactBin 5 --range 1000000 1000000 \
                -rp <reference_points.bed> \
                -bmf 4_background_model/bg.txt \
                --outFileName 5_calc_all_interactions/all_interactions.ugenes.hdf5 --fixateRange 500000 --threads 30

Problem

If the file passed to -rp is a BED file whose 4th column (the gene name) contains non-unique values, chicViewpoint hangs.

Example 1: with a reference point BED file containing these entries, it hangs indefinitely at 100% CPU, running for several days until killed:

11      108810997       108811117       Axin2
11      108919925       108920044       Axin2
16      45044224        45044344        Btla
16      45044616        45044736        Btla
8       107329854       107329974       Cdh1

Example 2: with a reference point BED file containing these entries, where every name in the 4th column is unique, it runs near instantly without problems:

11      108810997       108811117       Axin2-1
11      108919925       108920044       Axin2-2
16      45044224        45044344        Btla-1
16      45044616        45044736        Btla-2
8       107329854       107329974       Cdh1-1

(generated via awk '{gmap[$4] = gmap[$4] + 1; print $0"-"gmap[$4];}' <original_rp.bed>)
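For anyone who would rather check or fix the file from Python, here is a minimal sketch of the same workaround. The function names `find_duplicate_names` and `make_names_unique` are made up for this example and are not part of HiCExplorer. It assumes a whitespace-separated BED file with the name in the 4th column. Unlike the awk one-liner, which appends the counter to the end of the whole line, this version appends it to the 4th column itself and writes tab-separated output; for 4-column files like the ones above the names come out the same.

```python
from collections import Counter


def find_duplicate_names(bed_lines):
    """Return names (4th column) that occur more than once, in first-seen order."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return [name for name, n in counts.items() if n > 1]


def make_names_unique(bed_lines):
    """Append a running per-name counter to the 4th column (Axin2 -> Axin2-1, Axin2-2, ...)."""
    seen = Counter()
    out = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            # Leave blank or short lines untouched (apart from the trailing newline).
            out.append(line.rstrip("\n"))
            continue
        name = fields[3]
        seen[name] += 1
        fields[3] = f"{name}-{seen[name]}"
        out.append("\t".join(fields))
    return out
```

Usage would be something like reading the original file with `open(path).readlines()`, calling `find_duplicate_names` to see whether the file is affected, and writing the result of `make_names_unique` to a new BED file to pass to -rp.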

I think a short note about this in the help text of the --referencePoints parameter would be enough.

Cheers!