ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License
60 stars 10 forks source link

Small bug in utillities/preprocess.py + solution, 'chromosome not found in the file' error. #67

Closed lauhinojosa closed 1 year ago

lauhinojosa commented 1 year ago

Hello,

When using utilities/preprocess.py to format .hic files, I ran into the following issue:

Description of the Issue

python preprocess.py -input hic -file sample1.hic -res 1000 -prefix preprocessed/sample1 -removeChr chrm,chry

The output was:

- These are the resolutions of your file: [5000000, 2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000]
 - This is the genome of your file: mm10.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes:
  - chrchr1
not found in the file.

Additionally, the output bed file did not have the correct format.

chrchr1 0       1000    1
chrchr1 1000    2000    2
chrchr1 2000    3000    3
chrchr1 3000    4000    4
chrchr1 4000    5000    5
chrchr1 5000    6000    6
chrchr1 6000    7000    7
chrchr1 7000    8000    8
chrchr1 8000    9000    9
chrchr1 9000    10000   10

Solution

The incorrect bed format occurs because the script appends 'chr' to the chromosome name, expecting the chromosome names in the .hic file to be '1,2,3...etc' instead of 'chr1,chr2... etc'. I solved this by changing line 154 to:

bedfile.write(chrs[a] + "\t" + (str(posIterator-res)) + "\t" + str(posIterator) + "\t" + str(iterator) + "\n") #removed 'chr'

Likewise, the not found error occurs because the script uses the chromosome number (and not the chromosome name) in the function hicstraw.straw() in line 175. Or, in this case, the empty str that results from chrNum = chr.split("chr")[1] in line 173.

I changed line 175 to:

result = hicstraw.straw("observed", 'NONE', results.file, chr, chr, 'BP', res)

If anyone else is having this issue, these 2 changes suffice to solve it. A future version of this script should consider different naming schemes.

Thank you,

Laura Hinojosa

ay-lab commented 1 year ago

Thanks a lot for raising the issue and providing the solution.

tolender commented 1 year ago

I was still getting this error (chrchr1) after making these changes to the code at lines 154, 175. I found that changing line 131 resolved the problem:

chrTag = "chr" + chrom.name into chrTag = chrom.name