ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License

My input .hic file uses chr1, chr2, chr3, ... instead of 1, 2, 3, ..., which causes an error when I preprocess the .hic files for running dcHiC #76

Closed asgda closed 1 year ago

asgda commented 1 year ago

My Error:

- These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chrchr1
 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chrchr1
 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chrchr1
 not found in the file.

I have 3 .hic files that I want to preprocess to run dcHiC. I checked the examples and found that their .hic files name chromosomes with plain numbers (1, 2, 3, ...) rather than chr1, chr2, chr3, ..., whereas my .hic files use the chr1, chr2, chr3 naming. The code I am using is:

#!/bin/bash

#pip install hic-straw ##install hic-straw for the preprocessing

python preprocess.py -input hic -file diff_inter_30.ext.hic -res 100000 -prefix hicfile_diff
python preprocess.py -input hic -file trf2_si_inter_30.ext.hic -res 100000 -prefix hicfile_trf2si
python preprocess.py -input hic -file un_inter_30.ext.hic -res 100000 -prefix hicfile_un

Do we need to make some changes in the preprocess.py file? Please let me know about it at the earliest. We have a project at hand which we need to complete asap.

Thanks in advance!

ay-lab commented 1 year ago

Just to confirm, are you using the latest version of the code? There was an update pushed to this script about a month ago that should have fixed this.

If not, could you print out chrom.name on line 131 (of the new script: the first line of the else statement)? I suspect the issue might be e.g. "Chr1" since chrom.name.lower() is not applied uniformly in that code chunk. If you do see something like that, can you change the line to chrTag = "chr" + chrom.name if "chr" not in chrom.name.lower() else chrom.name.lower() and let me know how that goes?
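The suggested one-line change can be tried in isolation. The following is a minimal sketch, not the actual preprocess.py code: the helper names naive_chr_tag and make_chr_tag are hypothetical, and the idea that the older behaviour prepended "chr" unconditionally is an assumption inferred only from the chrchr1 output reported above.

```python
def naive_chr_tag(name: str) -> str:
    # Assumed older behaviour (hypothetical): always prepend "chr",
    # which would turn an already-prefixed "chr1" into "chrchr1".
    return "chr" + name


def make_chr_tag(name: str) -> str:
    # Same expression as the suggested replacement line.
    # The conditional expression binds last, so this reads as:
    #   ("chr" + name) if "chr" not in name.lower() else name.lower()
    return "chr" + name if "chr" not in name.lower() else name.lower()


# Compare both helpers on a few illustrative chromosome names.
for raw in ["1", "chr1", "Chr1", "X"]:
    print(raw, naive_chr_tag(raw), make_chr_tag(raw))
```

Lowercasing in the else branch means a mixed-case name such as "Chr1" comes out as "chr1", which may or may not match how the file itself is indexed.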

asgda commented 1 year ago

Thank you for the reply, sir.

I used the updated script, and the bed file is now created correctly, but the matrix file creation still fails. The error I get after using the updated preprocess.py file is:

- These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.

Next, I also edited line 131 of the updated file, and I still get the same error:

- These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.
 - These are the resolutions of your file: [2500000, 1000000, 500000, 250000, 100000, 50000, 25000, 10000, 5000, 2000, 1000, 500, 200, 100]
 - This is the genome of your file: hg38.
 - Removing these chromosomes:
  - All (.hic file artifact removed by default)
 - Creating bed file
 - Bed File Creation Done.
 - Processing These Chromosomes: 
  - chr1
1 not found in the file.

The code I am using:

python /home/cgntlab1/work/igib/hic_shuvra/G4_project/comp_analysis/dcHiC/utility/preprocess.py -input hic -file diff_inter_30.ext.hic -res 10000 -prefix hicfile_diff
python /home/cgntlab1/work/igib/hic_shuvra/G4_project/comp_analysis/dcHiC/utility/preprocess.py -input hic -file trf2_si_inter_30.ext.hic -res 10000 -prefix hicfile_trf2si
python /home/cgntlab1/work/igib/hic_shuvra/G4_project/comp_analysis/dcHiC/utility/preprocess.py -input hic -file un_inter_30.ext.hic -res 10000 -prefix hicfile_un

Can you please let me know what I am doing wrong here?

Thank you!

ay-lab commented 1 year ago

Ah, okay, I think I see the issue. Since the .hic file is indexed with "chr1, chr2, ...", you'll want to change line 172 from chrNum = chr.split("chr")[1] to chrNum = chr. I think this should be all, but please let me know if there are other issues.
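To illustrate why the split fails on a file indexed with the "chr" prefix, here is a minimal sketch. It is not part of preprocess.py: split_chr_num, resolve_query_name, and the two example name sets are hypothetical, and the lookup is modelled as simple set membership rather than a real query against a .hic file.

```python
def split_chr_num(chr_tag: str) -> str:
    # Expression quoted from line 172: drops the "chr" prefix.
    return chr_tag.split("chr")[1]


def resolve_query_name(chr_tag: str, names_in_file: set) -> str:
    """Pick whichever spelling of the chromosome the file actually uses."""
    stripped = chr_tag[3:] if chr_tag.lower().startswith("chr") else chr_tag
    for candidate in (chr_tag, stripped):
        if candidate in names_in_file:
            return candidate
    raise KeyError(f"{chr_tag} not found in the file.")


# Hypothetical name sets: one prefixed like the reporter's file,
# one numeric like the example files mentioned in the thread.
prefixed_file = {"All", "chr1", "chr2"}
numeric_file = {"All", "1", "2"}

# The split yields the bare number, which prefixed_file does not contain.
print(split_chr_num("chr1"), split_chr_num("chr1") in prefixed_file)
print(resolve_query_name("chr1", prefixed_file))
print(resolve_query_name("chr1", numeric_file))
```

Trying both spellings is one way a script could avoid a hand edit for each naming convention; the one-line change above is the simpler fix for a file known to use the prefix.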

Also, just for my own curiosity as I update the script, could you also tell me what you get when you add print(chrom.name) before line 131?

asgda commented 1 year ago

Thank you so much for the solution, sir. It worked like a charm.

Also, print(chrom.name) before line 131 gives "All".
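For reference, the log above reports that the "All" entry is removed by default as a .hic file artifact. A minimal sketch of that kind of filter, assuming the chromosome names are available as a plain list (drop_all_entry and the example list are hypothetical, not taken from preprocess.py):

```python
def drop_all_entry(names):
    # Keep every name except the "All" entry, compared case-insensitively.
    return [n for n in names if n.lower() != "all"]


# Hypothetical list of names, with "All" first as reported in this thread.
names = ["All", "chr1", "chr2", "chrX"]
print(drop_all_entry(names))
```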