deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

UnicodeDecodeError When running hicConvertFormat to convert HiC to Cool format #821

Open ashishjain1988 opened 1 year ago

ashishjain1988 commented 1 year ago

Welcome to the HiCExplorer GitHub repository! Before opening the issue, please check that the following requirements are met:

Retry your command, is it solved now? If not please continue with the following:

lldelisle commented 1 year ago

Hi, would you mind downloading the file we use in tests here and checking whether the following command works:

hicConvertFormat -m SRR1791297_30.hic --inputFormat hic --outputFormat cool -o test.cool

xiaohuli-45 commented 1 year ago

Hello, I have encountered the same problem, although SRR1791297_30.hic converted successfully with hic2cool. My .hic file was downloaded from ENCODE. Could you help me? Thank you very much!

lldelisle commented 1 year ago

Would you mind sharing the URL? Thanks

xiaohuli-45 commented 1 year ago

The URL is https://www.encodeproject.org/files/ENCFF080DPJ/@@download/ENCFF080DPJ.hic. Thanks

lldelisle commented 1 year ago

Hi, in fact this issue is the same as #798: the format behind .hic changed, and it seems that hic2cool has not been updated (see https://github.com/4dn-dcic/hic2cool/issues/60).
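
To see which version of the .hic format a given file uses, the header can be inspected directly. The following is a minimal sketch, assuming the header layout described in the .hic format specification (the magic string `HIC` with a null terminator, followed by a little-endian 32-bit integer version). The function name `read_hic_version` is hypothetical and is not part of hic2cool, hicstraw or HiCExplorer.

```python
import struct


def read_hic_version(path):
    """Return the integer format version stored in a .hic file header.

    Assumes the file starts with the 4-byte magic string b"HIC\\x00"
    followed by a little-endian int32 holding the version.
    """
    with open(path, "rb") as fh:
        magic = fh.read(4)
        if magic != b"HIC\x00":
            raise ValueError(f"{path} does not start with the .hic magic string")
        # "<i" = little-endian signed 32-bit integer
        (version,) = struct.unpack("<i", fh.read(4))
    return version
```

Calling `read_hic_version('ENCFF080DPJ.hic')` on a downloaded file returns the version number, which can then be compared with the versions the converter supports.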

lldelisle commented 1 year ago

While waiting for a better solution, this Python script works. It uses hicstraw (available on pip: https://pypi.org/project/hic-straw/) and cooler (https://cooler.readthedocs.io/en/latest/):


import numpy as np
import hicstraw
import os

hic_file = 'ENCFF080DPJ.hic'
cool_file = 'ENCFF080DPJ_250kb.cool'

data_type = 'observed' # (previous default / "main" data) or 'oe' (observed/expected)
normalization = "NONE"  # , VC, VC_SQRT, KR, SCALE, etc.
resolution = 250000

hic = hicstraw.HiCFile(hic_file)

assert resolution in hic.getResolutions(), \
    f"{resolution} is not part of the possible resolutions {','.join(hic.getResolutions())}"

# First write the chromosome sizes:
with open(hic.getGenomeID() + '.size', 'w') as fsize:
    for chrom in hic.getChromosomes():
        if chrom.name != "All":
            fsize.write(f"{chrom.name}\t{chrom.length}\n")
# Then write the counts in text file:
with open(cool_file.replace('.cool', ".txt"), 'w') as fo:
    for i in range(len(chrom_sizes)):
        for j in range(i, len(chrom_sizes)):
            chrom1 = chrom_sizes.index[i]
            chrom2 = chrom_sizes.index[j]
            result = hicstraw.straw(data_type, normalization, hic_file, chrom1, chrom2, 'BP', resolution)
            for k in range(len(result)):
                start1 = result[k].binX
                start2 = result[k].binY
                value = result[k].counts
                fo.write(f"{chrom1}\t{start1}\t{start1}\t{chrom2}\t{start2}\t{start2}\t{value}\n")

os.system(f"cooler load -f bg2 {hic.getGenomeID()}.size:{resolution} {cool_file.replace('.cool', '.txt')} {cool_file}")

The code above has a mistake (`chrom_sizes` is never defined); please use the corrected version further down.

xiaohuli-45 commented 1 year ago

Hi, I successfully converted the file format with your code. Thank you very much!

lldelisle commented 1 year ago

Glad it has been useful for someone. :wink:

LinearParadox commented 1 year ago

Hi,

I tried your code, but I'm getting a "chrom_sizes is not defined" error, as the variable does not seem to be declared anywhere.

lldelisle commented 1 year ago

Oops, indeed... I tried to simplify but made a mistake. Here is the corrected version:

import numpy as np
import hicstraw
import os
import pandas as pd

hic_file = 'ENCFF080DPJ.hic'
cool_file = 'ENCFF080DPJ_250kb.cool'

data_type = 'observed' # (previous default / "main" data) or 'oe' (observed/expected)
normalization = "NONE"  # , VC, VC_SQRT, KR, SCALE, etc.
resolution = 250000

hic = hicstraw.HiCFile(hic_file)

# Convert the resolutions to str so that join() can build the error message
assert resolution in hic.getResolutions(), \
    f"{resolution} is not part of the possible resolutions {','.join(map(str, hic.getResolutions()))}"

chrom_sizes = pd.Series({chrom.name: chrom.length for chrom in hic.getChromosomes() if chrom.name != "All"})

# First write the chromosome sizes:
with open(hic.getGenomeID() + '.size', 'w') as fsize:
    for chrom in hic.getChromosomes():
        if chrom.name != "All":
            fsize.write(f"{chrom.name}\t{chrom.length}\n")
# Then write the counts in text file:
with open(cool_file.replace('.cool', ".txt"), 'w') as fo:
    for i in range(len(chrom_sizes)):
        for j in range(i, len(chrom_sizes)):
            chrom1 = chrom_sizes.index[i]
            chrom2 = chrom_sizes.index[j]
            result = hicstraw.straw(data_type, normalization, hic_file, chrom1, chrom2, 'BP', resolution)
            # Each record holds the two bin start positions and the contact value
            for record in result:
                start1 = record.binX
                start2 = record.binY
                value = record.counts
                fo.write(f"{chrom1}\t{start1}\t{start1}\t{chrom2}\t{start2}\t{start2}\t{value}\n")

os.system(f"cooler load -f bg2 {hic.getGenomeID()}.size:{resolution} {cool_file.replace('.cool', '.txt')} {cool_file}")
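
The script above writes each bin's start coordinate into both the start and end columns of the bg2 text file. If the intermediate file also needs to be read by a tool that expects real bin ends, a small helper could compute them. This is only a sketch: `bg2_row` is a hypothetical name (not part of hicstraw or cooler), and it assumes that `binX` and `binY` are bin start positions in bp and that the last bin of a chromosome is clipped at the chromosome length.

```python
def bg2_row(chrom1, start1, chrom2, start2, value, resolution, chrom_sizes):
    """Format one bg2 line with explicit bin ends.

    chrom_sizes maps chromosome name -> length in bp; bin ends are
    clipped so they never run past the end of the chromosome.
    """
    end1 = min(start1 + resolution, chrom_sizes[chrom1])
    end2 = min(start2 + resolution, chrom_sizes[chrom2])
    return f"{chrom1}\t{start1}\t{end1}\t{chrom2}\t{start2}\t{end2}\t{value}\n"


# Illustrative sizes only, not taken from any real assembly
example_sizes = {"chrA": 1_100_000, "chrB": 600_000}
row = bg2_row("chrA", 1_000_000, "chrB", 250_000, 3.0, 250_000, example_sizes)
```

In the script, the `fo.write(...)` call could then become `fo.write(bg2_row(chrom1, start1, chrom2, start2, value, resolution, chrom_sizes))`, since `chrom_sizes` supports lookup by chromosome name.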

caragraduate commented 8 months ago

Hi there, thank you for providing the code above, which runs successfully in my case. But when I check the converted .cool file with the 'hicInfo' command, only the 'chrom', 'start' and 'end' columns are available. I do not see any 'weight' column or, as I would expect in my case, a 'SCALE' column.

Is this expected, or do you have any advice on how to deal with it?

Many thanks!