deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
233 stars 70 forks source link

hicPCA using Gene Track #719

Closed mxw010 closed 3 years ago

mxw010 commented 3 years ago

As mentioned in #716, when correcting for the signs for eigenvectors using a gene track, hicPCA back calculates size of the genome based on bins sometime, instead of the actual size of the chromosomes. For example, in my data, when hicPCA loads the interaction matrix, the size of the chromosomes are:

pMatrix.get_chromosome_sizes()   
OrderedDict([('chr1', 248956422), ('chr2', 242193529), ('chr3', 198295559), ('chr4', 190000000), ('chr5', 181538259), ('chr6', 170805979), ('chr7', 159345973), ('chr8', 145138636), ('chr9', 138394717), ('chr10', 133797422), ('chr11', 135086622), ('chr12', 133275309), ('chr13', 114364328), ('chr14', 107000000), ('chr15', 101991189), ('chr16', 90338345), ('chr17', 83257441), ('chr18', 80373285), ('chr19', 58617616), ('chr20', 64444167), ('chr21', 46709983), ('chr22', 50818468), ('chrX', 156000000)])

This is fine for most chromosomes, but chr4 and chrX both have genes out of the range of the chromosomes, resulting in the same error as seen in #716:

ERROR:hicmatrix.HiCMatrix:Index error
Traceback (most recent call last):
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicmatrix/HiCMatrix.py", line 270, in getRegionBinRange
    endbin = sorted(self.interval_trees[chrname][endpos:endpos + 1])[0].data
IndexError: list index out of range
Traceback (most recent call last):
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/bin/hicPCA", line 7, in <module>
    main()
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicexplorer/hicPCA.py", line 337, in main
    vecs_list = correlateEigenvectorWithGeneTrack(ma, vecs_list, args.extraTrack)
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicexplorer/hicPCA.py", line 161, in correlateEigenvectorWithGeneTrack
    gene_occurrence[bin_id[1]] += 1
TypeError: 'NoneType' object is not subscriptable

Imo, this can simply be fixed by skipping lines from the gene reference file that are out of range, in function correlateEigenvectorWithGeneTrack:

for interval in bed:
    chromosome_name = interval.chromosome
    if chromosome_name not in chromosome_list:
        continue
    #skip lines out of range:
    if interval.start > pMatrix.get_chromosome_sizes()[chromosome_name]:
        continue
    # in which bin of the Hi-C matrix is the given gene?
    bin_id = pMatrix.getRegionBinRange(interval.chromosome,
                                   interval.start, interval.end)
    # add +1 for one gene occurrence in this bin
    gene_occurrence[bin_id[1]] += 1

This shouldn't have a big impact on the sign (unless there is a major problem), and would fix the bug.

joachimwolff commented 3 years ago

Hi,

Thanks for reporting this and providing a patch for this issue. I will consider it in the next release of HiCExplorer.

Best,

Joachim

TTTPOB commented 3 years ago

thank @mxw010 for providing a work around, i temporarily solved this problem by adding this skip line

edit: sorry, but i found the problem still exists

joachimwolff commented 3 years ago

@mxw010 What reference genome is this? hg19 is bigger, and hg38 matches for the most chromosomes, but chromosomes 4 and X are not correct. (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/#/st)

For example hg19: For chromosome 1 you have 248956422 vs. how it should be: 249250621 hg38: For chromosome 4 you have 190000000 vs. how it should be: 190,214,555; same for chrX: You have 156000000 but it should be 156,040,895. Can it be you created your matrix without a correct chromosome sizes file?

joachimwolff commented 3 years ago

@TTTPOB Please provide more information why it fails for you. What reference genome do you use, what are your chromosome sizes, have you used a chromosome size file to create the matrix, and what exact data you use for the correlation?

TTTPOB commented 3 years ago

(sorry for not providing needed information in time, I am busy with some other project these days

mxw010 commented 3 years ago

@joachimwolff I did map to hg38, and the interaction matrix were binned with cooler/cooltools and then converted to h5 with HiCExplorer. The problem is not that the genome size file was wrong (I just checked: my reference file and genome.size were both correct), but hicPCA infers genome size from the interaction matrix. If, for one reason or another, that information isn't in the interaction matrix, or some bins were removed for some reason (in my data), then we would run into that specific problem. In retrospect, perhaps adding a warning is simply enough for us to know where it went wrong.

the genome.size I used in creating interaction matrix were:

chr1 248956422 chr2 242193529 chr3 198295559 chr4 190214555 chr5 181538259 chr6 170805979 chr7 159345973 chr8 145138636 chr9 138394717 chr10 133797422 chr11 135086622 chr12 133275309 chr13 114364328 chr14 107043718 chr15 101991189 chr16 90338345 chr17 83257441 chr18 80373285 chr19 58617616 chr20 64444167 chr21 46709983 chr22 50818468 chrX 156040895 chrY 57227415