Closed mxw010 closed 3 years ago
Hi,
Thanks for reporting this and providing a patch for this issue. I will consider it in the next release of HiCExplorer.
Best,
Joachim
thank @mxw010 for providing a work around, i temporarily solved this problem by adding this skip line
edit: sorry, but i found the problem still exists
@mxw010 What reference genome is this? hg19 is bigger, and hg38 matches for the most chromosomes, but chromosomes 4 and X are not correct. (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/#/st)
For example hg19: For chromosome 1 you have 248956422
vs. how it should be: 249250621
hg38: For chromosome 4 you have 190000000
vs. how it should be: 190,214,555
; same for chrX: You have 156000000
but it should be 156,040,895
.
Can it be you created your matrix without a correct chromosome sizes file?
@TTTPOB Please provide more information why it fails for you. What reference genome do you use, what are your chromosome sizes, have you used a chromosome size file to create the matrix, and what exact data you use for the correlation?
(sorry for not providing needed information in time, I am busy with some other project these days
@joachimwolff I did map to hg38, and the interaction matrix were binned with cooler/cooltools and then converted to h5 with HiCExplorer. The problem is not that the genome size file was wrong (I just checked: my reference file and genome.size were both correct), but hicPCA infers genome size from the interaction matrix. If, for one reason or another, that information isn't in the interaction matrix, or some bins were removed for some reason (in my data), then we would run into that specific problem. In retrospect, perhaps adding a warning is simply enough for us to know where it went wrong.
the genome.size I used in creating interaction matrix were:
chr1 248956422 chr2 242193529 chr3 198295559 chr4 190214555 chr5 181538259 chr6 170805979 chr7 159345973 chr8 145138636 chr9 138394717 chr10 133797422 chr11 135086622 chr12 133275309 chr13 114364328 chr14 107043718 chr15 101991189 chr16 90338345 chr17 83257441 chr18 80373285 chr19 58617616 chr20 64444167 chr21 46709983 chr22 50818468 chrX 156040895 chrY 57227415
As mentioned in #716, when correcting for the signs for eigenvectors using a gene track, hicPCA back calculates size of the genome based on bins sometime, instead of the actual size of the chromosomes. For example, in my data, when hicPCA loads the interaction matrix, the size of the chromosomes are:
This is fine for most chromosomes, but chr4 and chrX both have genes out of the range of the chromosomes, resulting in the same error as seen in #716:
Imo, this can simply be fixed by skipping lines from the gene reference file that are out of range, in function correlateEigenvectorWithGeneTrack:
This shouldn't have a big impact on the sign (unless there is a major problem), and would fix the bug.