Closed biozzq closed 3 years ago
Hi,
only in the way it is mentioned in the warning text.
Best,
Joachim
Dear @joachimwolff
Sorry, I think above description is not clear. You can see the input I used is an uncorrected cool file, and it was built at 50K bin size.
hicInfo -m FU-1_50000.cool
# Matrix information file. Created with HiCExplorer's hicInfo version 3.6
File: FU-1_50000.cool
Date: 2021-04-07T14:53:06.525815
Genome assembly: unknown
Size: 48,715
Bin_length: 50000
Chromosomes:length: ...
Number of chromosomes: 21
Non-zero elements: 20,195,544
The following columns are available: ['chrom' 'start' 'end']
Generated by: HiCMatrix-15
Cooler library version: cooler-0.8.10
HiCMatrix url: https://github.com/deeptools/HiCMatrix
Finally, i just want to confirm that the input of hicMergeMatrixBins
must be an uncorrected matrix built at restriction site resolution using hicBuildMatrix, is this right?
Best, Zheng zhuqing
input I used is an uncorrected cool file, and it was built at 50K bin size
For cool files, we assume if a row or a column is filled with zeros only, it is a NaN region. Based on this, hicMergeMatrixBins assumes it is a corrected matrix because NaN bins should appear only after a correction. But using cool files and sparser data, it might happen it is interpreted in the wrong way.
Finally, i just want to confirm that the input of hicMergeMatrixBins must be an uncorrected matrix built at restriction site resolution using hicBuildMatrix, is this right?
You can use uncorrected or corrected; the warning states that the correction factors are removed for a corrected matrix because they are not valid anymore. However, before the merge is applied, the data is loaded, and correction factors (if present) are applied. This means the corrected interaction values are merged, and the new matrix on a different resolution does not have the correction factors present anymore. Moreover, by merging corrected values, the correction properties (matrix balancing properties) are not valid anymore. It should not affect the results that much, but we print the warning to make the user aware of it and that the user can weigh this information on their own. Restriction cut site matrix: You can use hicMergeMatrixBins on a restriction cut site resolution matrix or a fixed bin-size one. Most Hi-C experiments do not have the read coverage to create a useful restriction cut site resolution matrix; therefore, we advise using bin-sized matrices.
Dear @joachimwolff
Yes, as you suggested, it's better to use the bin-sized matrices. Also, it's better to use the raw matrices, because when merging corrected interaction, the balancing properties of the matrix will change although it should not affect the downstream analysis that much.
More, the cool file may cause problem, thus I changed to use the h5 file, however, I found the following two pipelines are different. Note that, the validpair reads were generated by HiC-Pro. Pipeline 1: here, i think I need add the option "--correction_division" in step 4, is this right?
1. build_matrix --binsize 200000 --chrsizes $chrsize --ifile ${pre}.allValidPairs --oprefix ${pre}_200000 --progress
2. hicConvertFormat --inputFormat hicpro --outputFormat cool -m ${pre}_200000.matrix -bf ${pre}_200000_abs.bed -o ${pre}_200000.cool
3. hicCorrectMatrix correct --correctionMethod KR --matrix ${pre}_200000.cool --chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chrX --outFileName ${pre}_200000_KR_corrected.cool --perchr
4. hicConvertFormat -m ${pre}_200000_KR_corrected.cool --outFileName ${pre}_1.h5 --inputFormat cool --outputFormat h5 --correction_division
5. hicInfo -m ${pre}_1.h5 (here I just attached the basic information)
Size: 11,967
Bin_length: 200000
Sum of matrix: 364402449256.7308
Non-zero elements: 22,213,814
Minimum (non zero): 0.19032179296450497
Maximum: 113491489.74507481
NaN bins: 0
Pipeline 2:
1. build_matrix --binsize 50000 --chrsizes $chrsize --ifile ${pre}.allValidPairs --oprefix ${pre}_50000 --progress
2. hicConvertFormat --inputFormat hicpro --outputFormat h5 -m ${pre}_50000.matrix -bf ${pre}_50000_abs.bed -o myMatrix.h5
3. hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb4.h5 -nb 4
3. hicCorrectMatrix correct --correctionMethod KR --matrix myMatrix_merged_nb4.h5 --chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chrX --outFileName myMatrix_merged_nb4.KR.h5 --perchr
4. hicInfo -m myMatrix_merged_nb4.KR.h5
Size: 11,959
Bin_length: 200000
Sum of matrix: 40488933.31984645
Non-zero elements: 6,128,691
Minimum (non zero): 0.1922369922395801
Maximum: 7138.557963002519
NaN bins: 0
In my mind, the above two pipeline should be same, but not here. Meanwhile, when using the corrected h5 file generated by pipeline2 in the following analysis, such as hicPCA
, it will give me following error;
ERROR:hicmatrix.HiCMatrix:Index error
Traceback (most recent call last):
File "HiCMatrix.py", line 269, in getRegionBinRange
startbin = sorted(self.interval_trees[chrname][startpos:startpos + 1])[0].data
IndexError: list index out of range
Traceback (most recent call last):
File "hicPCA", line 7, in <module>
main()
File "hicPCA.py", line 337, in main
vecs_list = correlateEigenvectorWithGeneTrack(ma, vecs_list, args.extraTrack)
File "hicPCA.py", line 161, in correlateEigenvectorWithGeneTrack
gene_occurrence[bin_id[1]] += 1
TypeError: 'NoneType' object is not subscriptable
Thank you.
Best, Zheng zhuqing
Dear @joachimwolff
Yes, as you suggested, it's better to use the bin-sized matrices. Also, it's better to use the raw matrices, because when merging corrected interaction, the balancing properties of the matrix will change although it should not affect the downstream analysis that much.
More, the cool file may cause problem, thus I changed to use the h5 file, however, I found the following two pipelines are different. Note that, the validpair reads were generated by HiC-Pro.
Why should the cool file cause an issue? Both are natively supported by HiCExplorer.
Pipeline 1: here, i think I need add the option "--correction_division" in step 4, is this right?
1. build_matrix --binsize 200000 --chrsizes $chrsize --ifile ${pre}.allValidPairs --oprefix ${pre}_200000 --progress 2. hicConvertFormat --inputFormat hicpro --outputFormat cool -m ${pre}_200000.matrix -bf ${pre}_200000_abs.bed -o ${pre}_200000.cool 3. hicCorrectMatrix correct --correctionMethod KR --matrix ${pre}_200000.cool --chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chrX --outFileName ${pre}_200000_KR_corrected.cool --perchr 4. hicConvertFormat -m ${pre}_200000_KR_corrected.cool --outFileName ${pre}_1.h5 --inputFormat cool --outputFormat h5 --correction_division 5. hicInfo -m ${pre}_1.h5 (here I just attached the basic information) Size: 11,967 Bin_length: 200000 Sum of matrix: 364402449256.7308 Non-zero elements: 22,213,814 Minimum (non zero): 0.19032179296450497 Maximum: 113491489.74507481 NaN bins: 0
--correction_division
is wrong here, if you don't know why and how to use this parameter, it is better not to use it and rely on the default behavior. Also please consider the output of hicInfo
in this case: the maximum interaction height is 113491489.74507481
(aka 113 million) and the sum of all contacts is 364402449256.7308
(aka 364 billion) not that realistic, right? If you remove the --correction_division
parameter you will get values in a regular and to be expected value range.
Pipeline 2:
1. build_matrix --binsize 50000 --chrsizes $chrsize --ifile ${pre}.allValidPairs --oprefix ${pre}_50000 --progress 2. hicConvertFormat --inputFormat hicpro --outputFormat h5 -m ${pre}_50000.matrix -bf ${pre}_50000_abs.bed -o myMatrix.h5 3. hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb4.h5 -nb 4 3. hicCorrectMatrix correct --correctionMethod KR --matrix myMatrix_merged_nb4.h5 --chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chrX --outFileName myMatrix_merged_nb4.KR.h5 --perchr 4. hicInfo -m myMatrix_merged_nb4.KR.h5 Size: 11,959 Bin_length: 200000 Sum of matrix: 40488933.31984645 Non-zero elements: 6,128,691 Minimum (non zero): 0.1922369922395801 Maximum: 7138.557963002519 NaN bins: 0
As it is stated in our documentation, --correctionMethod KR
is a parameter only available for cool
files. https://hicexplorer.readthedocs.io/en/latest/content/tools/hicConvertFormat.html#Optional%20arguments To use it for h5 input will not have any effect.
In my mind, the above two pipeline should be same, but not here. Meanwhile, when using the corrected h5 file generated by pipeline2 in the following analysis, such as
hicPCA
, it will give me following error;
That's because of the wrong --correction_division
parameter. Please consider it might be the case they are after this still not 100% identical. Floating accuracy issues can cause slight differences.
ERROR:hicmatrix.HiCMatrix:Index error Traceback (most recent call last): File "HiCMatrix.py", line 269, in getRegionBinRange startbin = sorted(self.interval_trees[chrname][startpos:startpos + 1])[0].data IndexError: list index out of range Traceback (most recent call last): File "hicPCA", line 7, in <module> main() File "hicPCA.py", line 337, in main vecs_list = correlateEigenvectorWithGeneTrack(ma, vecs_list, args.extraTrack) File "hicPCA.py", line 161, in correlateEigenvectorWithGeneTrack gene_occurrence[bin_id[1]] += 1 TypeError: 'NoneType' object is not subscriptable
Without the command and some information about the data I cannot make any statement.
Best,
Joachim
--correction_division is wrong here, if you don't know why and how to use this parameter, it is better not to use it and rely on the default behavior. Also please consider the output of hicInfo in this case: the maximum interaction height is 113491489.74507481 (aka 113 million) and the sum of all contacts is 364402449256.7308 (aka 364 billion) not that realistic, right? If you remove the --correction_division parameter you will get values in a regular and to be expected value range.
After removing --correction_division parameter, the result is as following. From the interaction values, i think these values are the corrected values, because they are not integer.
Size: 11,967
Bin_length: 200000
Sum of matrix: 7547.083099245842
Non-zero elements: 22,213,814
Minimum (non zero): 3.073680381911719e-05
Maximum: 12.408349245109566
NaN bins: 0
As it is stated in our documentation, --correctionMethod KR is a parameter only available for cool files. https://hicexplorer.readthedocs.io/en/latest/content/tools/hicConvertFormat.html#Optional%20arguments To use it for h5 input will not have any effect.
From here (https://hicexplorer.readthedocs.io/en/latest/content/tools/hicCorrectMatrix.html#hiccorrectmatrix), I could not find any comments about hicCorrectMatrix correct --correctionMethod KR
is a parameter only available for cool files but not for h5 file. Sorry if I have missed someting.
Without the command and some information about the data I cannot make any statement.
hicMergeMatrixBins -m test.cool -o myMatrix.cool -nb 4 hicCorrectMatrix correct --correctionMethod KR --matrix myMatrix.cool --chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chrX --outFileName myMatrix.KR.cool --perchr hicPCA -m myMatrix.KR.cool -o 1_1_pca1.bw 1_1_pca2.bw --format bedgraph --extraTrack longest_transcript_sort.bed
When running above commands, it will give me above errors.
Best, Zheng zhuqing
From here (https://hicexplorer.readthedocs.io/en/latest/content/tools/hicCorrectMatrix.html#hiccorrectmatrix), I could not find any comments about hicCorrectMatrix correct --correctionMethod KR is a parameter only available for cool files but not for h5 file. Sorry if I have missed something.
hicCorrectMatrix != hicConvertMatrix, we try to not mention unnecessary implementation details, therefore it does not matter for the user of hicCorrectMatrix how the data is stored in the files; however in hicConvertMatrix it is covered because it might be necessary for some users.
hicPCA -m myMatrix.KR.cool -o 1_1_pca1.bw 1_1_pca2.bw --format bedgraph --extraTrack longest_transcript_sort.bed
Does longest_transcript_sort.bed
contain only the identical set of chromosomes as the interaction matrix? If not, please change this. Do the chromosome names follow the same standard (UCSC vs ensemble annotation?)
The output: You define for the files a bigwig 1_1_pca1.bw
, but declare with the format argument it should be a bedgraph?!
Dear @joachimwolff
Does longest_transcript_sort.bed contain only the identical set of chromosomes as the interaction matrix? If not, please change this. Do the chromosome names follow the same standard (UCSC vs ensemble annotation?) The output: You define for the files a bigwig 1_1_pca1.bw, but declare with the format argument it should be a bedgraph?!
Yes, the longest_transcript_sort.bed
contains the identical set of chromosomes as the myMatrix.KR.cool
.
$cut -f 1 longest_transcript_sort.bed | sort | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
When I remove --extraTrack
option, hicPCA -m myMatrix.KR.cool -o 1_1_pca1.bw 1_1_pca2.bw --format bedgraph
can run successfully. Thus, I have compared the merged balanced matrix (namely myMatrix.KR.cool
) and the unmerged balanced matrix, I found that the length of chromosomes changed in myMatrix.KR.cool
, such as the length of chr1 is 274330532 bp in the unmerged matrix, but it is 274300000 bp in myMatrix.KR.cool
. Thus, the positions of any genes in the longest_transcript_sort.bed
beyond 274300000 bp may cause above errors. Is this right?
Best, Zheng zhuqing
Dear all,
I found that
hicMergeMatrixBins
will give me following waring when operating on an uncorrected matrix built at specific resolution. Does this affect the result?WARNING:hicexplorer.hicMergeMatrixBins:*WARNING*: The matrix is probably a corrected matrix that contains NaN bins. This bins can not be merged and are removed. It is preferable to first merge bins in a uncorrected matrix and then correct the matrix. Correction factors, if present, are removed as well.
Sincerely, Zheng zhuqing