deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Clarifications of parameters #681

Closed gdolsten closed 3 years ago

gdolsten commented 3 years ago

Hi, I am trying to detect loops using hicExplorer but I am having some difficulty understanding the parameters I am trying to optimize. Can you clarify this one:

--peakInteractionsThreshold, -pit
The minimum number of interactions a detected peak needs to have to be considered (Default: 10).

Does this mean the number of reads in a bin? What does "number of interactions" mean exactly?

Also, what is the difference between:

--pValuePreselection, -pp
Only candidates with p-values less the given threshold will be considered as candidates. For each genomic distance a negative binomial distribution is fitted and for each pixel a p-value given by the cumulative density function is given. This does NOT influence the p-value for the neighborhood testing. Can a single value or a threshold file created by hicCreateThresholdFile (Default: 0.1).

and

--pValue, -p
Rejection level for the Anderson-Darling or Wilcoxon rank-sum test of H0. H0 is that the peak region and background have the same distribution (Default: 0.025).


And when should I modify one versus the other? Am I correct in understanding that the first p-value is for distance, and the second p-value is for the distribution of pixels within the peak?

joachimwolff commented 3 years ago

Hi,

--peakInteractionsThreshold defines the minimum number of interactions two regions need to have (i.e. the value in the corresponding bin) to be considered as a peak. The number of interactions is the number of counted interactions between two regions that fall into one bin. The idea is that a user can exclude peak candidates with a low absolute interaction number.
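As an illustration (this is not HiCExplorer's actual code, and all names here are hypothetical), the effect of --peakInteractionsThreshold is essentially a filter on the raw bin values:

```python
import numpy as np

# Hypothetical sketch of the --peakInteractionsThreshold filter: keep only
# candidate pixels whose raw interaction count meets the minimum (default 10).
def filter_candidates(counts, candidates, min_interactions=10):
    """counts: 2D array of raw interaction counts per bin pair;
    candidates: list of (i, j) bin-index pairs."""
    return [(i, j) for i, j in candidates if counts[i, j] >= min_interactions]

counts = np.array([[0, 12, 3],
                   [12, 0, 25],
                   [3, 25, 0]])
# (0, 2) has only 3 interactions and is dropped; the other two pass.
print(filter_candidates(counts, [(0, 1), (0, 2), (1, 2)]))
```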

--pValuePreselection defines the p-value that an interaction needs to have to be considered as a peak candidate (it can be a single value or a file with genomic distances and their tab-separated threshold values; see hicCreateThresholdFile). We fit a distribution per genomic distance, and an interaction is considered a peak candidate only if its p-value under this distribution is lower than or equal to the given threshold.

--pValue: Given a peak candidate, we 'draw' a square around it and define it as the peak region, and a second, larger square around it to define a neighborhood. Only if the Wilcoxon rank-sum test indicates that the peak and a) the horizontal neighborhood, b) the vertical neighborhood, and c) the bottom-left corner do not come from the same distribution is the peak candidate accepted as a peak. The value of --pValue is used to decide whether H0 is accepted or rejected.
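The neighborhood test can be sketched roughly like this (illustrative only; the function and the way values are collected are hypothetical, not HiCExplorer's implementation):

```python
import numpy as np
from scipy.stats import ranksums

# Illustrative sketch: the candidate is accepted only if the peak region
# differs significantly from every tested neighborhood under the Wilcoxon
# rank-sum test at the --pValue rejection level (default 0.025).
def accept_peak(peak_values, neighborhoods, p_value=0.025):
    """peak_values: values in the inner square around the candidate;
    neighborhoods: dict name -> values (horizontal, vertical, bottom-left)."""
    for values in neighborhoods.values():
        stat, p = ranksums(peak_values, values)
        if p >= p_value:  # H0 (same distribution) not rejected -> discard
            return False
    return True

rng = np.random.default_rng(0)
peak = rng.poisson(50, 25)  # clearly enriched inner square
background = {name: rng.poisson(5, 25)
              for name in ("horizontal", "vertical", "bottom_left")}
print(accept_peak(peak, background))
```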

Please check that you use the latest HiCExplorer version, 3.6. HiCExplorer 3.3 and 3.4 had a different implementation of the loop detection, which was less accurate, slower, and required more memory. See also: https://hicexplorer.readthedocs.io/en/latest/content/tools/hicDetectLoops.html.

Best,

Joachim

gdolsten commented 3 years ago

Ah, ok, thanks. I think the problem for me is that the cool file may be balanced, so HiCExplorer would not work. Do you know if anything can be done to use hicDetectLoops even with a balanced cooler file?

gdolsten commented 3 years ago

The mcool does have count and balanced data, but for some reason hicDetectLoops still finds zero loops. Is it using the correct one?

chrom1  start1    end1      chrom2  start2    end2      count  balanced
chr3R   10000000  10000800  chr3R   10000000  10000800  4650   0.145219
chr3R   10000000  10000800  chr3R   10000800  10001600  5744   0.2025
chr3R   10000000  10000800  chr3R   10001600  10002400  755    0.0335492

joachimwolff commented 3 years ago

Please use a count matrix and not a 0-to-1 normalized one! Moreover, the loop detection should work fine on 5 kb to 20 kb data; it was developed on 10 kb. What resolution do you have?

gdolsten commented 3 years ago

I am using 1 kb right now. As you can see, the matrix has a "count" column and a "balanced" column. Would HiCExplorer automatically use the 'count' column, or do I need to pass this in somehow? This is a .mcool file.

joachimwolff commented 3 years ago

I am not sure where this data comes from. A cool file usually has a 'count' column and a 'weight' column; a 'balanced' one is not defined in the file format standard. Can you please run hicInfo -m matrix.mcool::/path/to/internal/cool and post the output?

Moreover, 1 kb is too high a resolution; please use 10 kb. Without knowing your data, I assume your read coverage is not high enough to work with 1 kb!

gdolsten commented 3 years ago

Here is the output:

Matrix information file. Created with HiCExplorer's hicInfo version 3.6

File: data.mcool::/resolutions/800
Date: 2021-01-13T23:51:30.755192
Genome assembly: unknown
Size: 171,964
Bin_length: 800
Chromosomes:length: chr2L: 23513712 bp; chr2R: 25286936 bp; chr3L: 28110227 bp; chr3R: 32079331 bp; chr4: 1348131 bp; chrX: 23542271 bp; chrY: 3667352 bp; chrM: 19524 bp;
Number of chromosomes: 8
Non-zero elements: 418,084,961
The following columns are available: ['chrom' 'start' 'end' 'weight']

joachimwolff commented 3 years ago

So that is 800 base pair resolution. I don't say it will not work; however, all the default parameters are for 10 kb resolution. Moreover, please investigate by plotting whether you see any loops at all.

I highly recommend you use a 10 kb resolution. May I ask what you want to achieve by detecting loops at such a high resolution? It is really uncommon.

Last, you can see that there is a weight column in the cooler file; however, I have no clue where the 'balanced' field you mentioned above comes from, nor how you generated this output. To test what value range you really have, I recommend using the hicInfo tool with the "no metadata" parameter. Part of the tool output will give you the maximal value. However, loading an 800 base pair resolution matrix requires a lot of memory, so you might run out of memory calling this command.

gdolsten commented 3 years ago

I will look into using the 10 kb resolution, thanks for that advice; I am new to Hi-C analysis. In the meantime, what should I do if the weights are already normalized, since you mention this would mess up HiCExplorer?

joachimwolff commented 3 years ago

The weights are correction factors; I doubt they are normalized. I think you are confusing the terminology in Hi-C. A matrix can be corrected with algorithms like ICE or KR, and the correction factors are stored in the 'weight' column of the cooler file. Please read the cooler documentation to learn about the structure and the data of cooler files and Hi-C.
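To illustrate the relationship (with made-up numbers, following cooler's convention that a balanced pixel value is the raw count multiplied by the two bins' correction factors):

```python
import numpy as np

# Illustrative sketch of cooler-style balancing: each bin i has a correction
# factor w[i], and a pixel between bins i and j is balanced as
#   balanced_ij = count_ij * w[i] * w[j]
counts = np.array([4650.0, 5744.0, 755.0])  # raw 'count' column
w = {0: 0.005, 1: 0.006, 2: 0.007}          # hypothetical per-bin weights
bins = [(0, 0), (0, 1), (0, 2)]             # (bin1_id, bin2_id) per pixel

balanced = np.array([c * w[i] * w[j] for c, (i, j) in zip(counts, bins)])

# Given balanced values and the weights, the raw counts can be recovered:
recovered = np.array([b / (w[i] * w[j]) for b, (i, j) in zip(balanced, bins)])
print(np.allclose(recovered, counts))
```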

To normalize in Hi-C means to adjust the value range, for example to a certain read coverage, i.e. the sum of all valid Hi-C pairs in your matrix. Another method would be to normalize to a 0-1 range. Our tool hicNormalize supports both, because we don't want to tell the user what to do; in the end, you are responsible for what you are doing in an analysis. However, we highly recommend not to break the character of Hi-C data.

Hi-C data are interaction or count values, i.e. discrete numbers. Applying the correction already breaks this by transforming the data into the continuous number space, but we don't have anything better. The nature of the correction algorithms is to correct to a value range of 0-1; some correction software keeps this (I think cooltools balance does), but we and others (e.g. Juicer) think this is wrong and scale the value range back to the original one. For this reason, our algorithms expect values in the original value range, not in a 0-1 value range.
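The two normalization modes described above can be sketched as follows (illustrative only, not hicNormalize's code); note how the 0-1 rescale destroys the count-like value range that the loop detection expects:

```python
import numpy as np

# Illustrative sketch of the two normalization modes discussed above.
def normalize_to_coverage(matrix, target_sum):
    """Scale so all values sum to a target read coverage."""
    return matrix * (target_sum / matrix.sum())

def normalize_01(matrix):
    """Rescale values into the 0-1 range (breaks the count character)."""
    return (matrix - matrix.min()) / (matrix.max() - matrix.min())

m = np.array([[0., 12., 3.],
              [12., 0., 25.],
              [3., 25., 0.]])
print(normalize_to_coverage(m, 100.0).sum())  # sums to the target coverage
print(normalize_01(m).min(), normalize_01(m).max())  # squeezed into [0, 1]
```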

sabrsyed commented 2 years ago

Hi, hope it's okay for me to tag onto this post (which was very informative, thanks!). I'm using hicDetectLoops 3.4.3 on a Hi-C dataset. The dataset was analyzed with distiller-nf, aligned to hg38. When using the file.10000.cool that was generated, I detected 0 loops, even after making all of the parameters very liberal.

I got the following from hicInfo:

Matrix information file. Created with HiCExplorer's hicInfo version 3.4.3

File: file.10000.cool
Size: 308,839
Bin_length: 10000
Sum of matrix: 362928.74353385356
Chromosomes:length: chr1: 248956422 bp; chr2: 242193529 bp; chr3: 198295559 bp; chr4: 190214555 bp; chr5: 181538259 bp; chr6: 170805979 bp; chr7: 159345973 bp; chr8: 145138636 bp; chr9: 138394717 bp; chr10: 133797422 bp; chr11: 135086622 bp; chr12: 133275309 bp; chr13: 114364328 bp; chr14: 107043718 bp; chr15: 101991189 bp; chr16: 90338345 bp; chr17: 83257441 bp; chr18: 80373285 bp; chr19: 58617616 bp; chr20: 64444167 bp; chr21: 46709983 bp; chr22: 50818468 bp; chrX: 156040895 bp; chrY: 57227415 bp; chrM: 16569 bp;
Non-zero elements: 100,064,359
Minimum (non zero): 3.576611340339027e-07
Maximum: 7.668583851074945
NaN bins: 37506
The following columns are available: ['chrom' 'start' 'end' 'weight']

In reading over issue #681 and issue #375, I thought maybe my cooler file correction was a problem, so I then tried KR correction on the file.10000.cool file. hicInfo shows the following now:

Matrix information file. Created with HiCExplorer's hicInfo version 3.4.3

File: file.10000.corrKR.cool
Size: 308,839
Bin_length: 10000
Sum of matrix: 146378082.5
Chromosomes:length: chr1: 248956422 bp; chr2: 242193529 bp; chr3: 198295559 bp; chr4: 190214555 bp; chr5: 181538259 bp; chr6: 170805979 bp; chr7: 159345973 bp; chr8: 145138636 bp; chr9: 138394717 bp; chr10: 133797422 bp; chr11: 135086622 bp; chr12: 133275309 bp; chr13: 114364328 bp; chr14: 107043718 bp; chr15: 101991189 bp; chr16: 90338345 bp; chr17: 83257441 bp; chr18: 80373285 bp; chr19: 58617616 bp; chr20: 64444167 bp; chr21: 46709983 bp; chr22: 50818468 bp; chrX: 156040895 bp; chrY: 57227415 bp; chrM: 16569 bp;
Non-zero elements: 102,214,396
Minimum (non zero): 1
Maximum: 931193
NaN bins: 21692
The following columns are available: ['chrom' 'start' 'end']

I am now able to call loops, but I'm only getting ~400, and my parameters are very loose (p-value = 0.1, pit = 2). My understanding of file formats, corrections, etc. is minimal, so I was hoping for your insight into what's going on with my dataset. In the meanwhile, I can test Rao et al. GSE63525_GM12878 using the parameters you outlined in issue #375.