deeptools / HiCExplorer

HiCExplorer is a powerful, easy-to-use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

hicDetectLoops gives zero or artificial results #375

Closed Phlya closed 5 years ago

Phlya commented 5 years ago

I was excited to see that hicExplorer has got a loop detection method and decided to try it. I used the deep Bonev et al. mESC Hi-C data, on which I have previously successfully called ~9000 loops using cooltools call-dots. However, hicDetectLoops gives zero or artificial results. E.g. I ran it on 10 kb resolution data with recommended (default) settings and very liberal p-values, hicDetectLoops -m $1::/resolutions/$2 -p 0.1 -pp 0.1 --chromosomes chr19 -o $1\_$2.loops.bed (with appropriate arguments provided to the script), but here is the output:

INFO:hicmatrix.HiCMatrix:Number of poor regions to remove: 327 {'chr19': 327}
INFO:hicmatrix.HiCMatrix:found existing 327 nan bins that will be included for masking 
INFO:hicexplorer.hicDetectLoops:Computed loops for chr19: 0
INFO:hicexplorer.hicDetectLoops:Number of detected loops for all regions: 0

I tried lowering -pit, then I can get one or two dozen calls in the chromosome, but they all correspond to some artefacts on the edges of empty regions of the Hi-C map. Just wanted to get your thoughts about this, and whether there is something wrong with how I am doing it, thanks.

joachimwolff commented 5 years ago

Hi Ilya,

Thanks for testing the new loop detection feature, and I am sorry to hear it is not working as expected. I guess you are more or less the first user testing it outside of our lab; however, on Tuesday I am back in the office and will test this dataset myself.

I don't see anything wrong on your side. I tested the loop detection on the Rao 2014 dataset, GM12878 cells. Can you test if it works for you there?

Thanks a lot.

Joachim


Phlya commented 5 years ago

Thanks Joachim. I'm downloading a 10kb cooler for GM12878 MboI from ftp://cooler.csail.mit.edu/coolers/hg19/ to try it.

Phlya commented 5 years ago

Seems to work much better with that data, I get 2846 loops - I haven't checked them visually yet though.

Phlya commented 5 years ago

(Which is still far fewer than reported by Rao et al, by the way)

joachimwolff commented 5 years ago

Hi,

On Rao 2014, with the p-value for negative binomial candidate selection set to < 0.05, the p-value for the Anderson-Darling test set to < 0.025, a minimal peak interaction height of 20, a peak width of 6 and a window size of 10, I detect 12315, 10125 and 9322 loops on GSE63525_GM12878_insitu_primary+replicate_combined.hic.gz, GSE63525_GM12878_insitu_primary.hic.gz and GSE63525_GM12878_insitu_replicate.hic.gz from GSE63525, respectively (converted to cool format via hic2cool at 10 kb, with KR correction applied via hicConvertFormat). Juicer's HiCCUPS detects similar results, and the overlap with CTCF, RAD21 and SMC3 is similar between HiCExplorer and HiCCUPS.

On the other datasets provided by Rao 2014, the detection is not as good, but HiCCUPS behaves equally here. My conclusion is that it depends on the read coverage and therefore on the sparsity of the Hi-C matrix. See the attached image from the not-yet-published paper about this algorithm: loops

However, a few questions from my side: did you detect the 9000 loops just on chromosome 19 with cooltools, or on the full genome? Have you tested how HiCCUPS behaves?

Phlya commented 5 years ago

Sorry for the confusion, I used the default settings for GM data, which, I think, match what you just described. I will repeat it with hic2cool conversion from the .hic files.

No, cooltools found ~9000 in the whole genome; cooltools is a reimplementation of HiCCUPS, and their results are extremely similar. However there, as in HiCCUPS, I did loop calling at 5 kb, 10 kb and 25 kb resolutions and then merged the calls, and the max distance was much longer, I think 20 Mb. But the vast majority of the loops are within 2 Mb anyway.

Phlya commented 5 years ago

I just wanted to check how you "applied KR correction via hicConvertFormat" - something like this?

hicConvertFormat -m $1.mcool::resolutions/{1000,5000,10000,25000,50000,100000,250000,500000,1000000} -o $1.KR.mcool --inputFormat cool --outputFormat mcool --correction_name KR

It would be great to add a --correction_name argument to hicDetectLoops to use converted coolers directly, by the way.

Phlya commented 5 years ago

This seemed to be working, but it required more than the 128 GB of memory that I requested on the cluster... Does it load all of the data into memory at once?

It does work with single resolutions up to 5 kb, so that's OK, but the memory usage is still much higher than I expected.

Phlya commented 5 years ago

OK. So far I have converted GSE63525_GM12878_insitu_primary.hic to mcool using hic2cool. Then used hicConvertFormat (hicConvertFormat -m $1.mcool::resolutions/10000 -o $1.KR_10000.cool --inputFormat cool --outputFormat cool --correction_name KR) and then detected loops with default settings (hicDetectLoops -m $1 -t 4 -o $1.loops.bed). Here is the output:

INFO:hicexplorer.hicDetectLoops:Computed loops for 1: 193
INFO:hicexplorer.hicDetectLoops:Computed loops for 2: 213
INFO:hicexplorer.hicDetectLoops:Computed loops for 3: 185
INFO:hicexplorer.hicDetectLoops:Computed loops for 4: 121
INFO:hicexplorer.hicDetectLoops:Computed loops for 5: 130
INFO:hicexplorer.hicDetectLoops:Computed loops for 6: 135
INFO:hicexplorer.hicDetectLoops:Computed loops for 7: 114
INFO:hicexplorer.hicDetectLoops:Computed loops for 8: 109
INFO:hicexplorer.hicDetectLoops:Computed loops for 9: 101
INFO:hicexplorer.hicDetectLoops:Computed loops for 10: 99
INFO:hicexplorer.hicDetectLoops:Computed loops for 11: 118
INFO:hicexplorer.hicDetectLoops:Computed loops for 12: 132
INFO:hicexplorer.hicDetectLoops:Computed loops for 13: 77
INFO:hicexplorer.hicDetectLoops:Computed loops for 14: 73
INFO:hicexplorer.hicDetectLoops:Computed loops for 15: 87
INFO:hicexplorer.hicDetectLoops:Computed loops for 16: 89
INFO:hicexplorer.hicDetectLoops:Computed loops for 17: 74
INFO:hicexplorer.hicDetectLoops:Computed loops for 18: 54
INFO:hicexplorer.hicDetectLoops:Computed loops for 19: 93
INFO:hicexplorer.hicDetectLoops:Computed loops for 21: 35
INFO:hicexplorer.hicDetectLoops:Computed loops for 20: 63
INFO:hicexplorer.hicDetectLoops:Computed loops for 22: 54
WARNING:hicmatrix.HiCMatrix:Bin size is not homogeneous. Median 10000

INFO:hicexplorer.hicDetectLoops:Computed loops for MT: 0
INFO:hicexplorer.hicDetectLoops:Computed loops for Y: 0
INFO:hicexplorer.hicDetectLoops:Computed loops for X: 93
INFO:hicexplorer.hicDetectLoops:Number of detected loops for all regions: 2442

So somehow I get fewer than 1/4 of the loops that you can detect... Any idea what might be causing this?

Waiting for other files to be processed too, all these conversions take surprisingly long.

Phlya commented 5 years ago

A couple of observations, which may or may not be related:

[image]

Phlya commented 5 years ago

OK, basically, the weights in the original mcools are correct and divisive, and the converted coolers are fine too, with correct multiplicative weights. The weirdness is that the weights don't bring the total of each column/row to 1, so the very small numbers close to the telomere misled me into thinking there was a problem. I still think converting NaNs to 1 is not the right thing to do, though...
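
For reference, the divisive vs. multiplicative conventions debated in this thread can be sketched with a toy numpy example (illustrative only, not hic2cool or HiCExplorer code; a real correction applies one weight per bin along both axes of the matrix):

```python
import numpy as np

# Toy 1-D illustration of the two correction-factor conventions:
# cool-style multiplicative weights vs. hic-style divisive factors.
counts = np.array([10.0, 7.0, 4.0])

mult_w = np.array([0.5, 1.0, 2.0])   # multiplicative: corrected = counts * w
div_w = 1.0 / mult_w                 # divisive: corrected = counts / w

corrected_mult = counts * mult_w
corrected_div = counts / div_w

# Both conventions encode the same correction; converting between them
# is just taking the reciprocal of the factors.
print(corrected_mult)                              # [5. 7. 8.]
print(np.allclose(corrected_mult, corrected_div))  # True
```

This is why mislabeling the convention is so damaging: applying divisive factors as if they were multiplicative squares the error instead of cancelling it.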

I have now run all files through, and I get too few loop calls everywhere:

2243 GSE63525_GM12878_insitu_primary_30.hic.KR_10000.cool.loops.bed
2442 GSE63525_GM12878_insitu_primary.hic.KR_10000.cool.loops.bed
2736 GSE63525_GM12878_insitu_primary+replicate_combined_30.hic.KR_10000.cool.loops.bed
3012 GSE63525_GM12878_insitu_primary+replicate_combined.hic.KR_10000.cool.loops.bed
1969 GSE63525_GM12878_insitu_replicate_30.hic.KR_10000.cool.loops.bed
2340 GSE63525_GM12878_insitu_replicate.hic.KR_10000.cool.loops.bed

Phlya commented 5 years ago

Another observation. The cooler I obtained from cooler.csail.mit.edu was created very long ago, and its weights were not normalizing the row/col sums to 1. If I rebalance it with the new cooler, so that row/col sums equal 1, hicDetectLoops can't find anything: 0 loops in total. So I guess it can't handle values below 1 after balancing? I guess this would be the reason why it failed on the Bonev data, which was created with a much newer cooler.
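
A toy numpy sketch of that behaviour (a simple Sinkhorn-style iteration, not cooler's actual algorithm): balancing that forces row/column sums to 1 pushes every matrix entry well below 1, the range the thread suggests hicDetectLoops cannot handle.

```python
import numpy as np

# Hypothetical tiny contact matrix; real Hi-C matrices are far larger and sparser.
counts = np.array([
    [10.0, 4.0, 1.0],
    [4.0, 8.0, 2.0],
    [1.0, 2.0, 6.0],
])

def balance_weights(m, n_iter=200):
    """Iteratively find multiplicative per-bin weights w such that the
    balanced matrix m * outer(w, w) has row/column sums of ~1
    (a sketch, not cooler's actual implementation)."""
    w = 1.0 / np.sqrt(m.sum(axis=1))
    for _ in range(n_iter):
        s = (m * np.outer(w, w)).sum(axis=1)
        w /= np.sqrt(s)
    return w

w = balance_weights(counts)
balanced = counts * np.outer(w, w)

print(balanced.sum(axis=1))  # every row sum is ~1
print(balanced.max() < 1.0)  # True: all entries end up below 1
```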

joachimwolff commented 5 years ago

Hi Ilya,

many thanks for your extensive testing and debugging of HiCExplorer.

There are some issues you recognized, and not all of them are related; they are, however, annoying and a bit frustrating (also for us).

The problem with the file formats and correction factors is that the 'big guys', a.k.a. the 4D Nucleome consortium, cannot agree on how to store the data. This results in two major file formats: hic and cool. hic applies and stores the correction factors in divisive form; cool stores them in multiplicative form. With hic2cool a conversion tool is available, but its developers decided to increase the mess: with versions < 0.4.2, hic2cool converts the divisive factors to multiplicative ones; starting with version 0.5 it keeps the divisive format. With HiCExplorer 3.0 and HiCMatrix 9 I try to cover this; however, if a cool file was created without a version tag, I assume it is multiplicative. If it is an older file, I don't know whether any metadata was written, or whether a tool wrote the correction factors on its own, and then I cannot test if they are multiplicative or divisive. HiCMatrix version 9 is quite new, so maybe not everything is covered as it should be.

Another issue is where to store the values. The default in cooler is that the correction factors are stored in the weight column, and this is the only column name we support across all HiCExplorer tools for applying the correction. hic and hic2cool store them in columns like KR or VC; I decided not to support this because it is against the standard and makes the source code unnecessarily complex. Even more important: I think a correction is usually applied once, not multiple corrections tested within one analysis. What we do support is converting the format via hicConvertFormat. The correct way is to apply the correction on a single resolution; multicool is not supported.

Finally, the correction itself. The correction algorithms try to make each row/column sum equal or close to 1. This is fine, but our viewpoint is that Hi-C is count data, and therefore it needs to be scaled back to its previous value range after the correction. None of the HiCExplorer tools works well (or at all) with a value range of 0 - 1.
In particular, hicDetectLoops will not detect anything. First, there is a parameter called peakInteractionsThreshold with a default value of 20: 'The minimum number of interactions a detected peak needs to have to be considered. The number of interactions decreases with increasing genomic distances: peakInteractionsThreshold/log(genomic distance)'. This parameter prunes candidate regions that were detected but are too small; the reasoning is that I don't believe such small interaction values really come from an enriched region. Second, our test of the peak region vs. the background needs to detect a significant difference; if the interaction values are all quite low, no difference between the distributions will be detected. This last point is important because it relates to read coverage: if the read coverage is too low to detect a region, nothing will be detected. I assume this was the case for the files you tested.
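
The pruning rule quoted above can be sketched like this (a sketch of the described formula, not the actual HiCExplorer source; keep_candidate is a hypothetical helper name):

```python
import numpy as np

def keep_candidate(interactions, genomic_distance, pit=20):
    """Keep a candidate peak only if its interaction value reaches
    peakInteractionsThreshold / log(genomic distance), per the quoted help text."""
    return interactions >= pit / np.log(genomic_distance)

# A raw count of 5 at 1 Mb passes the default threshold (20 / log(1e6) ~ 1.45),
print(keep_candidate(5, 1_000_000))      # True
# but a sum-to-1 balanced value of ~0.001 never can:
print(keep_candidate(0.001, 1_000_000))  # False
```

This makes concrete why sum-to-1 balanced matrices yield zero loops: every candidate is pruned before the statistical test even runs.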

NaN vs 1 issue: if NaNs are used, some computations can fail because numpy handles it as follows: as soon as one value of an operation is NaN, the result is NaN too. To cover this, all NaNs are set to 1, the neutral element for both multiplication and division.
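
A minimal numpy demonstration of that propagation and of the NaN-to-1 substitution:

```python
import numpy as np

counts = np.array([10.0, 7.0, 4.0])
weights = np.array([0.9, np.nan, 1.1])   # a bin without a correction factor

# Any arithmetic involving NaN yields NaN, so the middle bin is lost:
print(counts * weights)                  # [ 9.  nan  4.4]

# Replacing NaN with 1, the neutral element for both multiplication and
# division, keeps the computation alive (at the cost of "correcting" by 1):
weights_filled = np.nan_to_num(weights, nan=1.0)
print(counts * weights_filled)           # [ 9.   7.   4.4]
```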

I tested with Rao2014-GM12878-MboI-allreps-filtered.10kb.cool from the ftp server you mentioned. With default settings it detects 8620 loops, which is in the expected range, and the detected loops are for the most part what I expect to see. For example chr1 18-22 Mb and chr4 18-22 Mb. On chr4 there are two or three loops that are false positives.

[images: chr1 18-22 Mb, chr4 18-22 Mb]

joachimwolff commented 5 years ago

Due to the lack of documentation: how did you run cooltools call-dots? I see you need an expected signal file; how do you create it?

Phlya commented 5 years ago

Thank you for your explanation. I am aware of the whole multiplicative/divisive problem; I thought something about that was wrong, but it wasn't. And yeah, I now see why coolers with weights that scale everything to 1 don't work with this tool... But would it work if the minimal number of interactions were set to a very small value?

I really don't understand why it doesn't work for me with the cooler file from csail; maybe it's a version issue or something? Could you share your environment info? Your results look very good.

I am sorry, I am currently away, but I used the loop annotations in this preprint, and the methods should describe how I did it: https://www.biorxiv.org/content/10.1101/527309v1

Phlya commented 5 years ago

"We used cooltools call-dots reimplementation of the HiCCUPS algorithm (Rao et al., 2014) from dekkerlab/shrink-donut-dotfinder (commit 377106e). This was applied with default settings (except for lower FDR threshold of 0.1) to reanalysed mapq≥30 filtered mESC Hi-C data at 5 kb, 10 kb and 25 kb resolution to find areas of local enrichment of interactions between loci up to 20 Mb away. Calls from different resolutions were combined using a custom script following the HiCCUPS merging procedure."

If you want, the cooler file I used is available here: https://github.com/Phlya/coolpuppy_paper/tree/master/coolers

And here is the function I used to merge different resolutions, by the way: https://gist.github.com/Phlya/340af6c45900902310a7bb8d9cc60537

Phlya commented 5 years ago

I just tried giving a very small number instead of 20 for -pit, and it fails because it expects an int value. Is there a specific reason for this? It would be the simplest solution for using cooler-balanced data...

joachimwolff commented 5 years ago

One reason is the thought that we work with count data, and count data is always a positive integer. I am not a supporter of the idea of normalizing everything to a 0 - 1 range, or, even worse, of accepting negative values.

Another reason is that I am not sure at the moment how well our statistical test (Anderson-Darling) works with non-count data, i.e. whether the differences between the two distributions are big enough to be detected if all values are quite small. I guess it will not detect anything as significantly different.

About my environment: Nothing special, up-to-date HiCExplorer version 3 and all dependencies updated. cooler version 0.8.3, hic2cool version 0.5, python 3.6, HiCMatrix version 9. Maybe it helps to install everything in a new and clean conda environment to exclude possible interference with other (older?) versions of a dependency.

I try to have a closer look at the linked files from you as fast as possible.

Phlya commented 5 years ago

Well, you can consider Hi-C data as probabilities instead of counts, I suppose? I don't see any use for negative values either, though.

I think HiCCUPS has a trick of converting normalized values back to unnormalized ones at some stage, so that their distribution can be described as Poisson? Maybe something similar could be used here?

Anyway, it should also be possible to divide the weights by their mean to bring them to around one, and thus get back to the original unnormalized data scale?

Thank you for looking into this!

joachimwolff commented 5 years ago

Hi,

I wanted to download the Bonev mcool file, but it is managed by git lfs, and if I try to fetch it, I get the error that the files do not exist on the server.

Fetching master
Git LFS: (0 of 0 files, 7 skipped) 0 B / 0 B, 20.77 GB skipped
[043ef0312031cc7a89f5ad1d08a0ce768c8ee4f4995f23716ab819c22ddeb3f3] Object does not exist on the server: [404] Object does not exist on the server
[c26aefe05f4676e74371ac9864af085c40a62868969bcc48a7fb70ecb8b24e5b] Object does not exist on the server: [404] Object does not exist on the server
[f0b1aca5e056d0fa838beb344bc0dc4b39809608355de6a02534e57149ad767c] Object does not exist on the server: [404] Object does not exist on the server
[204dff32b042d88a8e2f8082091b7e88113c4301a4936b1a05832a08d7f7e07c] Object does not exist on the server: [404] Object does not exist on the server
[36966b2276b667b2b92992972ac15d9620e954053599af58c6967f6784af720a] Object does not exist on the server: [404] Object does not exist on the server
[2699f923ca43e95e48e051a32b5921a19d4ed1799c4d14c2872034cf8e7450a4] Object does not exist on the server: [404] Object does not exist on the server
[955d17a87afe56e287bc7808ff05680a2aed1faa983e351e089ecf68fd2dff1a] Object does not exist on the server: [404] Object does not exist on the server
error: failed to fetch some objects from 'https://github.com/Phlya/coolpuppy_paper.git/info/lfs'

Have I done something wrong? Or is it possible for you to provide the files another way?

Best,

Joachim

Phlya commented 5 years ago

I'm not sure why it wouldn't work with git fetch... I have very little experience with git lfs. But I sent you a link by email.

Phlya commented 5 years ago

Indeed, it was a problem with the versions of something. I created a new environment with the same versions you mention above, and the results are more consistent with yours, although not identical; the numbers are still slightly smaller.

   6626 GSE63525_GM12878_insitu_primary_30.hic.KR_10000.cool.loops.bed
   7569 GSE63525_GM12878_insitu_primary.hic.KR_10000.cool.loops.bed
   8416 GSE63525_GM12878_insitu_primary+replicate_combined_30.hic.KR_10000.cool.loops.bed
   9218 GSE63525_GM12878_insitu_primary+replicate_combined.hic.KR_10000.cool.loops.bed
   5709 GSE63525_GM12878_insitu_replicate_30.hic.KR_10000.cool.loops.bed
   6872 GSE63525_GM12878_insitu_replicate.hic.KR_10000.cool.loops.bed
   8620 Rao2014-GM12878-MboI-allreps-filtered.10kb.cool.loops.bed

The versions of everything are fairly recent in both environments, but there are a lot of differences. I looked at some obvious candidates (cooler, numpy, scipy, hicexplorer), and either the versions are the same, or there is nothing obvious in the release notes about changes that could affect this. But it is something to watch out for: something is very sensitive to the versions of some packages, and pinning the versions you told me about helps, but the results are still not the same.

joachimwolff commented 5 years ago

Thanks, I received your mail and I am downloading the data. It will take an hour or two.

Concerning the contact probabilities: the problem is that if an algorithm is designed for count data, it is not guaranteed to work with probabilities too. For example, some statistical tests only work with integers. Another issue with probabilities is that they are floats, and floats can be too inaccurate, especially if they are really small. This changes the data and can corrupt the signal. A solution could be an additional parameter that simply multiplies all probability values by e.g. 10000 to get count data again.
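
That suggested workaround could look like this (a sketch; to_pseudo_counts and the factor 10000 are illustrative, not an existing hicDetectLoops option):

```python
import numpy as np

def to_pseudo_counts(balanced_values, scale=10_000):
    """Rescale probability-like balanced values into an integer, count-like
    range so that count-based statistics remain applicable."""
    return np.rint(np.asarray(balanced_values) * scale).astype(int)

probs = np.array([0.0003, 0.0017, 0.0251])
print(to_pseudo_counts(probs))   # [  3  17 251]
```

Rounding to integers loses some precision, but it keeps the relative magnitudes, which is what the enrichment tests compare.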

Concerning the version numbers: I will recalculate the loops with the data and will write down and document each step of the analysis, in the hope that you can reproduce it. A usual suspect is numpy; we always have to adjust our test cases from version to version, and we test only the first decimal digit, since higher precision is usually not kept between versions.

Phlya commented 5 years ago

The best option would be to just divide the weights by their mean to make them be around 1, don't you think? That would bring the scale of balanced data back to the original unbalanced values, and back to how HiCExplorer normally has the weights.
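
A sketch of that rescaling (illustrative numpy only, with hypothetical weight values; not a feature of either tool):

```python
import numpy as np

# Hypothetical sum-to-1 cooler weights: tiny numbers of similar magnitude.
weights = np.array([0.010, 0.012, 0.008, 0.011])

# Dividing by the mean preserves the relative correction between bins but
# centres the weights around 1, so corrected values stay near raw-count scale.
rescaled = weights / np.nanmean(weights)

print(rescaled.mean())   # ~1.0
print(rescaled)          # each weight now close to 1
```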

The numpy versions vary only in the last digit; in my older environment it was 1.16.2.

I also want to bring up the NaN issue again. I understand that making sure all steps work correctly with input containing NaNs is a pain, but having calls in abnormal regions of very high/low coverage that would otherwise be "masked" is an easy way to get artefacts... In cooltools, all regions of interest containing more than a certain number of NaNs are simply ignored, I think.

Phlya commented 5 years ago

"Do to lack of documentation: How did you run the cooltools call-dots? I see you need some expected signal file, how do you create this?" Sorry, forgot to reply about expected - cooltools compute-expected can be used to make it.