deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
233 stars 70 forks source link

Boundary size difference after matrix correction #679

Closed Drosophilid closed 3 years ago

Drosophilid commented 3 years ago

Hi,

I have confusion regarding the TAD boundary size.

  1. Before correcting the matrix, when I call the TAD, the boundary size stays the same as my bin size.
  2. Whereas after matrix correction, when I call the TADs, the boundary size changes (become bigger than my bin size). is it normal or something wrong with my matrix? screenshots are attached

Thank you (1) Screenshot 2021-02-26 at 14 20 47 (2) Screenshot 2021-02-26 at 14 19 44

LeilyR commented 3 years ago

Which correction method did you use? Can you check if the bin sizes were already altered in the corrected matrix?

Drosophilid commented 3 years ago

Which correction method did you use?

I used ICE method and here is my command: hicCorrectMatrix correct --filterThreshold -1.2 4 -m chr2.h5 -o chr2.corrected.h5 Screenshot 2021-02-26 at 15 46 03

Can you check if the bin sizes were already altered in the corrected matrix? I don't know if bin sizes were altered.

LeilyR commented 3 years ago

which version are you using? the default is KR in the most recent version, so --filterThreshold -1.2 4 are not going to be used. Also you can try hicInfo for the matrix bin size. If they are not all equal to 5k it will tell you that your bins are homogenous.

Drosophilid commented 3 years ago

I'm using 2.2.1.1 version, but I can't find --correctionmethod option in this version.

hicCorrectMatrix: error: unrecognized arguments: --correctionMethod ICE Thanks

LeilyR commented 3 years ago

please update your tool to 3.6 which is our most recent version and let us know if you still had any problem left.

Drosophilid commented 3 years ago

I have uninstalled the old version and installed the new 3.6 version from https://github.com/deeptools/HiCExplorer/releases/tag/3.6 but after installing when I checked the version it still showing the 2.2.1.1 version. Can you please suggest how I can update the new version? Thank you

Screenshot 2021-02-26 at 19 04 57

joachimwolff commented 3 years ago

Please use conda and conda environments. There are a few issues here, maybe already closed ones where I describe how it exactly works.

Drosophilid commented 3 years ago

I have updated hicexplorer (version 3.6), but still, there is a difference in border size after ICE correction. This only happens after correcting the matrix per chr. If I correct all chromosomes together, there is no such issue. The problem is Y chromosome has a different total count vs frequency compared to other chromosomes in the genome and I think it's not a good idea to correct all chromosomes together. Can you please suggest something to avoid this border size difference issue to proceed with this case?

Thank you Screenshot 2021-03-06 at 17 59 20

joachimwolff commented 3 years ago

Chromosome Y is quite small and can cause issues; if you do not depend on it, it is usually a good idea to simply drop it.

Drosophilid commented 3 years ago

Thank you for your reply @joachimwolff I even tried to create a matrix even after removing the chrY from my genome using seqkit. But still, it did not solve the problem. The TAD boundaries are different in size compared to my defined bin size even after removing chrY. Then I tried KR correction method which works well but the total no of TADs are less in no (don't find any TAD on chrY with KR corrected matrix). The other reason to use ICE correction method is, I need to add chrY in my analysis. So, I really need to sort this out. any kind of help and suggestions will be appreciated. many Thanks

joachimwolff commented 3 years ago

Please provide a plot of chromosome Y via hicPlotMatrix using multiple resolutions.

//edit

Also provide the plots for the raw, the ICE and the KR corrected data. Looking at your diagram above, it seems you have not that many contacts on chromosome Y. The plots are necessary to investigate this more.

Drosophilid commented 3 years ago

Please find the attached image of chrY interaction matrix at different resolutions. Screenshot 2021-03-10 at 20 43 38

But the problem is even after excluding chrY, there are TAD boundary size differences after the ICE correction in other chromosomes as well.

joachimwolff commented 3 years ago

Concerning the TAD boundary sizes: This says only sth about the bin size of the specific boundary; however, it shall not influence the result that you have there a TAD.

The source code where this is calculated is the old legacy source code of a developer who is no longer involved in the HiCExplorer project. Therefore I need to check this old source in detail, which takes a while. Please be patient about this issue and understand I have a lot of other stuff to do, and I will come back to you about this as fast as possible.

My best guess for this is that some bins become NaN and are masked due to the correction. This can happen and is a property of the correction algorithms if too few interactions in specific areas are available. If this is the case, we are masking the areas, and if they are involved in the computation, an average of the border values with the next valid bin is calculated to determine the boundary size of the bin (this boundary is a bin boundary and has nothing to do with a TAD boundary!). For example: [(chr 1 5) (chr 5 10) (chr 10 15)] and let's say the middle one is NaN, the output boundaries of these bins will be: [(chr 1 7.5) (chr 7.5 15)] With this, we make sure all areas of the genome are covered, and there are no 'empty' or undefined regions.

The chr Y and TAD issue: I think you see it on your own: Your read coverage is way too low to detect anything on a 5 kb resolution on chr Y; you only plotted a 10 kb, and there are no structures, not on raw, not on ICE, not on KR. On 40kb, there are a few; however, still very weak. I would recommend using 100 kb or 200 kb for this chromosome if you want to detect something. If this is not an option, you need to have way better data than what you have at the moment.

joachimwolff commented 3 years ago

My best guess for this is that some bins become NaN and are masked due to the correction. This can happen and is a property of the correction algorithms if too few interactions in specific areas are available. If this is the case, we are masking the areas. If they are involved in the computation, an average of the border values with the next valid bin is calculated to determine the boundary size of the bin (this boundary is a bin boundary and has nothing to do with a TAD boundary!). For example: [(chr 1 5) (chr 5 10) (chr 10 15)] and let's say the middle one is NaN, the output boundaries of these bins will be: [(chr 1 7.5) (chr 7.5 15)] With this, we make sure all areas of the genome are covered, and there are no 'empty' or undefined regions.

To finally solve this issue: My theory was correct; this exactly what is happening and is the expected behavior. There is a combination of two events happening: Correcting and TAD computation.

  1. Correcting: If you use the ICE correction, we check in the end if certain bins are expanded more than expected: https://github.com/deeptools/HiCExplorer/blob/master/hicexplorer/hicCorrectMatrix.py#L753

    If this is the case, the concerning bins are masked, i.e., removed from the matrix. The bin intervals (or cut_intervals as they are called in the source code) do look after this, for example [(chr 0 5) (chr 5 10) (chr 10 15)] and let's say the middle one is NaN, the bin intervals will be: [(chr 0 5) (chr 10 15)]

  2. TAD computation: Before the boundaries are written out, their left and right boundaries are determined: https://github.com/deeptools/HiCExplorer/blob/e917b3e5ae3d15d81b7f364f73c575a337de6a69/hicexplorer/hicFindTADs.py#L994 and https://github.com/deeptools/HiCExplorer/blob/e917b3e5ae3d15d81b7f364f73c575a337de6a69/hicexplorer/hicFindTADs.py#L1042
right_bin_center = chr_start[idx] + int((chr_end[idx] - chr_start[idx]) / 2)
left_bin_center = chr_start[idx - 1] + int((chr_end[idx - 1] - chr_start[idx - 1]) / 2)

To translate this to the example from above [(chr 0 5), (chr 10 15)] :

right_bin_center = chr_start[1] + int((chr_end[1] - chr_start[1]) / 2)
left_bin_center = chr_start[1 - 1] + int((chr_end[1 - 1] - chr_start[1 - 1]) / 2)
right_bin_center = 10 + int((15 - 10) / 2)
left_bin_center = 0 + int((5 - 0) / 2)
right_bin_center = 10 + 2.5
left_bin_center = 0 +2.5
right_bin_center = 12.5
left_bin_center = 2.5

--> The boundary is (chr 2.5 12.5), and the bin size is 10 instead of 5.

If the middle element would not have been masked it would be [(chr 1 5) (chr 5 10) (chr 10 15)]:

right_bin_center = chr_start[1] + int((chr_end[1] - chr_start[1]) / 2)
left_bin_center = chr_start[1 - 1] + int((chr_end[1 - 1] - chr_start[1 - 1]) / 2)
right_bin_center = 5 + int((10 - 5) / 2)
left_bin_center = 0 + int((5 - 0) / 2)
right_bin_center = 5 + 2.5
left_bin_center = 0 +2.5
right_bin_center = 7.5
left_bin_center = 2.5

--> The boundary is (chr 2.5 7.5), and the bin size stays the same.

I hope this answer resolves your question.

Best,

Joachim

Drosophilid commented 3 years ago

Thank you so much @joachimwolff for the detailed answer, this makes more sense. Hope I'll get rid of it.

Thanks