deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Matrix resolution #52

Closed apaytuvi closed 7 years ago

apaytuvi commented 7 years ago

How do you choose the best resolution? How do I know that my matrix is too sparse?

Thank you.

apaytuvi commented 7 years ago

For example, which resolution would you choose here to find TADs (the first one or the second one)?

[image: comparison]

fidelram commented 7 years ago

I would use the smallest resolution matrix that you have. You can confirm the TAD classification by plotting the TAD-separation score (produced by hicFindTADs TAD_score) together with the Hi-C maps. The track for this should be something like:

[bedgraph matrix]
# a bedgraph matrix file is like a bedgraph, except that each bin has
# more than one value, separated by tabs. E.g.
# chrX       18279   40131   0.399113        0.364118        0.320857        0.274307
# chrX       40132   54262   0.479340        0.425471        0.366541        0.324736
# bedgraph matrices are produced by hicFindTADs
file = tad_score.bm
title = TAD separation score
width = 1.5
orientation = inverted
min_value = 0.10
max_value = 0.70
# if type is set as lines, then the TAD score lines are drawn instead
# of the matrix
# set to lines if a heatmap representing the matrix
# is not wanted
type = lines
file_type = bedgraph_matrix

See the documentation in http://hicexplorer.readthedocs.io/en/latest/content/tools/hicPlotTADs.html#hicplottads
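To make the bedgraph matrix format described in the track comments concrete, here is a minimal Python sketch that reads such lines into tuples. The function name `parse_bedgraph_matrix` is hypothetical and not part of HiCExplorer; the two example rows are the ones shown in the track comments. Averaging the columns at the end is only for illustration.

```python
def parse_bedgraph_matrix(lines):
    """Parse bedgraph-matrix lines into (chrom, start, end, scores) tuples.

    Hypothetical helper for illustration; not a HiCExplorer function.
    """
    records = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Everything after the third column is a score for this bin
        scores = [float(v) for v in fields[3:]]
        records.append((chrom, start, end, scores))
    return records


# The two example rows from the track comments above
example = [
    "chrX\t18279\t40131\t0.399113\t0.364118\t0.320857\t0.274307",
    "chrX\t40132\t54262\t0.479340\t0.425471\t0.366541\t0.324736",
]

records = parse_bedgraph_matrix(example)
for chrom, start, end, scores in records:
    mean_score = sum(scores) / len(scores)
    print(chrom, start, end, round(mean_score, 4))
```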

I notice that your Hi-C contact matrix has negative values. The hicFindTADs program may not work correctly in this case, as it expects positive values. Also, the second matrix in your plot (the lower resolution one) has some bins that probably should have been filtered out because they seem to have very few reads. How did you generate your Hi-C matrices?

apaytuvi commented 7 years ago

Hi @fidelram

I've plotted the scores:

[image: test]

but it's hard to interpret it. What do these lines in the plot mean?

On the other hand, I checked my matrices with np.min(), and there are no negative values. If you are saying that based on the legend of the plot, it is because I am plotting with -log (due to https://github.com/maxplanck-ie/HiCExplorer/issues/55), so the plotted values are negative.
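As a small illustration of why the legend can show negative numbers even when the matrix itself has none, the sketch below uses made-up values and the standard library only. Any value above 1 becomes negative under -log, whereas a log1p transform of non-negative values never does.

```python
import math

# Made-up, non-negative contact values (illustration only)
counts = [0.5, 1.0, 20.0, 350.0]

# -log transform: values above 1 become negative
neg_log = [-math.log(c) for c in counts]

# log1p transform: log(1 + c) is never negative for c >= 0
log1p = [math.log1p(c) for c in counts]

print(min(counts) >= 0)
print([round(v, 3) for v in neg_log])
print([round(v, 3) for v in log1p])
```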

Regarding the thresholds used, below is the diagnostic plot for the lowest resolution:

[image: ago1_20000]

I corrected with -t -1.42134044256 6. This -1.42134044256 was the value given by hicCorrectMatrix diagnostic_plot.
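For readers unfamiliar with this kind of threshold, here is a rough sketch of filtering bins by a lower and an upper cutoff. It assumes the two values passed with -t are cutoffs on a MAD-based modified z-score of per-bin coverage; the exact statistic that hicCorrectMatrix computes may differ, and the function names and coverage values below are made up for illustration.

```python
import statistics


def modified_zscores(values):
    """MAD-based modified z-score (an assumed stand-in, not HiCExplorer code)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def keep_bins(coverage, lower, upper):
    """Return indices of bins whose score lies within [lower, upper]."""
    scores = modified_zscores(coverage)
    return [i for i, s in enumerate(scores) if lower <= s <= upper]


# Made-up per-bin coverage: two nearly empty bins and one extreme bin
coverage = [5, 90, 100, 110, 95, 105, 2, 100, 98, 5000]

kept = keep_bins(coverage, lower=-1.2, upper=4)
print(kept)
```

Under this assumption, raising the lower cutoff toward zero is more stringent, since more low-coverage bins fall below it and are removed.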

fidelram commented 7 years ago

Hi,

  1. In your TADs plot, the TAD-separation score is not fully visible. You need to adjust the max_value of the track or remove it. Once you have the full track, you can identify boundaries as the local minima of the plot.

  2. I would recommend correcting with -1.2. Usually, the hicCorrectMatrix diagnostic plot is bimodal, but this is not the case for your data. Since some bins were not properly removed, you may want to correct using a more stringent value. In the plot, the dotted lines are drawn at 0.2 intervals.
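The idea in point 1, reading boundaries off the local minima of the TAD-separation score, can be sketched as follows. This is only the bare idea with made-up scores; hicFindTADs applies further criteria before calling a boundary, and `local_minima` is a hypothetical helper.

```python
def local_minima(scores):
    """Return indices whose score is strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]


# Made-up TAD-separation scores, one per bin (illustration only)
scores = [0.40, 0.35, 0.20, 0.33, 0.45, 0.42, 0.15, 0.30, 0.38]

boundaries = local_minima(scores)
print(boundaries)
```

If the track is clipped by max_value, the minima are still there, but the surrounding peaks are cut off, which is why the full track is easier to read.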

apaytuvi commented 7 years ago

Hi @fidelram, thank you. I corrected again with -1.2 and 4. I've plotted (without transformation) my two resolutions with the corresponding TADs:

[image: plots_tad]

I'm not sure how reliable the TADs are, since the boundaries barely match between the two resolutions.

So would you use the one with the lowest resolution?

fidelram commented 7 years ago

Your Hi-C does not look very good. Compare the same region in our chorogenome browser:

http://chorogenome.ie-freiburg.mpg.de:5002/#browser/1:20000000-22000000

There, the TAD score clearly separates the TADs.

apaytuvi commented 7 years ago

Is this log-transformed data or just the normalized counts?

Maybe I don't have enough reads to go to this resolution?

fidelram commented 7 years ago

That's log1p-transformed data. How many reads do you have?

apaytuvi commented 7 years ago

Approx. 75,000,000 pairs of reads

fidelram commented 7 years ago

Did you process the reads using HiCExplorer? I would be curious to know the QC for this dataset. 75 million reads is not much for a Hi-C experiment but, nevertheless, I would expect better Hi-C contact matrices than the ones you have.

If you do not have the QC values, you can get an idea of the quality of the data by running hicBuildMatrix with the parameter --doTestRun.