deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Tad calling parameters help #445

Closed · nyuhic closed this issue 4 years ago

nyuhic commented 4 years ago

Hi,

I am using hicFindTADs to call TADs at 50 kb resolution on my data. I then compare TAD boundaries between my treatment and control conditions by checking (with bedtools) how many boundaries overlap with each other.

The problem I am having is that even when I compare TAD boundaries between biological or technical replicates, I get a large number of boundaries (~25%) that don't overlap and are unique to each replicate. To make the TAD calling more conservative, I have tried delta values from 0.01 to 0.15 and FDR values from 0.001 to 0.05, but I still get a lot of variation between replicates. I have even tried extending each boundary by 100 kb on each side before comparing with bedtools to see if that reduces the variation among replicates, but it has little effect.
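For reference, a sketch of the call I am running (file names are placeholders, and the depth parameters are just one setting I tried; --delta and --thresholdComparisons are the parameters I have been varying):

```shell
# 50 kb matrix; delta / FDR threshold swept over the ranges described above
hicFindTADs -m treatment_50kb.cool \
    --outPrefix treatment_50kb_tads \
    --minDepth 150000 \
    --maxDepth 500000 \
    --step 50000 \
    --delta 0.05 \
    --thresholdComparisons 0.01 \
    --correctForMultipleTesting fdr \
    -p 8
```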

I was wondering if you have any recommendations on which parameters have the biggest effect on TAD calling, to make it more conservative and reduce the variation between replicates?

Thanks

joachimwolff commented 4 years ago

Hi,

have you normalized your data?

Best,

Joachim

nyuhic commented 4 years ago

Thanks for the reply. The original raw matrices were in .hic format, which I converted from hic to cool, and then from cool to cool applying the KR correction.
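Roughly, the conversion I ran looks like this (file names are placeholders; this is my understanding of the documented hicConvertFormat usage, and the exact output naming may include a resolution suffix):

```shell
# .hic -> cool with raw counts; a resolution must be chosen for .hic input
hicConvertFormat -m sample.hic --inputFormat hic --outputFormat cool \
    --resolutions 50000 -o sample_raw.cool

# cool -> cool, applying the KR correction factors stored in the file
hicConvertFormat -m sample_raw.cool --inputFormat cool --outputFormat cool \
    --correction_name KR -o sample_KR.cool
```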

gtrichard commented 4 years ago

Hello,

Can you compare the insulation score profiles with hicPlotTADs? Sometimes this is a good way to validate that the Hi-C signal is actually very similar between replicates.

You can also plot log2ratio matrices (hicCompareMatrices then hicPlotMatrix) to see if the signal is equivalent or not.

As Joachim posted, normalization can help (through hicNormalize, or FASTQ downsampling if the difference of coverage is huge between your samples) to call TAD boundaries in a more reproducible manner.

joachimwolff commented 4 years ago

Please do the following:

If you cannot make sure the value ranges are equal, it is quite likely that you will get different boundaries between replicates.

nyuhic commented 4 years ago

Thanks a lot! I will try this and update.

nyuhic commented 4 years ago

I did the normalization on cool matrix followed by hicCorrectMatrix but unfortunately it didn't make a difference to the variation among replicates. I will also check the signal with hicPlotTADs and hicCompareMatrices.

Also I noticed that when I convert the hic matrix to cool or normalize/correct it I get a warning and was wondering if it could be a problem:

WARNING:hicmatrix.lib.cool:Writing non-standard cooler matrix. Datatype of matrix['count'] is: float64

rachadele commented 4 years ago

Hi, I have a similar question regarding TAD boundaries. I'm trying to compare boundaries across individuals of the same species, which, as I understand the literature, should be pretty similar; however, the percent intersection I find with bedtools jaccard is only around 15%. I called TADs from .mcool files that had been normalized using cooler balance. I did not normalize or correct the mcool files using hicCorrectMatrix or hicNormalize, and I am unclear on whether to use either on a cooler file that has already been balanced. I am aware that there have been some issues with multiplicative cooler normalization vs. divisive h5 normalization. What is the best way for me to normalize and correct my files?
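For reference, the comparison I am running is essentially this (hypothetical file names):

```shell
# bedtools jaccard requires both inputs to be sorted the same way
sort -k1,1 -k2,2n indiv1_boundaries.bed > indiv1_sorted.bed
sort -k1,1 -k2,2n indiv2_boundaries.bed > indiv2_sorted.bed

# Reports intersection, union, and the jaccard statistic
bedtools jaccard -a indiv1_sorted.bed -b indiv2_sorted.bed
```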

LeilyR commented 4 years ago

The difficulty with TAD calling is that it is very parameter dependent. I would highly recommend visualizing your data and checking both the boundaries you called and those you compared your boundaries against. You might need to tweak the parameters quite a bit to match your data. For example, you want to avoid seeing very tiny domains or unusually big ones; this can be improved by adjusting the parameters. The file format shouldn't matter.

gtrichard commented 4 years ago

If it is a matter of QC, please use the Zscore for each sample calculated by hicFindTADs and compare them using for instance multiBigwigSummary from deeptools (https://deeptools.readthedocs.io/en/develop/content/tools/multiBigwigSummary.html) followed by plotCorrelation (https://deeptools.readthedocs.io/en/develop/content/tools/plotCorrelation.html).
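A sketch of that QC (file names are placeholders; hicFindTADs writes the score as a bedGraph, so I am assuming a conversion to bigWig first, e.g. with UCSC bedGraphToBigWig):

```shell
# Convert the hicFindTADs score bedGraphs to bigWig (inputs must be sorted)
bedGraphToBigWig sample1_score.bedgraph chrom.sizes sample1_zscore.bw
bedGraphToBigWig sample2_score.bedgraph chrom.sizes sample2_zscore.bw

# Summarize both z-score tracks in genome-wide bins
multiBigwigSummary bins \
    -b sample1_zscore.bw sample2_zscore.bw \
    -o zscore_summary.npz

# Correlate the samples; dots far off the diagonal hint at differential loci
plotCorrelation -in zscore_summary.npz \
    --corMethod pearson --whatToPlot scatterplot \
    -o zscore_correlation.png
```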

Alternatively you can plot the different samples Hi-C tracks and their respective hicFindTADs Zscore on top of each other using hicPlotTADs and check the overall similarity of signals distribution. It is also a good way to check for "differential" loci that can be identified from comparing the Zscores (dots outside of the diagonal on the plotCorrelation plots).

The overlap of the exact positions of TAD boundaries rarely gives a meaningful measure of the reproducibility of TAD calling because, more often than not, boundary positions are shifted by 1 to 2 bins between samples due to how hicFindTADs works. Z-scores are much more robust because they depend on fewer parameters and are less stochastic.

The best approach is, as I said, to use the z-score between samples; and if you want a consensus on the TAD boundary positions, the simplest is to merge all replicates (if applicable) and tweak the parameters until the TAD distribution is satisfying enough for given regions... (hence "by eye").

On a side note, since "TADs" are often nested structures, there is no "ground truth" in their definition (until you perfectly know the underlying mechanisms that define them); i.e., different sets of parameters will give a satisfying TAD distribution depending on the resolution of the matrix. Hence it can be a good idea to define different sets of TADs (from large to small "TADs") and check whether your hypothesis still stands independently of the TAD set used.

rachadele commented 4 years ago

Thank you, this was very helpful! I'd still like to know whether cooler balance is sufficient for normalizing/correcting my data before calling TADs, and if not, what my procedure should be for using hicNormalize and/or hicCorrectMatrix.

Essentially, I have already balanced all my mcool files and would like to know whether I should work with them as-is.

gtrichard commented 4 years ago

TAD calling and the scale (not the distribution) of the z-scores are affected by sequencing depth.

If you have slight differences of coverage between your samples, hicNormalize should be enough. The balancing / correction of the matrix doesn't help with coverage differences (these corrections modify the distribution of contacts, not the amount of contacts).

If you have important differences of coverage between your samples (fold change > 2), it might be a better idea to down-sample your FASTQ files by extracting a given amount of random reads. This is because matrices with more coverage will be much less sparse than matrices with low coverage, and this can result in differences at multiple levels of the Hi-C data analysis (hicFindTADs and hicAggregateContacts, for instance).

rachadele commented 4 years ago

Thank you! Just to be clear: can I use hicNormalize on cooler files that have already been corrected with cooler balance?

gtrichard commented 4 years ago

Yes and no: it's better to use hicNormalize on uncorrected matrices. But I think that cooler balance stores its correction factors in separate columns (it doesn't modify the values of the matrix per se), so hicNormalize should work on the uncorrected matrix of a cool file. But the normalization may change the correction factors if you run cooler balance after hicNormalize... @joachimwolff, aka the cool expert :smile: , can you confirm?

So it's always better to first hicNormalize and then hicCorrectMatrix. You can play a bit with the cooler balance before / after hicNormalize and compare the results to hicNormalize > hicCorrectMatrix (which is the intended use).

I don't recommend the 0-to-1 normalization, since many tools will fail to work on those contact values (hicPCA, for instance, if I remember correctly); please use "smallest" when using hicNormalize.
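So the intended order, sketched with placeholder file names (the correction method is assumed to be KR here, to match the matrices discussed above):

```shell
# 1) Normalize both samples together down to the smallest total read count
hicNormalize -m control.cool treatment.cool \
    --normalize smallest \
    -o control_norm.cool treatment_norm.cool

# 2) Then correct each normalized matrix
hicCorrectMatrix correct -m control_norm.cool \
    --correctionMethod KR -o control_norm_corr.cool
hicCorrectMatrix correct -m treatment_norm.cool \
    --correctionMethod KR -o treatment_norm_corr.cool
```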

LeilyR commented 4 years ago

If you want to use hicNormalize because your control sample is much deeper than your treatment sample, I would suggest calling TADs on the non-normalized but corrected control matrix to get, let's say, the 'real' TAD boundaries. You can then call TADs on the normalized + corrected matrices to see the difference, but I think in that case you should use the 'real' ones as a reference. @gtrichard What do you think of this approach?

gtrichard commented 4 years ago

I agree with @LeilyR , to get your consensus TAD boundaries position, the more contacts available the better. So it's definitely the way to go:

1) Call TADs on unnormalized corrected control matrix to get the BED file of TAD boundaries: control_TADs.bed.

2) Normalize the control and the treated samples matrices, correct the matrices, and call TADs again, but this time, try to plot the newly computed Zscores norm_control_zscore.bw and norm_treatment_zscore.bw around the previously called control_TADs.bed using deeptools computeMatrix and plotHeatmap. With some clustering you should be able to spot "lost" TAD boundaries, if there are any.
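Step 2 can be sketched with deeptools like this (the ±200 kb window is just an example; file names follow the placeholders above):

```shell
# Stack the two z-score tracks around the control boundaries (±200 kb here)
computeMatrix reference-point \
    -S norm_control_zscore.bw norm_treatment_zscore.bw \
    -R control_TADs.bed \
    --referencePoint center -b 200000 -a 200000 \
    -o boundaries_matrix.gz

# Cluster to spot boundary classes ("kept" vs potentially "lost")
plotHeatmap -m boundaries_matrix.gz --kmeans 3 -o boundaries_heatmap.png
```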

For "newly formed" TAD boundaries (which are pretty rare, but it depends on your experiments), you can try to compare the norm_control_zscore.bw and norm_treatment_zscore.bw values and extract the specific spots of the genome where the value in norm_treatment_zscore.bw is way below norm_control_zscore.bw (the lower the value, the more likely a TAD boundary is present). You can then verify these spots using pyGenomeTracks by creating a bed file of all the regions you want to verify. pyGenomeTracks will create one image for each spot, and thus you can check by eye if the topology is different enough or not: https://github.com/deeptools/pyGenomeTracks .

You can apply a similar method for "lost TADs" too, by extracting the coordinates of the spots with high values in norm_treatment_zscore.bw compared to norm_control_zscore.bw ...

The difference of the two bigwigs should give a nice, informative track highlighting potential differential TAD boundaries. Ranking that bigwig from lowest to highest should give you "lost" and "newly acquired" TADs. Most of the time it's more like "mostly affected" and "not affected", but that depends on your genome topology phenotype.
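The difference track itself can be computed with deeptools, for instance:

```shell
# Subtract the control z-score from the treatment z-score; strongly negative
# regions are candidate "newly formed" boundaries, strongly positive ones
# candidate "lost" boundaries
bigwigCompare -b1 norm_treatment_zscore.bw -b2 norm_control_zscore.bw \
    --operation subtract -o zscore_diff.bw
```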

rachadele commented 4 years ago

Hi! Thanks so much for your help. I did not get different boundaries after normalizing the matrices, but visualizing my control/treatment samples with pyGenomeTracks was a great help. Is there any other method you would recommend to get more consistent boundaries across samples? I was thinking about trying to extract raw values from the cooler, then normalizing and re-correcting them, but I'm not sure how to do this.

gtrichard commented 4 years ago

I don't know if what you propose would change something or not, but I would convert the unbalanced coolers to h5 with hicConvertFormat then hicNormalize and hicCorrectMatrix.

Normalization might have an effect only if the difference of coverage is rather important between samples... But if it's too important (Fold Change of the sum of contacts > 2 between samples), then it's better to downsample the high coverage FASTQ to the level of the least covered FASTQ and build the matrices from scratch from those downsampled FASTQs.
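One way to do that downsampling (seqtk is just one option, not something prescribed by HiCExplorer; the fixed -s seed keeps R1/R2 pairs in sync):

```shell
# Sample e.g. 50 M read pairs; the SAME -s seed must be used for R1 and R2
seqtk sample -s100 high_cov_R1.fastq.gz 50000000 | gzip > sub_R1.fastq.gz
seqtk sample -s100 high_cov_R2.fastq.gz 50000000 | gzip > sub_R2.fastq.gz
```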

But as I said earlier: get your consensus boundaries from the control and compare the zscore / insulation score (same thing) between control and treatment globally to see if you really have TAD distribution modifications and validate them with pyGenomeTracks. This is the best way to avoid some spurious TAD detection between samples...

bedtools fisher on TAD boundaries extended by ±1 bin can also help you, but I'm not sure it's the best way to check TAD calling consistency.
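Sketched like this (genome file and bin size are placeholders; 50 kb matches a 50 kb matrix, i.e. one bin per side):

```shell
# Extend each boundary by one bin (50 kb) on each side, then test the overlap
bedtools slop -i control_boundaries.bed -g genome.chrom.sizes -b 50000 \
    | sort -k1,1 -k2,2n > control_ext.bed
bedtools slop -i treatment_boundaries.bed -g genome.chrom.sizes -b 50000 \
    | sort -k1,1 -k2,2n > treatment_ext.bed

# bedtools fisher needs sorted inputs and the genome file
bedtools fisher -a control_ext.bed -b treatment_ext.bed -g genome.chrom.sizes
```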

You also have to remember that Hi-C is not only about TADs, it's also about A/B compartments, contacts distribution (hicPlotDistVsCounts) and contacts aggregation for instance. TADs are somewhat boring since they rarely change :smiley:

rachadele commented 4 years ago

When I try to plot my .mcool files or z-score .h5 files using either hicPlotTADs or pyGenomeTracks, I face the same issue as https://github.com/deeptools/pyGenomeTracks/issues/107. Since my files seem to be in an accepted format, I don't see why they can't be plotted. I've already successfully visualized this matrix in HiGlass.

gtrichard commented 4 years ago

Regarding the z-score, I am talking about the BedGraph file produced by hicFindTADs. What do you mean by a z-score matrix? I don't think we currently have a tool to officially produce those; only obs/exp and Pearson.

For the mcool, I think you need to give the resolution you would like to use, e.g. matrices.mcool::/resolutions/50000

You can try using hicPlotMatrix first to see if it works.
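For example (region and resolution are placeholders):

```shell
# Select the 50 kb resolution inside the mcool via the ::/resolutions/ path
hicPlotMatrix -m matrices.mcool::/resolutions/50000 \
    --region chr1:1000000-5000000 --log1p \
    -o matrix_check.png
```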

rachadele commented 4 years ago

I can plot the matrix fine on the command line but when I try to make a tracks.ini file with my matrix, neither hicPlotTADs nor pyGenomeTracks can find the cooler (even when I specify the resolution).

LeilyR commented 4 years ago

@rachadele do you still have problems using pgt to plot your matrices, or is your problem solved? Please open a separate issue if the problem is still there.

LeilyR commented 4 years ago

@nyuhic I assume you got your answer about suitable parameters for TAD calling. Just one last thing about the warning you get: you can safely ignore it. I am closing this issue now, but feel free to reopen it if there are still any questions about TAD calling.