Pairwise and multi-datasets A/B compartment calling with different resolutions

Hi, thanks for the effort your team has put into making dcHiC. It's really great to have a method that supports comparison across multiple datasets. I have a few questions and would greatly appreciate your insights.

In the differential compartment analysis results (DifferentialResult), I mostly check the sample group's .bedGraph file (differential.intra_sample_group) to get the significant differential bins for each group. Following the previous issue #53, I took the pcOri value to define the A(+)/B(-) compartment for each bin. This approach has enabled me to easily identify significant A/B transitions or matching compartment patterns from the results.

When extending this analysis to multiple datasets or stages, I'm curious about which of the stages exhibit significant changes, which may be hard to tell from the differential compartment results alone. I plan to run pairwise comparisons between neighboring stages and extract the pairwise differential results. I think these can help to identify the pivotal stages of the transition and to calculate the proportion of significant/non-significant flipping or matching transitions. Does this method seem reasonable for multiple-dataset analysis?

My second question concerns the choice of resolution (chromosome bin size) for the dcHiC analysis. In the Nat. Commun. paper, you show the results of down-sampling at various resolutions to assess the impact of different sequencing depths and evaluate the potential false-positive rate. These results are quite important for us in determining an appropriate resolution for our analysis.

In your paper, you mentioned that higher-resolution differential analysis may be more prone to false positives, and Supplementary Table S8 shows that there were no false positives among samples at 100Kb resolution with >300M depth (60% down-sampling). The data I am working with has a read depth of approximately 55M per replicate and 100M~250M for all replicates in one sample group combined. Taking this into account, I initially performed my analysis at 100Kb and 200Kb resolutions. Furthermore, in Supplementary Table S5, only a few differential compartments are identified between pseudo-replicates at the 10% down-sampling rate (sequencing depth 500M --> 50M) at high resolutions/small bin sizes (25Kb and 10Kb). This raises the question of whether a higher resolution, such as 10Kb, could be appropriate for my dcHiC analysis.

In my current results at 100Kb and 200Kb resolutions, I found resolution-specific regions (detectable only at the higher or only at the lower resolution) that align with the discussion in your paper (Supplementary Fig. S6). Moreover, certain regions even display an opposite pcOri sign (opposite A/B calls).

While I comprehend that one of the reasons behind this phenomenon is the averaging of interactions from neighboring bins due to the use of coarse resolutions, resulting in inconsistent A/B compartment assignments, I remain uncertain about whether a higher resolution would indeed offer increased reliability.

This issue has made it challenging for me to decide which result is better, particularly given the significance of classifying regions into A or B status, which serves as an indicator of active and inactive chromatin. I believe integrating additional omics data may help, but sometimes it also makes things more complicated.

Learning from your prior experiences and benefiting from your advice in dealing with resolution-specific results and managing inconsistencies arising from different resolutions would be highly valuable. Your insights and guidance on this matter would be greatly appreciated.

Hi, thanks for your support. Please see my replies below.

In the differential compartment analysis results (DifferentialResult), I mostly check the sample group's .bedGraph file (differential.intra_sample_group) to get the significant differential bins for each group. Following the previous issue https://github.com/ay-lab/dcHiC/issues/53, I took the pcOri value to define the A(+)/B(-) compartment for each bin. This approach has enabled me to easily identify significant A/B transitions or matching compartment patterns from the results.

Yes, you should look at the pcOri value to define compartments.
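
As a rough illustration of the sign rule above, here is a minimal Python sketch that labels each bin by the sign of its pcOri value and tallies transitions between two stages. The row layout, the example values, and the `PADJ_CUTOFF` threshold are assumptions made for illustration only, not the actual column layout of the dcHiC output files.

```python
from collections import Counter


def compartment(pc_ori):
    """Call 'A' for a positive pcOri value, 'B' otherwise (zero counts as B in this sketch)."""
    return "A" if pc_ori > 0 else "B"


def transition(pc_first, pc_second):
    """Label the change between two stages, e.g. 'A->B' or 'B->B'."""
    return f"{compartment(pc_first)}->{compartment(pc_second)}"


# Hypothetical rows: (pcOri in stage 1, pcOri in stage 2, adjusted p-value)
rows = [
    (1.2, -0.8, 0.01),   # flip, significant
    (0.9, 1.6, 0.03),    # stays in A, significant change in score
    (-0.4, 0.2, 0.60),   # flip, not significant
    (-1.1, -1.0, 0.90),  # stays in B, not significant
]

PADJ_CUTOFF = 0.1  # assumed threshold; use whatever cutoff your analysis applies

# Count bins per (transition label, significant?) combination
counts = Counter((transition(a, b), padj < PADJ_CUTOFF) for a, b, padj in rows)

for (label, significant), n in sorted(counts.items()):
    print(label, "significant" if significant else "not significant", n)
```

Dividing each count by the total number of bins gives the proportion of significant and non-significant flipping or matching transitions.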

When extending this analysis to multiple datasets or stages, I'm curious about which of the stages exhibit significant changes, which may be hard to tell from the differential compartment results alone. I plan to run pairwise comparisons between neighboring stages and extract the pairwise differential results. I think these can help to identify the pivotal stages of the transition and to calculate the proportion of significant/non-significant flipping or matching transitions. Does this method seem reasonable for multiple-dataset analysis?

You can do that too, but then you will have multiple comparisons, each with its own p-values, and a single pairwise comparison may sometimes lack the power to detect a change. I think an easier option is to calculate the Z-score of the bin-wise compartment scores across samples and then use a Z-score cutoff to find the exact outliers. By doing so, you can have a single p-value and identify the outlier samples.
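
The Z-score idea can be sketched as follows. This is a minimal example, assuming you already have one compartment score per sample for a differential bin; the stage names, the scores, and the cutoff of 1.5 are made up for illustration.

```python
from statistics import mean, stdev


def bin_zscores(scores):
    """Z-score each sample's compartment score against all samples for one bin."""
    mu = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sd for s in scores]


def outlier_samples(scores_by_sample, cutoff):
    """Return the names of samples whose |Z| reaches the cutoff for this bin."""
    names = list(scores_by_sample)
    zs = bin_zscores([scores_by_sample[n] for n in names])
    return [n for n, z in zip(names, zs) if abs(z) >= cutoff]


# Hypothetical compartment scores of one bin across five stages
bin_scores = {
    "stage1": 1.0,
    "stage2": 1.1,
    "stage3": 0.9,
    "stage4": 1.0,
    "stage5": -1.0,
}

print(outlier_samples(bin_scores, cutoff=1.5))
```

One caveat when choosing the cutoff: with the sample standard deviation and n samples, |Z| can never exceed (n - 1) / sqrt(n), which is about 1.79 for five samples, so a cutoff such as 2 would never flag anything in a five-stage series.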

My second question concerns the choice of resolution (chromosome bin size) for the dcHiC analysis. In the Nat. Commun. paper, you show the results of down-sampling at various resolutions to assess the impact of different sequencing depths and evaluate the potential false-positive rate. These results are quite important for us in determining an appropriate resolution for our analysis. In your paper, you mentioned that higher-resolution differential analysis may be more prone to false positives, and Supplementary Table S8 shows that there were no false positives among samples at 100Kb resolution with >300M depth (60% down-sampling). The data I am working with has a read depth of approximately 55M per replicate and 100M~250M for all replicates in one sample group combined. Taking this into account, I initially performed my analysis at 100Kb and 200Kb resolutions. Furthermore, in Supplementary Table S5, only a few differential compartments are identified between pseudo-replicates at the 10% down-sampling rate (sequencing depth 500M --> 50M) at high resolutions/small bin sizes (25Kb and 10Kb). This raises the question of whether a higher resolution, such as 10Kb, could be appropriate for my dcHiC analysis.

First, I would suggest combining the replicates per group and performing the dcHiC analysis on the 100-250M depth data. A higher correlation of replicate compartment scores within a group than across groups is good enough to justify this approach. Otherwise, with only 55M depth, you will probably lose important differential regions. Second, with 100-250M depth, I would not be comfortable doing a compartment analysis at 10Kb resolution unless I had an independent way to validate what I am looking at. For example, you can use ChIP-seq or ATAC-seq data to validate/support your claim. From personal experience, I would be comfortable doing a 40/50Kb dcHiC analysis at that depth, but not at a finer resolution than that.

See also this paper, which likewise suggests analyzing 100M Hi-C data at 40Kb resolution: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347522/
"In our experience, an adequately complex Hi-C dataset for the human genome with roughly 100 million mapped / valid junction reads, is sufficient to support a 40kb data resolution. Data below 40kb may be usable, though it will suffer from a higher level of noise. It is important to note that effective resolution scales with genomic distance, such that short-range interactions will typically have higher coverage and thus higher effective resolution."
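
For the replicate check mentioned above (correlation within a group versus across groups), a minimal sketch could look like the one below. It assumes you have per-replicate compartment scores over the same ordered set of bins; the replicate names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long score vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def within_vs_across(scores, groups):
    """Mean pairwise correlation within groups and across groups."""
    within, across = [], []
    for a, b in combinations(scores, 2):
        r = pearson(scores[a], scores[b])
        (within if groups[a] == groups[b] else across).append(r)
    return sum(within) / len(within), sum(across) / len(across)


# Hypothetical compartment scores per replicate over the same five bins
scores = {
    "g1_rep1": [1.0, 0.8, -0.5, -1.0, 0.3],
    "g1_rep2": [0.9, 0.9, -0.6, -0.9, 0.2],
    "g2_rep1": [1.0, -0.7, -0.5, 0.8, 0.3],
    "g2_rep2": [1.1, -0.8, -0.4, 0.9, 0.2],
}
groups = {"g1_rep1": "g1", "g1_rep2": "g1", "g2_rep1": "g2", "g2_rep2": "g2"}

mean_within, mean_across = within_vs_across(scores, groups)
print(f"within-group r = {mean_within:.3f}, across-group r = {mean_across:.3f}")
```

If the within-group mean is clearly higher than the across-group mean, that supports pooling the replicates of each group before running the analysis.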

In my current results at 100Kb and 200Kb resolutions, I found resolution-specific regions (detectable only at the higher or only at the lower resolution) that align with the discussion in your paper (Supplementary Fig. S6). Moreover, certain regions even display an opposite pcOri sign (opposite A/B calls). While I comprehend that one of the reasons behind this phenomenon is the averaging of interactions from neighboring bins due to the use of coarse resolutions, resulting in inconsistent A/B compartment assignments, I remain uncertain about whether a higher resolution would indeed offer increased reliability.

I suggest sticking to the 100Kb resolution and performing another dcHiC analysis at 40/50Kb resolution. You can then keep the differential compartments found at both resolutions. This is a very conservative approach; if you like, you can also try 10Kb resolution and take the overlapping set. Effect of surrounding bins
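
Taking the set found at both resolutions can be sketched as a simple interval overlap. This is a minimal example, assuming the significant differential bins at each resolution are available as (chromosome, start, end) tuples; the coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def supported_bins(coarse, fine):
    """Keep coarse-resolution bins overlapped by at least one fine-resolution bin."""
    return [c for c in coarse if any(overlaps(c, f) for f in fine)]


# Hypothetical significant differential bins at 100Kb and 50Kb
diff_100kb = [
    ("chr1", 0, 100_000),
    ("chr1", 300_000, 400_000),
    ("chr2", 0, 100_000),
]
diff_50kb = [
    ("chr1", 50_000, 100_000),
    ("chr1", 400_000, 450_000),  # adjacent to, but not overlapping, chr1:300000-400000
    ("chr2", 50_000, 100_000),
]

print(supported_bins(diff_100kb, diff_50kb))
```

This quadratic scan is fine for a few thousand bins; for larger sets, a sorted sweep or a dedicated interval tool would be the better choice. You could also make the rule stricter, for example by requiring the same pcOri sign at both resolutions.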

This issue has made it challenging for me to decide which result is better, particularly given the significance of classifying regions into A or B status, which serves as an indicator of active and inactive chromatin. I believe integrating additional omics data may help, but sometimes it also makes things more complicated.

Yes, the addition of other data makes things a bit more complicated. You can also try a non-PCA-based assignment of compartments (e.g. CscoreTool) as an independent method, and you can run dcHiC on the compartment scores from CscoreTool as well. Then you can compare the PCA-based and non-PCA-based compartment assignments and their differential results to analyze your data.
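
One simple way to compare the two assignments is the fraction of bins on which they agree in sign. The sketch below assumes both score vectors cover the same bins in the same order and that both have already been oriented so that positive means A; the values are hypothetical.

```python
def sign_agreement(pca_scores, alt_scores):
    """Fraction of bins where both methods give the same A/B call (zeros are skipped)."""
    pairs = [(p, c) for p, c in zip(pca_scores, alt_scores) if p != 0 and c != 0]
    same = sum((p > 0) == (c > 0) for p, c in pairs)
    return same / len(pairs)


# Hypothetical per-bin scores from the PCA-based and the non-PCA-based method
pca_based = [1.0, 0.5, -0.3, -0.9, 0.2]
non_pca_based = [0.8, -0.1, -0.4, -0.7, 0.3]

print(f"A/B agreement: {sign_agreement(pca_based, non_pca_based):.2f}")
```

Bins where the two methods disagree are natural candidates for a closer look with the additional omics data.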

ay-lab / dcHiC

Pairwise and multi-datasets A/B compartment calling with different resolutions #74