ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License

Question about multiple comparison method #15

Closed (kalavattam closed 3 years ago)

kalavattam commented 3 years ago

Hi, can you help me better understand the multiple-comparison method used by dcHiC? Specifically, I would like to know how a single p-value is derived for each genomic bin when more than two Hi-C datasets are analyzed.

From what I understand, the Hi-C maps are first concatenated, and Multiple Factor Analysis is then performed on the concatenated map. So, for example, an analysis of four biological samples yields four partial factor scores for each bin. Is that correct?

These scores are used to derive a multivariate distance measure, the Mahalanobis distance, which detects outliers in the scores across all samples. In the example analysis of four biological samples, there would be four score variables per bin. Is this correct? If one of those is detected as an outlier, its significance is then calculated using the weighted distance and the critical value of the chi-square distribution. Is that correct?

I appreciate any help you can provide. Thank you!

ay-lab commented 3 years ago

Hi there,

Thank you for this question! You have captured the essence well. dcHiC first uses MFA to normalize the samples (taking the partial factor scores) and then uses two multivariate statistical measures for significance calling. The first of these is the Mahalanobis distance, an outlier-detection measure that can be thought of as a multivariate z-score (chi-squared distributed). This produces a set of "naive" p-values. The application then boosts detection power with a biological replicate variability score, which measures how much variability there is in a particular set of PC values compared to the variability between PC values of biological replicates. These parameters are inferred by dcHiC from user-designated replicates, or they can be taken from a pre-trained file that we provide. The Mahalanobis distance p-values and the biological replicate variability score are combined using Independent Hypothesis Weighting (IHW, a weighted Benjamini-Hochberg procedure) to give the final results.
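To make the two steps concrete, here is a minimal Python sketch of the general technique. It is not dcHiC's actual code. The function names (`chi2_sf`, `mahalanobis_pvalues`, `weighted_bh`) are hypothetical, and the sketch rests on several assumptions: distances are measured from the centroid of all bins using the sample covariance, the degrees of freedom equal the number of samples, and the per-bin weights are supplied by the caller rather than learned from the replicate variability score as IHW would do.

```python
import math
import numpy as np


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df.

    Uses the closed-form series for integer degrees of freedom so that
    only the standard library is needed.
    """
    half = max(x, 0.0) / 2.0
    if df % 2 == 0:
        terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(half ** (j - 0.5) / math.gamma(j + 0.5)
                for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def mahalanobis_pvalues(scores):
    """Return one "naive" p-value per bin.

    scores: array of shape (n_bins, n_samples), one score per sample
    for every bin (assumed layout, for illustration only).
    """
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates near-singular cov
    # Squared Mahalanobis distance of every bin from the centroid.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    df = scores.shape[1]  # assumption: df = number of samples
    return np.array([chi2_sf(v, df) for v in d2])


def weighted_bh(pvalues, weights):
    """Weighted Benjamini-Hochberg adjusted p-values.

    Weights must be positive; they are rescaled to average 1, each
    p-value is divided by its weight, and the usual BH step-up
    adjustment is applied to the result.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    if np.any(w <= 0):
        raise ValueError("weights must be positive")
    w = w / w.mean()
    q = np.minimum(p / w, 1.0)
    m = len(q)
    order = np.argsort(q)
    ranked = q[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adj_sorted, 1.0)
    return adjusted
```

With equal weights, `weighted_bh` reduces to the ordinary Benjamini-Hochberg adjustment. Larger weights make a bin easier to call significant, which is how a weighting scheme can raise power for bins that a covariate marks as more promising.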

I highly recommend checking out Figure 1 and the associated sections of our preprint for more details.