ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License

The difference between calculations for weighted center across replicates or samples #61

Closed: dangdachang closed this issue 1 year ago

dangdachang commented 1 year ago

Hello, I'm confused. According to your paper published in Nature Communications, the calculation of the weighted center should differ between replicates and samples in how it uses the cumulative normal distribution probability associated with the maximum z-score. But in the code, the two calculations are the same. Is something wrong? [image] Specifically, the code in dchicf.r at lines 767-769 and lines 782-784 is the same for replicates and samples.

ay-lab commented 1 year ago

Yes, you're right. The two segments look the same. Across samples, the weighted center gives more weight to points that are distant from the others (further from the diagonal) than to points that are close together in the multidimensional space (close to the diagonal). For replicates, the approach instead gives more weight to features that are close to each other within the replicates of a sample (close to the diagonal). Although the code looks the same and will give more weight to the points distant from the others (or maybe the opposite), the chi-square p-value calculated from the Mahalanobis distance corrects for this. The lower-tail p-value is estimated for replicates, while for samples it is changed to the upper tail (the lower.tail=T/F option in pchisq). This has the same effect as using (df * df_pval_max) instead of (df * (1 - df_pval_max)) for the sample positions and keeping the pchisq lower.tail option the same in both cases.
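To make the lower-tail versus upper-tail distinction concrete, here is a minimal sketch of the two chi-square tail probabilities. It is not dcHiC code: `pchisq_even_df` is a hypothetical helper that plays the role of R's `pchisq(x, df, lower.tail=...)` but only handles even degrees of freedom (where a closed form exists), and the distances and `df = 2` are made-up illustration values.

```python
import math


def pchisq_even_df(x, df, lower_tail=True):
    """Chi-square tail probability for positive EVEN df (closed form).

    Hypothetical stand-in for R's pchisq(x, df, lower.tail=...); the
    restriction to even df is an assumption made to avoid dependencies.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    k = df // 2
    # Upper tail for df = 2k: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    upper = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return 1.0 - upper if lower_tail else upper


# Made-up squared Mahalanobis distances: one small, one large.
small_d2, large_d2 = 0.2, 9.0

for d2 in (small_d2, large_d2):
    lower = pchisq_even_df(d2, df=2, lower_tail=True)
    upper = pchisq_even_df(d2, df=2, lower_tail=False)
    # The two tails always sum to 1, so they rank points in opposite order.
    print(f"d2={d2}: lower tail={lower:.4f}, upper tail={upper:.4f}")
```

The upper tail is small only for large distances and the lower tail is small only for small distances, so switching `lower.tail` changes which end of the distance range is reported as significant. It does not change the distances themselves.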

dangdachang commented 1 year ago

I appreciate your timely and detailed answer, but I still have some confusion.

  1. The original Mahalanobis distance uses the mean of all samples (that is, the mean of all rows in df), but your modified Mahalanobis distance uses the weighted center, which seems to use information from only one sample. What is the reason for this? [image]

  2. My understanding of dcHiC is that we want to pick out bins whose PC1 values differ across samples (away from the diagonal). dcHiC selects points with large Mahalanobis distances from the mean and uses a chi-square test to get their statistical significance (upper tail). The value of the Mahalanobis distance therefore has a specific meaning: when determining the weighted center, using (df * df_pval_max) leads to small Mahalanobis distances and using (df * (1 - df_pval_max)) leads to large Mahalanobis distances, which does not seem to be related to whether the lower or upper tail is used in pchisq. In other words, I think that when replicates are not considered and the focus is only on finding bins with significant differences between samples, we should use (df * (1 - df_pval_max)) to determine the weighted center and use the upper tail in pchisq. But this seems to differ from what is described in your paper. Is there a mistake in my understanding? [image] [image]

Thank you very much for your answer, and I look forward to receiving your reply.

ay-lab commented 1 year ago

For your first question, it uses the highest Z-score per Hi-C bin (i) across samples to calculate the weight. It tries to penalize all the points (Hi-C bins) that are distant from the others, based on that particular outlier Z-score from one sample.

For your second question, if you look at the pnorm call, I have set the lower.tail option to TRUE. The equation in the paper is written assuming the weight is calculated with lower.tail set to FALSE (so that a higher Z-score, i.e. the outlier, is penalized more), but in the code it was set to TRUE, so I have used (1 - df_pval_max) to calculate the weight. For example, if the Z-score is 2, then pnorm(2, lower.tail=T) is 0.9772499 and pnorm(2, lower.tail=F) is 0.02275013. I am then just using 1 - pnorm(2, lower.tail=T) to obtain the weight. So weight = pnorm(Zscore, lower.tail=F) = 1 - pnorm(Zscore, lower.tail=T). Depending on whether lower.tail is set to TRUE or FALSE, we can use either (df * (1 - df_pval_max)) or (df * df_pval_max) to calculate the weighted center. The Mahalanobis distance is ultimately related to the choice of lower or upper tail in pchisq in the sense of how we interpret it: a large Mahalanobis distance with the upper-tail pchisq gives a small p-value, meaning that the point is an outlier in the sample space.
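The tail identity used for the weight can be checked numerically. This is a small sketch, not the dcHiC implementation: the `pnorm` function below is a hypothetical Python stand-in for R's `pnorm(z, lower.tail=...)`, built on the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def pnorm(z, lower_tail=True):
    """Standard-normal tail probability (stand-in for R's pnorm)."""
    # The upper tail is computed via symmetry, P(Z > z) = P(Z < -z),
    # so it is obtained independently of the lower tail.
    return _STD_NORMAL.cdf(z) if lower_tail else _STD_NORMAL.cdf(-z)


z = 2.0
lower = pnorm(z, lower_tail=True)    # the thread quotes 0.9772499 from R
upper = pnorm(z, lower_tail=False)   # the thread quotes 0.02275013 from R
weight = 1.0 - lower                 # the (1 - df_pval_max) form

print(lower, upper, weight)
```

Here `weight` and `upper` agree up to floating-point rounding, which is the equivalence described above: taking `1 - p` of the lower-tail probability is the same as asking for the upper tail directly.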

dangdachang commented 1 year ago

Thanks a lot for answering my questions. Additionally, I have another question. The original Mahalanobis distance can be expressed as a sum of variables following a standard normal distribution. Thus, it can be regarded as a chi-squared statistic and tested for significance using a chi-squared test. However, in your modified version of the Mahalanobis distance, the mean of all data points has not been subtracted. Can it still be used as a chi-squared statistic and tested for significance using a chi-squared test?
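The concern in this question can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not dcHiC data or code: points are drawn from a 2-dimensional standard normal (identity covariance, true mean at the origin), so the squared Mahalanobis distance reduces to a squared Euclidean distance. All names and the shifted center `[1.0, 1.0]` are made up for illustration.

```python
import random


def squared_distance(point, center):
    """Squared Mahalanobis distance when the covariance is the identity."""
    return sum((x - c) ** 2 for x, c in zip(point, center))


def mean_squared_distance(center, dims=2, n=20000, seed=0):
    """Average squared distance of standard-normal points from `center`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        point = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        total += squared_distance(point, center)
    return total / n


# Measured from the true mean, the statistic is chi-square with 2 degrees
# of freedom, whose expected value is 2.
centered = mean_squared_distance(center=[0.0, 0.0])

# Measured from a different center, it is noncentral chi-square; its
# expected value is 2 + |shift|^2 = 2 + 2 = 4.
shifted = mean_squared_distance(center=[1.0, 1.0])

print(centered, shifted)
```

When the center used in the distance is not the mean of the data, the statistic no longer follows the central chi-square distribution, so p-values taken from that distribution are only valid to the extent that the chosen center coincides with the true mean.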

ay-lab commented 1 year ago

The compartment scores are calculated from the distance-normalized Z-transformed (mean=0, sd=1) interaction matrix, and the final scores themselves have a mean value of 0. That's why they are not subtracted.

dangdachang commented 1 year ago

Thanks a lot for your answer. However, I think the compartment scores are calculated from the distance-normalized, Z-transformed correlation matrix rather than the interaction matrix. Also, because the compartment scores are PC1 values, an interaction matrix with mean=0 and sd=1 may not guarantee that they have a mean of 0. I may check the mean of the compartment scores later. Thank you, I have no more questions.
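The check mentioned here can be sketched on toy data. This is an illustration only: the two-column data set is made up, the leading eigenvector is found by a simple power iteration, and whether dcHiC centers its input before the PCA step is not stated in this thread.

```python
def mean(values):
    return sum(values) / len(values)


def top_eigenvector_2x2(a, b, d, iterations=200):
    """Power iteration for the leading eigenvector of [[a, b], [b, d]]."""
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        nx, ny = a * vx + b * vy, b * vx + d * vy
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm
    return vx, vy


# Made-up data with nonzero column means (5.0 and 4.0).
xs = [2.0, 3.0, 5.0, 6.0, 9.0]
ys = [1.0, 2.5, 3.5, 6.0, 7.0]
n = len(xs)
mx, my = mean(xs), mean(ys)

# Sample covariance matrix entries.
a = sum((x - mx) ** 2 for x in xs) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
d = sum((y - my) ** 2 for y in ys) / (n - 1)

vx, vy = top_eigenvector_2x2(a, b, d)

# PC1 scores with and without centering the columns before projection.
scores_centered = [(x - mx) * vx + (y - my) * vy for x, y in zip(xs, ys)]
scores_raw = [x * vx + y * vy for x, y in zip(xs, ys)]

print(mean(scores_centered), mean(scores_raw))
```

Scores projected from column-centered data average to zero up to rounding, while scores projected from uncentered data average to the projection of the column means onto PC1. Computing the mean of the actual PC1 values would therefore settle whether the zero-mean assumption holds.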