ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License

How to extract compartment A/B information for each sample #53

Closed KunFang93 closed 1 year ago

KunFang93 commented 1 year ago

Hi,

I would like to get compartment A/B information for each sample. I noticed that in the xx_resolution/intra_pca/sample_res_mat folder, each chromosome has 10 files, e.g.,

[kun@G1400PNG-AP02LP NT1_100000_mat]$ wc -l chrX*
    1554 chrX.bed
    1491 chrX.cmat.txt
    1522 chrX.distparam
    1491 chrX.PC1.bedGraph
    1491 chrX.PC2.bedGraph
    1491 chrX.pc.bedGraph
    1492 chrX.pc.txt
    1491 chrX.precmat.txt
     108 chrX.svd.rds
  912204 chrX.txt

How can I extract compartment A/B information from these files? Thanks for your help!

Best, Kun

KunFang93 commented 1 year ago

I found files named differential.intra_sample_chrXX_combined.pcQnm.bedGraph in the DifferentialResult/PT_RT_100000/fdr_result folder. Can I use the values in these bedGraph files to infer compartment A/B for each sample? For example,

(dchic) [kun@G1400PNG-AP02LP fdr_result]$ head differential.intra_sample_chr1_combined.pcQnm.bedGraph
chr start   end PT1_100000  PT2_100000  PT3_100000  PT4_100000  PT5_100000  RT1_100000  RT2_100000  RT3_100000  RT4_100000  RT5_100000  PT  RT  replicate_wt    sample_maha pval
chr1    0   100000  0.19785 -0.63275    0.23013 0.23188 0.46472 0.59767 -0.15075    0.72631 0.31756 0.69927 0.098366    0.438012    17.0283230046668    0.0665091336067877  0.796488987359423
chr1    100000  200000  -0.15469    -0.57651    0.05701 -0.12082    0.31452 0.67771 -0.32058    0.69927 0.28352 0.48196 -0.096098   0.364376    14.6515358501899    0.321070738534486   0.570964879780533
chr1    200000  300000  -0.11972    -0.44822    0.31555 -0.06539    0.65033 0.49692 -0.09224    0.65033 0.39607 0.48794 0.06651 0.387804    23.7918212822259    0.0481633873177334  0.826290501120561
chr1    500000  600000  0.01092 -0.58774    0.45031 0.03114 0.5769  0.6126  -0.06539    0.7413  0.61868 0.58627 0.096306    0.498692    17.6671098579832    0.166739999639082   0.683025475145062
chr1    600000  700000  -0.06135    -0.60609    0.28428 0.10637 0.57016 0.50297 -0.23014    0.80647 0.51995 0.67938 0.058674    0.455726    13.5847586113175    0.153437037901881   0.695272204414064
chr1    700000  800000  0.71038 -0.24246    0.91601 0.59576 1.06156 1.365   0.40538 1.35958 1.17846 1.24159 0.60825 1.110002    6.86424456943585    0.611888238837095   0.434077739403898
chr1    800000  900000  1.22411 -0.30698    1.42875 0.88449 1.34581 1.57203 0.65253 1.50606 1.44276 1.71208 0.915236    1.377092    3.25609257789608    0.47066216300425    0.492682700247561
chr1    900000  1000000 1.70949 -0.19608    1.74719 1.37552 1.63569 1.83472 1.28828 1.66303 1.62553 2.01513 1.254362    1.685338    2.54184234207455    0.404123603910036   0.524967303293424
chr1    1000000 1100000 1.59568 -0.2504 1.62885 1.4257  1.77502 2.03347 1.01516 1.67079 1.60472 1.65377 1.23497 1.595582    1.92957884015547    0.174371347027042   0.676255718773047

Does this mean that bin chr1 0 100000 is in compartment A for PT1_100000 because its PC value is positive (0.19785), while the next bin, chr1 100000 200000, is in compartment B because its PC value is negative (-0.15469)? Thanks for your help!

Best, Kun

ay-lab commented 1 year ago

Hi Kun

You can use the differential.intra_sample_chrXX_combined.pcOri.bedGraph file to extract the compartments. The pcQnm files contain the quantile-normalized compartment scores, which are only used to compare scores across samples and to derive significance internally. The pcOri files contain the original scores, which represent the A (positive values) and B (negative values) compartments.
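
As a rough illustration of the sign rule described above, here is a minimal Python sketch. It assumes the pcOri file is tab-separated with a header row laid out like the pcQnm excerpt earlier in this thread (chr, start, end, then one column per sample); that layout is an assumption, not something confirmed here. The function name `call_compartments` and the toy values are hypothetical.

```python
import csv
import io

def call_compartments(handle, sample_columns):
    """Label each bin per sample: 'A' if score > 0, 'B' if score < 0, 'NA' if exactly 0."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumes a tab-separated file with a header
    calls = []
    for row in reader:
        labels = {}
        for name in sample_columns:
            score = float(row[name])
            labels[name] = "A" if score > 0 else "B" if score < 0 else "NA"
        calls.append((row["chr"], int(row["start"]), int(row["end"]), labels))
    return calls

# Toy input standing in for a pcOri file; the values are made up for illustration.
toy = (
    "chr\tstart\tend\tPT1_100000\tRT1_100000\n"
    "chr1\t0\t100000\t0.20\t0.60\n"
    "chr1\t100000\t200000\t-0.15\t0.68\n"
)
calls = call_compartments(io.StringIO(toy), ["PT1_100000", "RT1_100000"])
for chrom, start, end, labels in calls:
    print(chrom, start, end, labels)
```

For a real file, you would pass `open(path, newline="")` instead of the `io.StringIO` toy and list the sample columns you care about.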

KunFang93 commented 1 year ago

Got it. Thanks for your prompt reply! I am new to compartment analysis, so please forgive me if this is a dumb question. To count the number of differential compartments, do we count the number of bins with padj below the cutoff? Or do we first merge adjacent bins with the same PC sign into one compartment, check whether any bin in the merged region has padj below the cutoff, and finally count the number of differential bins? From Fig. 3A in the paper, I guess it is the first option, counting bins. And to count the number of A/B compartments, do I need to merge bins with the same PC sign first and then count the merged regions? Thanks again for your help and time!

ay-lab commented 1 year ago

This is a very interesting question. Because we formulated the problem as comparing the compartment score of each Hi-C bin across multiple samples, we needed to compute the padj value for each bin separately. Merging adjacent bins with the same compartment assignment within each sample would not give equal-sized regions that can be compared properly across samples. So, to find differential compartments, we count the number of bins with padj below the cutoff.
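
A minimal Python sketch of this per-bin count follows. The column name `padj` and the 0.05 cutoff are assumptions made for illustration; neither is confirmed in this thread, so check the header of your own result file and choose the cutoff you intend to use.

```python
def count_differential_bins(rows, padj_cutoff=0.05, padj_key="padj"):
    """Count bins whose adjusted p-value is below the cutoff.

    rows: iterable of dicts, e.g. from csv.DictReader over a result file.
    padj_key and padj_cutoff are illustrative assumptions, not dcHiC defaults.
    """
    return sum(1 for row in rows if float(row[padj_key]) < padj_cutoff)

# Toy rows with made-up padj values.
toy_rows = [
    {"chr": "chr1", "start": "0", "padj": "0.80"},
    {"chr": "chr1", "start": "100000", "padj": "0.01"},
    {"chr": "chr1", "start": "200000", "padj": "0.04"},
]
n_diff = count_differential_bins(toy_rows)
print(n_diff)
```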

That said, the second option you suggested is more likely to give you a biologically interesting region. A significantly different bin can be part of a continuous stretch of A or B in which the other bins do not pass the padj threshold. Such cases may reflect a gradual change in compartment scores across samples. For example, look at the Dach1 region in Fig. 2K of the paper. In the NPC sample, the PC value gradually changes from B to A, and the padj values change with it. At some point, padj crosses the threshold and we call the bin significant.
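
The merge-then-inspect idea can be sketched as follows. This is not part of dcHiC; it is a hypothetical post-processing step for one sample, assuming you have already collected `(chrom, start, end, score, padj)` tuples sorted by position. The function name, the 0.05 cutoff, and the toy values are all illustrative. Bins are merged only when they are directly adjacent, so a gap from a filtered bin starts a new segment, and a score of exactly 0 is treated as B for simplicity.

```python
def merge_segments(bins, padj_cutoff=0.05):
    """Merge adjacent same-sign bins into segments and count significant bins in each."""
    segments = []
    current = None
    for chrom, start, end, score, padj in bins:
        label = "A" if score > 0 else "B"  # simplification: a zero score falls into B
        contiguous = (
            current is not None
            and current["chr"] == chrom
            and current["label"] == label
            and current["end"] == start  # require directly adjacent bins
        )
        if contiguous:
            current["end"] = end
            current["n_bins"] += 1
            current["n_sig"] += int(padj < padj_cutoff)
        else:
            current = {
                "chr": chrom, "start": start, "end": end, "label": label,
                "n_bins": 1, "n_sig": int(padj < padj_cutoff),
            }
            segments.append(current)
    return segments

# Toy bins with made-up scores and padj values; note the gap before 500000.
toy_bins = [
    ("chr1", 0, 100000, 0.2, 0.80),
    ("chr1", 100000, 200000, 0.5, 0.01),
    ("chr1", 200000, 300000, -0.3, 0.60),
    ("chr1", 500000, 600000, -0.1, 0.02),
]
segments = merge_segments(toy_bins)
for seg in segments:
    print(seg)
```

Segments with `n_sig > 0` are the stretches that contain at least one significant bin; the number of segments per label gives a count of merged A/B regions for that sample.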

KunFang93 commented 1 year ago

Got it, it makes a lot of sense. Thanks for your help!

katecycho commented 7 months ago

> Hi Kun
>
> You can use the differential.intra_sample_chrXX_combined.pcOri.bedGraph file to extract the compartments. The pcQnm files contain the quantile-normalized compartment scores, which are only used to compare scores across samples and to derive significance internally. The pcOri files contain the original scores, which represent the A (positive values) and B (negative values) compartments.

Question about this: I see that there are files named "intra_chr#_combined.pcOri.bedGraph" for individual chromosomes, as well as "differential.intra_sample_group.Filtered.pcOri.bedGraph". Is the latter a merged file of all individual chromosomes that is then filtered with a p-value cutoff? What is the recommended or default p-value/padj cutoff? Thank you so much!