ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License
62 stars 10 forks source link

Correct the A/B compartment #34

Closed Nuturetree closed 2 years ago

Nuturetree commented 2 years ago

Hi author: I correct the A/B compartment by use order "Rscript /public/home/xhhuang/biosoft/dcHiC/dchicf.r --file input_f.txt --pcatype select --dirovwt T --gfolder Gbar_100000_goldenpathData --genome Gbar". But got an unexpected result and I think this result is wrong. Because both ends of the chromosome are rich in many active genes, very high probability of being A compartment region, the result presents it with B compartment status. I run cworld and hicexplorer to compare the result. The result of cworld and hicexplorer are more consistent and as expected. Although, I am not sure about the exact correction algorithm, which may be related to GC content and gene location, I think there is a problem in correcting A/B compartment of cotton, does it occur similarly in other non-model organisms or in anther plant?
I have more questions and I hope the author will understand. I would very much like this software to be applicable to a wider range of scenarios.
I sent you the reference genome of cotton by email, I hope it will help to improve the software. Nuturetree chr1_compartment

ay-lab commented 2 years ago

Thanks for sending the files, I just rerun the PC selection step. Here are my suggestions.

A-compartments in the mammalian genome are highly correlated with GC-content and the number of TSS in a given window. So, the selection of PC components is done based on two factors. A PC should have the highest correlation with the GC content and TSS region (It should also be anticorrelated with the chromosome length). dcHiC uses a combined score of GC content and TSS region correlation and select the maximum score to find the best PC out of the full set. If you look at the DPA0_cor.txt, DPA5_cor.txt files you will see the respective correlation values gcc.cor, tss.cor, len.cor and the combined score (ignore the len.score). For example, Gbar_A01 from DPA0 has the following values (group is equal to PC number)

group   chr     gcc.cor tss.cor len.cor score
1       Gbar_A01        0.2238  -0.7163 0.2063  -0.4925
2       Gbar_A01        0.1707  -0.2265 0.0253  -0.0558

Now, in the above example, PC1 has a higher gc correlation which is higher than PC2 (both with a positive sign) but has a very low tss correlation compared to PC2 tss correlation. Thus the combined score is minimum for PC1 and thereby PC2 is selected upon PC1. The scenario is true for other chromosomes also. This is not the case for a mammalian genome, a high gc content is almost always associated with high tss number. Also (from the perspective of a mammalian genome) looking at the figure for the distribution of A/B compartments dcHiC compartment seems much more legit compared to the others. The other two seem more correlated with the chromosome length.

BUT, of course, you're the expert here and should have more knowledge about this genome in depth. I can help you to resolve this issue. There is a script reselectpc.r under 'utility' which you can use to select your custom PC at ease. For example, to reselect the PC1 for Gbar_A01 chromosome in DPA0 sample just run the following -

Rscript reselectpc.r --reselect man --sample DPA0_100kb_pca --chr Gbar_A01 --pc PC1 --pctype intra

There is an option where you can utilize the existing PC values calculated from other programs. dcHiC will the existing PCs to find the differential compartments. To use this option us the getcHiCinputfromExistingPCs.r script under 'utility' folder. This is the help page of the script -


Options:
        --input=INPUT
                This is a five column text file with the following fields

                <pc bedGraph path>      <pc type>       <replicate name>        <sample name>

                First column is the path to the bedGraph file with PC information (4th column of the bedGraph file)
                Second column can have either intra or inter
                Thrid column is the replicate name
                Fourth column is the sample name

                Note: replicate names and sample names should be different

        -h, --help
                Show this help message and exit

For this case, your input.txt file should look like the following

Gbar_A01.program.PC1.bedGraph    intra   DPA0_100Kb   DPA0
Gbar_A02.program.PC1.bedGraph    intra   DPA0_100Kb   DPA0
......

I hope these solutions will help to resolve the issue. Let me know how it goes.

Nuturetree commented 2 years ago

Hi author Thank you for your detailed reply! I think you are very wise to leave an interface that can combine the results of other software. I use the result produce by cworld as the input file to analysis the different A/B compartment. But the result show that some regions with same A/B compartment status consider as have switching between A/B compartment in *.Filtered.pcOri.bedGraph file. Whether the cworld results are not used as input to dcHiC or the results identified by dcHiC need to be manually filtered again. I have attached the results of cworld and dcHiC. DPA0_compartment.txt DPA5_compartment.txt DPA10_compartment.txt DPA20_compartment.txt

chr start end DPA0 DPA5 DPA10 DPA20 sample_maha pval padj dist_clust Gbar_A02 43900000 44000000 -0.03751 -0.03696 -0.0218 -0.00288 35.2983407239606 1.05 Gbar_A02 44700000 44800000 -0.03747 -0.03614 -0.0245 -0.00048 37.7308774679895 3.22

Nuturetree

ay-lab commented 2 years ago

Yes, dcHiC can detect differential compartment within the same compartment. A similar region with same compartment label can vary in terms of its magnitude of compartment score which can be detected as differential. This is a feature of dcHiC. The original contribution of dcHiC is to find the differential compartments across a group. So, a flexibility to integrate other softwares which only calculates the principal components will always be helpful to the users. Thanks for trying out the method.