c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

hic interactions statistics #78

Open AlcaArctica opened 11 months ago

AlcaArctica commented 11 months ago

I am running the Arima pipeline, followed by YaHS and then juicer pre (as described in this repository), to generate the out_JBAT.hic file required for manual curation in Juicebox. This is my first attempt, but I am happy with the resulting map and will probably use this workflow again. However, I am wondering how I could generate statistics about the quality of the Hi-C interactions. It seems that people who use Juicer to create their .hic file get a stats file along with their other results, with information similar to this:

Inter-chromosomal: 1,320,146 (0.51% / 0.93%)
Intra-chromosomal: 7,458,303 (2.87% / 5.27%)
Short Range (<20Kb): 4,571,216 (1.76% / 3.23%)
Long Range (>20Kb): 2,886,831 (1.11% / 2.04%)

How can I obtain similar statistics for my data with the described workflow (Arima pipeline, then YaHS, then juicer pre)? Thank you very much.

AlcaArctica commented 11 months ago

I would also appreciate it if you could point me to any other tools suitable for assessing the quality of my Hi-C interactions. I am using the Arima 4-enzyme kit, if that is relevant.

c-zhou commented 11 months ago

Hello @AlcaArctica,

YaHS does report inter-chromosomal and intra-chromosomal read pair counts while it runs; you can find them in your log file. However, those counts are for contigs, i.e., before scaffolding, so they are really inter-contig and intra-contig counts.

If you need accurate numbers for these statistics, I would suggest remapping the Hi-C data to your final chromosomes and using tools such as samtools to do the counting. The 9th column of the SAM file is what you need, i.e., the TLEN field - observed Template LENgth. See section 1.4 of [this document](https://samtools.github.io/hts-specs/SAMv1.pdf).

Best, Chenxi
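
For anyone landing here later, the counting suggested above could be sketched roughly as follows. This is only a minimal illustration, not part of YaHS or Juicer: the function name `count_pairs` and the 20 kb cutoff are assumptions (the cutoff simply mirrors the "<20Kb" label in the stats quoted earlier), and Juicer's own definitions and filtering may differ. It also assumes that the aligner or pairing step filled in the RNEXT (7th) and TLEN (9th) columns of the SAM records.

```python
from collections import Counter

# Assumed cutoff, chosen to mirror the "<20Kb" / ">20Kb" labels quoted above.
SHORT_RANGE_MAX = 20_000

# SAM flag bits: read unmapped, mate unmapped, secondary, duplicate, supplementary.
SKIP_MASK = 0x4 | 0x8 | 0x100 | 0x400 | 0x800


def count_pairs(sam_lines, cutoff=SHORT_RANGE_MAX):
    """Count inter/intra-chromosomal pairs from an iterable of SAM text lines.

    Hypothetical helper: each pair is counted once, via its first mate.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        # Keep only paired reads that are the first mate (0x1 and 0x40 set).
        if not (flag & 0x1 and flag & 0x40):
            continue
        if flag & SKIP_MASK:
            continue
        rname, rnext, tlen = fields[2], fields[6], int(fields[8])
        if rnext == "=" or rnext == rname:
            counts["intra"] += 1
            if abs(tlen) < cutoff:
                counts["short_range"] += 1
            else:
                counts["long_range"] += 1
        else:
            counts["inter"] += 1
    return counts
```

To use it on real data, one option is to stream records from `samtools view` on the remapped BAM into a small script that calls `count_pairs(sys.stdin)` and prints the result. Counting only the first mate avoids counting each pair twice, and skipping duplicate-flagged reads only has an effect if duplicates were marked beforehand.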

AlcaArctica commented 10 months ago

Thank you, @c-zhou . I will investigate further!