ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

The graph length of pangenome growth, especially when coverage =1 in panacus program. #1113

Open wangnan9394 opened 1 year ago

wangnan9394 commented 1 year ago

Hi,

I'm using cactus-minigraph, following the pangenome workflow: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md The single reference haplotype is about 800 Mb, and I have generated a pangenome graph (GFA). When I tested pangenome growth with panacus, using an input sample list in which the first sample is the single reference haplotype, I found that coverage = 1 corresponds to 610 Mb of sequence. I am not sure why the length of the first underlying graph is not equal to the length of the single reference haplotype (800 Mb). Were 190 Mb of sequence clipped from the graph?

Thank you so much.

Nan

glennhickey commented 1 year ago

I'm glad people are using panacus! I intend on eventually including it in the Cactus release and maybe running some of the growth curves automatically -- they are much nicer than the plots I've been making.

Anyway... the --reference genome is always included in its entirety in the output. Here's an example of how to verify this using vg and samtools on the S288C reference in the yeast example, which you can try on your data:

vg paths -Ex yeast.gbz | grep S288C | awk '{sum += $2} END {print sum}'
12157149
samtools faidx S288C.fa.gz
cat S288C.fa.gz.fai | awk '{sum += $2} END {print sum}'
12157149

Perhaps you are misinterpreting the panacus output? I agree that 610 Mb with coverage >= 1 would not make sense with your reference size. But with coverage == 1, it does hold up (i.e. 610 Mb is present in only one sample, which is a number that is not bounded by the reference length in any way).
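
To make the distinction concrete, here is a minimal sketch on a made-up node table. It is not panacus output and does not reproduce how panacus computes anything. The node names, lengths, sample labels, and the helper functions `coverage_bp` and `ref_bp` are all hypothetical, and it assumes that "coverage" means the number of samples traversing a node, as in the interpretation above.

```shell
# Toy node table: node_id, length_bp, comma-separated samples that traverse the node.
# Every value here is invented for illustration only.
nodes='n1 500 ref,s1,s2
n2 300 ref
n3 200 s1
n4 100 s2
n5 100 ref,s2'

# Sum the length of nodes whose sample count matches the threshold k.
# op is "eq" (exactly k samples) or "ge" (at least k samples).
coverage_bp() {
  op=$1; k=$2
  printf '%s\n' "$nodes" | awk -v op="$op" -v k="$k" '
    { n = split($3, s, ",")
      if ((op == "eq" && n == k) || (op == "ge" && n >= k)) sum += $2 }
    END { print sum + 0 }'
}

# Length of the reference path: sum of every node that "ref" traverses.
ref_bp() {
  printf '%s\n' "$nodes" | awk '
    { n = split($3, s, ",")
      for (i = 1; i <= n; i++) if (s[i] == "ref") sum += $2 }
    END { print sum + 0 }'
}

echo "reference length: $(ref_bp)"
echo "coverage >= 1:    $(coverage_bp ge 1)"
echo "coverage == 1:    $(coverage_bp eq 1)"
```

In this toy graph the reference path is 900 bp and is fully present, coverage >= 1 gives 1200 bp (at least the reference length), and coverage == 1 gives 600 bp. The coverage == 1 total excludes every reference node that another sample shares and includes private sequence from non-reference samples, so it can come out below or above the reference length.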