ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
526 stars 111 forks source link

How do I create d2 d8 d10 filtered graph pangenomes from a clip graph pangenome? My steps might be incorrect. #1469

Closed zhangyixing3 closed 3 months ago

zhangyixing3 commented 3 months ago

Dear Sir, I compared the d2 index obtained by default from minigraph cactus with the d2 index I built using vg, but my graph pangenome results seem incorrect.

vg clip 97samples.vg -d 2 -P 035_Col_CEN_v12 -m 1000 |  /
      vg clip -d 1 - -P 035_Col_CEN_v12  |    /
      vg clip -sS - -P 035_Col_CEN_v12 > 97samples.d2.vgclip.vg

vg convert -f 97samples.d2.vgclip.vg   > 97samples.d2.vgclip.gfa
I check the first few Walk lines
$ grep "^W"  97samples.d2.vgclip.gfa  | cut -f 1-6 | head 
W   001_6137    0   Chr1    196 19144
W   001_6137    0   Chr1    19145   19946
W   001_6137    0   Chr1    19947   31063
W   001_6137    0   Chr1    31064   41815
W   001_6137    0   Chr1    41816   50770
W   001_6137    0   Chr1    50771   137723
W   001_6137    0   Chr1    137724  210054
W   001_6137    0   Chr1    210055  215043
W   001_6137    0   Chr1    215044  251367
W   001_6137    0   Chr1    251368  259882

The above result is inconsistent with the default output of minigraph cactus

$ grep "^W"  97samples.d2.gfa | cut -f 1-6 | head 
W   001_6137    0   Chr1    210055  215043
W   001_6137    0   Chr1    294553  294630
W   001_6137    0   Chr1    215044  251367
W   001_6137    0   Chr1    294631  299803
W   001_6137    0   Chr1    299804  301783
W   001_6137    0   Chr1    301784  303176
W   001_6137    0   Chr1    303177  305154
W   001_6137    0   Chr1    305154  307379
W   001_6137    0   Chr1    307380  313233
W   001_6137    0   Chr1    313233  315367
# This is clip  pangenome
$ grep "^W"  97samples.gfa | cut -f 1-6 | head 
W   001_6137    0   Chr1    196 7158713
W   001_6137    0   Chr1    7169501 10859158
W   001_6137    0   Chr1    10880474    11484706
W   001_6137    0   Chr1    11519086    11707335
W   001_6137    0   Chr1    11720727    12226079
W   001_6137    0   Chr1    12238998    12350168
W   001_6137    0   Chr1    12367152    12399531
W   001_6137    0   Chr1    12412025    12563687
W   001_6137    0   Chr1    12574826    12845858
W   001_6137    0   Chr1    12860246    12982947

The differences between 97samples.d2.gfa and 97samples.d2.vgclip.gfa derived from the same clip pangenome are significant. Am I missing some steps in creating the filtered pangenome? I found that the filtered pangenome index I created cannot be operated on by odgi (neither odgi paths nor odgi stats work), while the default d2 pangenome works fine.

glennhickey commented 3 months ago

Your steps look okay. If I try to do something similar on the yeast test data included in cactus

cactus-pangenome ./js ./examples/yeastPangenome.txt --outDir yeast-filter-test --outName yeast --reference S288C  --gfa clip filter
cd yeast-filter-test
zcat yeast.gfa.gz | vg clip -d 2 - -P S288C -m 1000 | vg clip -d 1 -P S288C - | vg clip -sS - -P S288C > yeast.myd2.gfa
zcat yeast.d2.gfa.gz > yeast.d2.gfa

Then the two files have equivalent stats

vg stats -lz yeast.d2.gfa 2> /dev/null
nodes   452349
edges   551752
length  12417384

vg stats -lz yeast.myd2.gfa 2> /dev/null
nodes   452349
edges   551752
length  12417384

along with path names and lengths

vg paths -Ex yeast.d2.gfa 2> /dev/null | sort > yeast.d2.paths
vg paths -Ex yeast.myd2.gfa 2> /dev/null | sort > yeast.myd2.paths
diff -s yeast.d2.paths yeast.myd2.paths
Files yeast.d2.paths and yeast.myd2.paths are identical
zhangyixing3 commented 3 months ago

Thanks for your quickly reply,I'll check my steps.

zhangyixing3 commented 3 months ago

You were right, 97samples.d2.gfa and 97samples.d2.vgclip.gfa do have equivalent stats. I just forgot to sort the paths. And yesterday, when I was using odgi, I completely forgot to convert the GFA 1.1 to GFA 1.0. I feel like such an idiot. Thank you !