ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Conversion of .pg to .vg format for use with VG #423

Open maxgmarin opened 3 years ago

maxgmarin commented 3 years ago

Hello,

I have successfully run the Cactus pipeline on a set of ~30 bacterial genomes and on the mammals example data. Both times Cactus produced a .hal file that I can inspect and validate.

I was able to use hal2vg to convert the .hal alignment to the .pg format.

The issue I have run into is that none of the vg subcommands (latest version, 1.30) seems to accept the .pg file produced by hal2vg.

Is there a straightforward way to convert .pg to .vg?

Alternatively, should I have hal2vg output .odgi instead, then convert the .odgi to .gfa, and the .gfa to .vg?

Thank you, Max

maxgmarin commented 3 years ago

Additionally, I noticed that for the SV genotyping paper (https://github.com/vgteam/sv-genotyping-paper), hal2vg was used to write .vg output directly.

The rule used for that paper, taken from its Snakemake file, is:

rule hal2vg:
    input:
        "cactusoutput.hal"
    output:
        "yeast.vg"
    shell:
        "~/bin/hal2vg_fork/hal2vg --noAncestors --refGenome S288C {input} > {output}"

glennhickey commented 3 years ago

hal2vg output works fine with vg. Perhaps you have an outdated version of one of them?

example:

wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v1.2.3/cactus-bin-v1.2.3.tar.gz
tar xf cactus-bin-v1.2.3.tar.gz
wget https://github.com/vgteam/vg/releases/download/v1.30.0/vg
chmod +x vg

cactus-bin-v1.2.3/bin/halRandGen rand.hal
cactus-bin-v1.2.3/bin/hal2vg rand.hal > rand.vg
./vg paths -Ev rand.vg
Genome_16.Genome_16_seq 436644
Genome_13.Genome_13_seq 223992
Genome_14.Genome_14_seq 112716
Genome_3.Genome_3_seq   150518
Genome_9.Genome_9_seq   161504
Genome_17.Genome_17_seq 242916
Genome_2.Genome_2_seq   284130
Genome_19.Genome_19_seq 154714
Genome_18.Genome_18_seq 585488
Genome_8.Genome_8_seq   720948
Genome_0.Genome_0_seq   141933
Genome_6.Genome_6_seq   196470
Genome_4.Genome_4_seq   139629
Genome_15.Genome_15_seq 572700
Genome_10.Genome_10_seq 771630
Genome_12.Genome_12_seq 219286
Genome_11.Genome_11_seq 752640
Genome_7.Genome_7_seq   853905
Genome_1.Genome_1_seq   828696
Genome_5.Genome_5_seq   476136

halStats rand.hal

hal v2.1
(((Genome_14:0,Genome_15:0)Genome_9:0)Genome_1:0,((Genome_16:0)Genome_10:0)Genome_2:0,Genome_3:0,Genome_4:0,(Genome_11:0)Genome_5:0,Genome_6:0,((Genome_17:0,Genome_18:0)Genome_12:0)Genome_7:0,((Genome_19:0)Genome_13:0)Genome_8:0)Genome_0;

GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Genome_0, 8, 141933, 1, 0, 253
Genome_1, 1, 828696, 1, 1478, 473
Genome_9, 2, 161504, 1, 93, 196
Genome_14, 0, 112716, 1, 137, 0
Genome_15, 0, 572700, 1, 696, 0
Genome_2, 1, 284130, 1, 507, 210
Genome_10, 1, 771630, 1, 571, 445
Genome_16, 0, 436644, 1, 252, 0
Genome_3, 0, 150518, 1, 269, 0
Genome_4, 0, 139629, 1, 249, 0
Genome_5, 1, 476136, 1, 849, 389
Genome_11, 0, 752640, 1, 615, 0
Genome_6, 0, 196470, 1, 351, 0
Genome_7, 1, 853905, 1, 1523, 435
Genome_12, 2, 219286, 1, 112, 166
Genome_17, 0, 242916, 1, 184, 0
Genome_18, 0, 585488, 1, 444, 0
Genome_8, 1, 720948, 1, 1286, 438
Genome_13, 1, 223992, 1, 137, 183
Genome_19, 0, 154714, 1, 127, 0
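
The two listings above can be cross-checked: each path length printed by vg paths -E should equal the Length column that halStats reports for the same genome. Below is a minimal Python sketch of that comparison. The function names are hypothetical, and it assumes the path names have the form Genome.Sequence with no dot inside the genome name and one sequence per genome, as in this example. It only parses text that has already been captured and does not call vg or halStats itself.

```python
def parse_vg_paths(text):
    """Parse `vg paths -E` output: one 'path_name length' pair per line.

    Assumes path names look like 'Genome.Sequence' (as in the listing above)
    and sums the lengths of all paths that belong to the same genome.
    """
    lengths = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or unrelated lines
        genome = fields[0].split(".", 1)[0]
        lengths[genome] = lengths.get(genome, 0) + int(fields[1])
    return lengths


def parse_halstats(text):
    """Parse the per-genome table of `halStats` output into {genome: length}."""
    lengths = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("GenomeName"):
            in_table = True  # rows after this header are table rows
            continue
        if not in_table:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip blank or malformed rows
        lengths[fields[0]] = int(fields[2])  # third column is Length
    return lengths


def length_mismatches(vg_lengths, hal_lengths):
    """Return {genome: (vg_length, hal_length)} for every disagreement.

    A genome missing from one side shows up with None on that side.
    """
    genomes = sorted(set(vg_lengths) | set(hal_lengths))
    return {
        g: (vg_lengths.get(g), hal_lengths.get(g))
        for g in genomes
        if vg_lengths.get(g) != hal_lengths.get(g)
    }


# A few rows copied from the output above.
vg_text = """
Genome_16.Genome_16_seq 436644
Genome_14.Genome_14_seq 112716
Genome_0.Genome_0_seq   141933
"""
hal_text = """
GenomeName, NumChildren, Length, NumSequences, NumTopSegments, NumBottomSegments
Genome_0, 8, 141933, 1, 0, 253
Genome_14, 0, 112716, 1, 137, 0
Genome_16, 0, 436644, 1, 252, 0
"""

vg_lengths = parse_vg_paths(vg_text)
hal_lengths = parse_halstats(hal_text)
print(length_mismatches(vg_lengths, hal_lengths))  # prints {} for these rows
```

An empty result means every genome appears in both listings with the same length. Note that the vg output above includes the ancestral genomes as well (for example Genome_0), since hal2vg was run without --noAncestors.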