ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
503 stars 110 forks source link

How to genotype the result file primates-pg.vcf.gz of minigraph_cactus? #954

Open abcyulongwang opened 1 year ago

abcyulongwang commented 1 year ago

Dear developer! I have a dozen high-quality genomes of diploid mammals, and I have obtained a graphical pan-genome through Minigraph-Cactus, which we think is accurate and perfect. Of course, we wanted further analysis, so we used Vg software to build a pan-transcriptome. After that , I ran vg autoindex --workflow mpmap --workflow rpvg --prefix vg_rna --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf. it got error like this

The output: [IndexRegistry]: Checking for phasing in VCF(s). error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA VCF are insufficient to create target index Haplotype-Transcript GBWT

I've tried adding the parameter --gfa ref.gfa with similar result:

[IndexRegistry]: Checking for phasing in VCF(s). [IndexRegistry]: Provided: VCF [IndexRegistry]: Checking for haplotype lines in GFA. [IndexRegistry]: Provided: Reference GFA error:[vg autoindex] Input is not sufficient to create indexes

According to other developers' suggestions, it is possible that our vcf file is not phasing, I observed our vcf result file and found that the results of the GT column are not in the usual 0|0, 0|1 or 1|1 format. Our GT column only has one number. Although this result is correct, it may not conform to the input file format of Vg.

image

So I would like to get your suggestion on how to solve this problem, is to re-run Minigraph-Cactus? Or are there other tools that can achieve phase.

Here is the link to the previous question for your reference: https://github.com/jonassibbesen/rpvg/issues/43

Sincerely hope to get your help! Best

glennhickey commented 1 year ago

To make a phased diploid VCF with minigraph-cactus, you need to use the sample names in the input seqfile as described here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#sample-names where you use ".1" and ".2" suffixes for the first and second haplotype of each sample. So chimp.1 and chimp.2 for chimp, gorilla.1 and gorilla.2 for gorilla etc.

Changing the names requires starting from scratch. If you don't want to rerun everything, you can try to edit the GFA output from cactus-graphmap-join to fix this information. The haplotype number is the 3rd column in the W(alk) lines. So if you have

W  chimp_hap1 0 ...
W  chimp_hap2 0 ...

You'd change this to

W  chimp 1 ...
W  chimp 2 ...

etc.

From there you can remake the gbz from the gfa with vg gbwt and then regenerate the vcf with vg deconstruct and it will be properly phased.