ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

same position error in vcf output file #1477

Closed shin0727 closed 2 months ago

shin0727 commented 2 months ago

To construct the graphical genome for my datasets, I ran the following command and got the ${PREFIX}.vcf.gz output file:

cactus-pangenome ./jobstorepath ./sequenceFile.tsv --outDir ${PREFIX1} --outName ${PREFIX1} --reference ${REF} --filter 9 --giraffe clip filter --vcf --viz --odgi --chrom-vg clip filter --chrom-og --gbz clip filter full --gfa clip full --vcf --giraffe --gfa --gbz --chrom-vg --logFile ${PREFIX1}.log

However, the output contains 116,343 same-position errors, like the one in the image below.

I want to know whether this is an error and, if it is not, why these variants are written on two rows.

glennhickey commented 2 months ago

I think this is the same problem as #1460, and is related to the fact that I added bcftools norm to the VCF exporter by default. Apparently the left-shifting done by this tool can result in variants being put at the same position. While the VCF should still be technically valid, some tools won't like it, including PanGenie, as the previous issue points out.

I haven't had a chance to figure out how to merge these variants yet but hope to have a solution in the next release.

In the meantime, you can revert to the old way of not normalizing by doing something like

sed -e 's/bcftoolsNorm="1"/bcftoolsNorm="0"/g' src/cactus/cactus_progressive_config.xml > config-nonorm.xml

Then run any cactus commands with --configFile config-nonorm.xml
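To check how many positions in a VCF carry more than one record (for example, before and after regenerating the file with normalization turned off), something like the following sketch may help. It is not part of cactus: the function name `count_shared_positions` is hypothetical, and it assumes a plain or gzip/bgzip-compressed VCF with tab-separated CHROM and POS in the first two columns.

```python
import gzip
from collections import Counter


def count_shared_positions(vcf_path):
    """Return {(chrom, pos): n_records} for every position with more than one record.

    Hypothetical helper, not a cactus or bcftools function.
    """
    # Assumption: a ".gz" suffix means gzip/bgzip compression; anything else is plain text.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines, the column header, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos = line.split("\t", 2)[:2]
            counts[(chrom, int(pos))] += 1
    # Keep only positions that appear on two or more rows.
    return {key: n for key, n in counts.items() if n > 1}
```

Calling `len(count_shared_positions("out.vcf.gz"))` (with `out.vcf.gz` standing in for your own file) gives the number of distinct positions that appear on more than one row, which is one way to compare the normalized and non-normalized outputs.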