ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Combine level-0 variants with level-1 in vcf generated from cactus-pangenome #774

Open RenzoTale88 opened 2 years ago

RenzoTale88 commented 2 years ago

Hello, I'm trying to apply bcftools consensus using a cactus VCF file, but I'm running into overlapping variants in the process. I can see that the overlapping variants correspond to several level-1 variants following a large level-0 variant, like this one:

1       156983  >415>494        CATTCACTGATCACGTGGCTGATCATGCACTGATCATATGGCAATCATGCACTGATCACGTGTCTTATCATGCACTGATCACGTGGCTGATCATACAATGATAAGGTGGCTGATCATGCACTAATCACTTTGCTTATCATGCACTGATCAGGTGGCTATCATGCAC  TATTCACTGATCACGTAGGTGATCATGCACTGATTATGTGGCTGATCATACACTGATCATGTGACTGATTATGCACTAATCACGTTGGGGATCATGCACTGATCATGTGGCTGATCATACACTTATCACGTGATGGATCATGCACTAGACACATGGCTATCATGAAT 60      .       AC=2;AF=1;AN=2;AT=>415>416>418>419>421>422>424>425>427>428>430>431>433>434>436>437>439>440>442>443>445>446>448>449>451>452>454>455>457>458>460>461>463>464>466>467>469>470>472>473>475>476>478>479>481>482>484>485>487>488>490>491>494,>415>417>492>493>494;NS=1;LV=0   GT      1|1
1       156999  >418>421        G       A       60      .       AC=1;AF=1;AN=1;AT=>418>419>421,>418>420>421;NS=1;LV=1;PS=>415>494       GT      0|1
1       157001  >421>424        C       G       60      .       AC=1;AF=1;AN=1;AT=>421>422>424,>421>423>424;NS=1;LV=1;PS=>415>494       GT      0|1
1       157017  >424>427        C       T       60      .       AC=1;AF=1;AN=1;AT=>424>425>427,>424>426>427;NS=1;LV=1;PS=>415>494       GT      0|1

Is there a way to combine these types of sites into larger multiallelic sites? Sorry for the slightly bizarre question, and thanks in advance for the help!

Andrea

glennhickey commented 2 years ago

You need to use something like https://github.com/pangenome/vcfbub to remove nested variants. This definitely needs a boost in the documentation.

RenzoTale88 commented 2 years ago

@glennhickey thanks for the quick reply. I thought vcfbub just removes the variants that are not on level 0 in the VCF file?

glennhickey commented 2 years ago

Yes, but it allows you to specify a maximum bubble size, which lets you remove level-0 variants that are too big and keep only the children.

You never want both a level N and level N+1 variant from the same site in your VCF, as they contain redundant information.

But... you may want to use some kind of decomposition to clean up the big bubbles. We tried a few approaches in https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1.abstract, which you can find by digging through the methods.

The most extreme is just to realign everything to the reference to try to get a simpler VCF. This is effective, but you have to keep in mind that the variants in your VCF no longer correspond to the topology of the graph: https://github.com/vcflib/vcflib/blob/master/doc/vcfwave.md
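
To make the level rule above concrete, here is a rough Python sketch of the filtering idea: drop a top-level site when its REF allele is longer than a threshold, let its children take its place, and never emit both a site and its children. It is only an illustration and not how vcfbub is actually implemented. It assumes one record per site, that the ID column holds the snarl ID, and that children point at their parent through the `PS` INFO field, as in the records quoted in this thread. The function name `filter_nested_sites` and the `max_ref_len` parameter are made up for this example, and only REF length is checked, not ALT allele length.

```python
def filter_nested_sites(vcf_lines, max_ref_len):
    """Keep one nesting level per site (illustrative sketch only).

    A site is "exposed" if it has no parent, or if its parent was dropped
    for being too long. Exposed sites are kept when their REF allele fits
    within max_ref_len; otherwise they are dropped and their children
    become exposed. Children of a kept site are redundant and are dropped.
    """
    lines = [line.rstrip("\n") for line in vcf_lines if line.strip()]

    # First pass: map each site ID to (parent ID or None, REF length).
    sites = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        info = dict(item.partition("=")[::2] for item in fields[7].split(";"))
        sites[fields[2]] = (info.get("PS"), len(fields[3]))

    status = {}  # site ID -> "kept", "popped" or "redundant"

    def resolve(site_id):
        if site_id not in sites:
            # Assumption: a parent missing from the file counts as removed.
            return "popped"
        if site_id not in status:
            parent, ref_len = sites[site_id]
            exposed = parent is None or resolve(parent) == "popped"
            if not exposed:
                status[site_id] = "redundant"
            elif ref_len > max_ref_len:
                status[site_id] = "popped"
            else:
                status[site_id] = "kept"
        return status[site_id]

    # Second pass: keep header lines and every record resolved as "kept".
    return [
        line
        for line in lines
        if line.startswith("#") or resolve(line.split("\t")[2]) == "kept"
    ]
```

With a small threshold, the `>415>494` record above would be dropped and the three level-1 SNPs kept; with a threshold larger than its REF allele, only `>415>494` would remain.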

RenzoTale88 commented 2 years ago

Oh the re-alignment would be good actually. I don't need to use the graph topology as such since I want to feed the variants to bcftools consensus. So the simpler the better. Thanks so much for this, I'll give vcfwave a go!
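
Whichever cleanup route is used, it can help to check whether any records still overlap before running bcftools consensus. The sketch below is a minimal check, assuming an uncompressed VCF whose records are sorted by position within each chromosome. It flags any record that starts inside the REF span of an earlier record on the same chromosome. It ignores genotypes and phasing, so it may not match exactly what bcftools itself treats as an overlap, and `find_overlapping_sites` is just a name invented for this example.

```python
def find_overlapping_sites(vcf_lines):
    """Return (chrom, pos) for records that start inside an earlier REF span.

    Assumes records are sorted by position within each chromosome.
    """
    furthest_end = {}  # chrom -> last reference base covered so far
    overlapping = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _site_id, ref = line.split("\t")[:4]
        start = int(pos)
        end = start + len(ref) - 1  # REF covers positions start..end inclusive
        if start <= furthest_end.get(chrom, 0):
            overlapping.append((chrom, start))
        furthest_end[chrom] = max(end, furthest_end.get(chrom, 0))
    return overlapping
```

An empty list means no record starts within the REF span of a previous one.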