ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
528 stars 111 forks source link

Minigraph-cactus VCF has overlapping variants after left align #1460

Open Han-Cao opened 3 months ago

Han-Cao commented 3 months ago

Hi,

In current MC pipeline, the default output VCF (i.e., after vcfbub and bcftools norm) may have overlapping variants due to left align.

For example:

In mc.raw.vcf.gz, there is no overlap between the below 2 variants

chr1    702357  >31615734>31615730      AGA     AAA,A   60      .      AT=>31615734>31615732>31615731>31615730,>31615734<31615733>31615731>31615730,>31615734>31615730
chr1    702369  >31615730>31615728      A       AA      60      .      AT=>31615730>31615728,>31615730<31615729>31615728

After running bcftools norm -f GRCh38.fa, the output variants are overlapping:

chr1    702356  >31615734>31615730      AAG     AAA,A   60      .      AT=>31615734>31615732>31615731>31615730,>31615734<31615733>31615731>31615730,>31615734>31615730
chr1    702358  >31615730>31615728      G       GA      60      .      AT=>31615730>31615728,>31615730<31615729>31615728

Do you think these 2 records should be merged into one multiallelic record? Moreover, as overlapping variants are not accepted by some tools (e.g., pangenie), this could be an issue for downstream analysis.

glennhickey commented 3 months ago

I see what you mean. Is there a bcftools command (or other tool) that can do this?

Han-Cao commented 3 months ago

I tried bcftools norm -m +any, but it cannot merge these records (v1.20). I am not sure whether other tool can do it.

What I currently do is skipping bcftools norm -f to generate a non-overlapping multiallelic VCF (like old cactus) for tools require non-overlapping input.

minglibio commented 3 months ago

Same issue here @glennhickey . The variants are even in the same position.

1       62173   >4733>4736      G       A       60      .       AC=3;AF=0.3;AN=10;AT=>4733>4734>4736,>4733<4735>4736;NS=8;LV=0  GT      0|.     0|.     .|.     1|.     .       .|.     .|.     .|.     .|.     0|0     .|.     1|1     .|.     0       .       0       0|.
1       62173   >4736>4767      GTGGTGGCACAGCCGTGATCACAGACTCAGGGTGATGTGGGTCCCCATGGTGGCACAGCCGTGACCATGACCTCAGGGTGACGTGGGTCCCCA   G,GTGGTGGCACAGCCGTGACCATGACCTCAGGGTGCTATGGGTCCCCATGGTGACACCACCACGACAACCACGGCCTCAGGGTGACGTGGGTCCCCA      60      .       AC=7,2;AF=0.7,0.2;AN=10;AT=>4736>4737>4738>4740>4741>4743>4744>4746>4747>4749>4750>4752>4753>4755>4756>4758>4760>4761>4763>4764>4766>4767,>4736>4767,>4736>4737<4739>4740<4742>4743<4745>4746<4748>4749<4751>4752<4754>4755<4757>4758<4759>4760<4762>4763<4765>4766>4767;NS=8;LV=0    GT      0|.           1|.     .|.     1|.     .       .|.     .|.     .|.     .|.     1|1     .|.     2|2     .|.     1       .       1       1|.
glennhickey commented 2 months ago

There's a tool in vcflib, vcfcreatemulti that seems to be able to merge overlapping variants. It's included in the Cactus docker release (but not binary release). I'd be curious to know how that works out. In the meantime, I'm leaning towards disabling the normalizer by default in the next release.