Duplicated variants in a pangenome VCF

Han-Cao commented 1 month ago

Hi,

In the output VCF of MC pipeline, there are a few duplicated variants after left-algin. Below is one example from the HPRC VCF:

Before left-align (bcftools view -r chr22:15409352-15409435 hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz):

chr22   15409352        >45360363>45360365      C       CCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAAAGTGTTTGTCC        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<45360364>45360365;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
chr22   15409435        >45360367>45360369      A       AAAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360367>45360369,>45360367>45360368>45360369;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

After bcftools norm -f ref.fa, these 2 insertions are exactly the same:

chr22   15409341        >45360363>45360365      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<4536
0364>45360365;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     
0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
chr22   15409341        >45360367>45360369      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA        60      .       AC=1;AF=0.0113636;AN=88;AT=>45360367>45360369,>45360367>4536
0368>45360369;NS=45;LV=0    GT      0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|1     0|0     
0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

Does is mean these 2 are the same insertion?

As the genotypes of these 2 insertions are different, to remove duplicated records from the VCF, should I keep one of them (like bcftools norm --rm-dup exact) or merge their genotypes (i.e., 2 AC=1 records merge into 1 AC=2 record)?

Thanks a lot!

glennhickey commented 1 month ago

You can try merging them with vcfcreatemulti. This tools is part of vcftools and should be available in the cactus docker image.

Han-Cao commented 1 month ago

I have tried to use vcfcreatemulti to merge them. The output is:

chr22   15409341        >45360363>45360365      G       GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA,GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA  60      .       AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<45360364>45360365;NS=45;LV=0;combined=15409341-15409341 GT      0       0|0     0|0     0|0     0|0     0|0     0|0    0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     .|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0    0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0

But there are 2 problems:

it create a multi-allelic record with the same ALT alleles
the genotype of the second variant is not merged (maybe a bug for duplicated alleles?)

I was planning to do something like:

chr1    100    var1    G    GAA    0|0    0|1    1|0
chr1    100    var1    G    GAA    0|0    1|0    0|0

merge into

chr1    100    var1    G    GAA    0|0    1|1    1|0

If you think this strategy is OK to handle duplicated variants in MC output VCF, I can write some script to do it.

Many thanks.

glennhickey commented 1 month ago

That's disappointing about vcfcreatemulti. I was hoping it was to do something like what you said you were planning to do, which seems fine. If you do come up with a general script for doing this, please share and I will look at incorporating it into cactus...

Han-Cao commented 1 month ago

I just wrote a prototype python script. You can find it here

It is now designed for phased VCF only:

check duplicate by CHROM, POS, REF, ALT
merge phased haplotypes: 1|0 + 0|1 -> 1|1, save merged variant ID in INFO/DUP_ID
don't merge 2 variants if genotype conflict on any haplotype: 1|0 + 1|0 -> no merge
raise warning when missing genotype merged with non-missing genotype: 1|0 + 0|. -> 1|0 with warning in log

For below VCF:

#CHROM  POS ID  REF ALT INFO    FORMAT  CHM13   Sample1 Sample2 Sample3
chr1    5   var1    A   TTT AN=7;AF=0.285714;AC=2   GT  0   0|0 1|0 0|1
chr1    5   var2    A   TTT AN=7;AF=0.142857;AC=1   GT  1   0|0 0|0 0|0
chr1    5   var3    A   TTT AN=6;AF=0;AC=0  GT  .   0|0 0|0 0|0
chr1    5   var4    A   TTT AN=7;AF=0.142857;AC=1   GT  0   0|0 1|0 0|0
chr1    5   var5    A   TTT AN=7;AF=0.285714;AC=2   GT  0   0|0 0|1 0|1
chr1    5   var6    A   C   AN=6;AF=0.333333;AC=2   GT  0   0|0 1|0 .|1
chr1    6   var7    A   TTT AN=6;AF=0.333333;AC=2   GT  0   0|0 1|0 .|1

It output

#CHROM  POS ID  REF ALT INFO    FORMAT  CHM13   Sample1 Sample2 Sample3
chr1    5   var1    A   TTT AN=7;AF=0.428571;AC=3;DUP_ID=var2:var3  GT  1   0|0 1|0 0|1
chr1    5   var4    A   TTT AN=7;AF=0.428571;AC=3;DUP_ID=var5   GT  0   0|0 1|1 0|1
chr1    5   var6    A   C   AN=6;AF=0.333333;AC=2   GT  0   0|0 1|0 .|1
chr1    6   var7    A   TTT AN=6;AF=0.333333;AC=2   GT  0   0|0 1|0 .|1

I also tested the current script on HPRCv1.1 chr22:

> bcftools norm -r chr22 -f ref.fa hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz -Ou | bcftools sort -Oz -o hprc.chr22.vcf.gz
> python merge_duplicates.py -i hprc.chr22.vcf.gz -o hprc.chr22.out.vcf.gz

[2024-10-04 11:35:01] - [INFO]: Merge duplicated variants from: hprc.chr22.vcf.gz
[2024-10-04 11:35:01] - [WARNING]: missing genotype conflict at chr22:15226565: >45350772>45350778_5 and >45350778>45350782_3
...
[2024-10-04 11:35:18] - [INFO]: Read 439622 variants
[2024-10-04 11:35:18] - [INFO]: 2108 variants are merged
[2024-10-04 11:35:18] - [WARNING]: 28 variants have non-missing genotypes merged with missing genotypes
[2024-10-04 11:35:18] - [INFO]: 440 variants are not merged due to haplotype conflict
[2024-10-04 11:35:18] - [INFO]: Write 437514 variants to hprc.chr22.out.vcf.gz

I am still considering how to handle duplicate variants with conflict genotypes:

1|0 vs 1|0 :
- don't merge, write both to output (now)
- don't merge, write the first one (like bcftools norm, make sure no duplicate in output)
- merge anyway, output 1|0
0|1 vs .|0 :
- merge to 0|1 with warning (now, is it safe?)
- don't merge, write both / first

For duplicate variants with 1|0 vs 1|0, do you think these are really redundantly called by cactus or different variants with the same VCF representation due to the limitation of VCF?

glennhickey commented 1 month ago

Wow, this looks really promising, thanks for sharing! I will take a closer look when I have a free minute (which unfortunately won't be for a few days), but am very interested in incorporating it if possible into cactus.

glennhickey commented 4 weeks ago

Just writing to say: I haven't forgotten about this -- still look forward to trying it out and hopefully incorporating it or something like it in the pipeline once I get the chance.

Han-Cao commented 4 weeks ago

Please take your time. If you have any suggestions on the script, I am happy to revise.

Besides, I added some options (--keep and --missing) to allow the script to have different behavior when merging conflicting genotypes. Its default output is still the same as the previous version. I will try to do some evaluation with different parameters and will let you know the results.

Han-Cao commented 2 days ago

Hi @glennhickey ,

I just did some analysis to evaluate different methods for merging duplicated variants.

I merged duplicated variants in the biallelic HPRC v1.1 VCF (decomposed by vcfwave) and compared its genotypes with that in the 1000G 30X release. All analyses are performed on chr20 using the overlapped samples between HPRC and 1000G (n = 39). Only duplicated variants in the HPRC VCF are included.

The variant merging strategies are:

bcftools norm --rm-dup exact: only keep the first one and drop all the other duplicates
merge_duplicates.py --keep all: merge phased genotypes if no conflict in any samples, keep duplicates that cannot be merged into the first one
merge_duplicates.py --keep all --missing conflict: similar as the above one, but consider missing genotype and non-missing genotype as conflict (i.e., 1|. vs 0|0)
merge_duplicates.py --keep first: merge phased genotypes if no conflict in any samples, drop duplicates that cannot be merged into the first one
merge_duplicates.py --keep first--missing conflict: similar as the above one, but consider missing genotype and non-missing genotype as conflict

The weighted genotype concordance (as defined in the PanGenie paper) of duplicated variants are calculated for different methods:

Method	All	SNP	INDEL
no merge	0.6517	0.5251	0.6574
`bcftools norm --rm-dup exact`	0.6733	0.5902	0.6715
`--keep all`	0.8180	0.8681	0.8060
`--keep all`, `--missing conflict`	0.8145	0.8681	0.8023
`--keep first`	0.8509	0.8681	0.8409
`--keep first`, `--missing conflict`	0.8483	0.8681	0.8380

Based on the above results:

bcftools norm does not work well for pangenome VCF
Merging genotypes can significantly improve the genotype concordance when comparing with variants called from a linear reference genome.

I also tested whether 1 variant were split into multi VCF records in the pangenome VCF. I calculated the allelic difference of duplicated variants between the pangenome VCF and 1000G VCF (i.e., AC_HPRC - AC_1000G). Without variant merging (right panel), variants in the pangenome VCF have smaller AC than those in 1000G VCF, and merging the duplicated variants can make the difference more centralized to 0 (left panel), suggesting it recovers some splitted variants in the pangenome VCF. bcftools norm does not affect the allelic difference a lot as it only drops duplicated variants (middle panel).

Besides, variants in the merged pangenome VCF tends to have higher AC than the 1000G VCF, I am thinking whether this is due to:

2 variants are incorrectly merge into one (e.g., variant nested in SV merged with another one not in SV)
the 1000G VCF may drop variants in some samples as it could be processed by bcftools norm --rm-dup (I didn't confirm how they remove the duplicates)

Do you have any comment on this?

You may find the following links for the VCF used in the above analysis:

hprc-v1.1-mc-grch38.vcfwave.chr20.target_var.vcf.gz.txt 1kGP_high_coverage_Illumina.chr20.target_var.vcf.gz.txt

glennhickey commented 19 hours ago

Wow, thank you for this. This looks brilliant, and very convincing. I'm merging in the new vclib that purports to fix your vcfwave issue and would like to finally getting around to pulling this in too.

I'm still uncertain as to the best approach for the conflicts and struggle to understand why it should be an issue -- it seems like if there is a conflict then bcftools norm maybe shouldn't have left-aligned it to begin with.... That said your best-scoring option, --keep first, sounds reasonable.

Anyway, I'll make a PR soon with this incorporated and let you know. I'm working on the next HPRC release now and think it'd be great to have this, if possible, used to clean up the released VCFs.

Han-Cao commented 7 hours ago

Glad to hear that it can help! Please let me know if you encounter any bugs or if any modifications are needed when incorporating the script into the pipeline. I'm happy to make revisions.

ComparativeGenomicsToolkit / cactus

Duplicated variants in a pangenome VCF #1493