Open Han-Cao opened 1 month ago
You can try merging them with vcfcreatemulti
. This tools is part of vcftools and should be available in the cactus docker image.
I have tried to use vcfcreatemulti
to merge them. The output is:
chr22 15409341 >45360363>45360365 G GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA,GAGTGTTTGTCCCACACAAAGGATTGCAGAACAATGTTGCTGGGTTCTTAGTGTTTGACCCTCACATAGGATTCCAGAACATAGCTGCTGGTTCAA 60 . AC=1;AF=0.0113636;AN=88;AT=>45360363>45360365,>45360363<45360364>45360365;NS=45;LV=0;combined=15409341-15409341 GT 0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 .|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
But there are 2 problems:
I was planning to do something like:
chr1 100 var1 G GAA 0|0 0|1 1|0
chr1 100 var1 G GAA 0|0 1|0 0|0
merge into
chr1 100 var1 G GAA 0|0 1|1 1|0
If you think this strategy is OK to handle duplicated variants in MC output VCF, I can write some script to do it.
Many thanks.
That's disappointing about vcfcreatemulti
. I was hoping it was to do something like what you said you were planning to do, which seems fine. If you do come up with a general script for doing this, please share and I will look at incorporating it into cactus...
I just wrote a prototype python script. You can find it here
It is now designed for phased VCF only:
For below VCF:
#CHROM POS ID REF ALT INFO FORMAT CHM13 Sample1 Sample2 Sample3
chr1 5 var1 A TTT AN=7;AF=0.285714;AC=2 GT 0 0|0 1|0 0|1
chr1 5 var2 A TTT AN=7;AF=0.142857;AC=1 GT 1 0|0 0|0 0|0
chr1 5 var3 A TTT AN=6;AF=0;AC=0 GT . 0|0 0|0 0|0
chr1 5 var4 A TTT AN=7;AF=0.142857;AC=1 GT 0 0|0 1|0 0|0
chr1 5 var5 A TTT AN=7;AF=0.285714;AC=2 GT 0 0|0 0|1 0|1
chr1 5 var6 A C AN=6;AF=0.333333;AC=2 GT 0 0|0 1|0 .|1
chr1 6 var7 A TTT AN=6;AF=0.333333;AC=2 GT 0 0|0 1|0 .|1
It output
#CHROM POS ID REF ALT INFO FORMAT CHM13 Sample1 Sample2 Sample3
chr1 5 var1 A TTT AN=7;AF=0.428571;AC=3;DUP_ID=var2:var3 GT 1 0|0 1|0 0|1
chr1 5 var4 A TTT AN=7;AF=0.428571;AC=3;DUP_ID=var5 GT 0 0|0 1|1 0|1
chr1 5 var6 A C AN=6;AF=0.333333;AC=2 GT 0 0|0 1|0 .|1
chr1 6 var7 A TTT AN=6;AF=0.333333;AC=2 GT 0 0|0 1|0 .|1
I also tested the current script on HPRCv1.1 chr22:
> bcftools norm -r chr22 -f ref.fa hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz -Ou | bcftools sort -Oz -o hprc.chr22.vcf.gz
> python merge_duplicates.py -i hprc.chr22.vcf.gz -o hprc.chr22.out.vcf.gz
[2024-10-04 11:35:01] - [INFO]: Merge duplicated variants from: hprc.chr22.vcf.gz
[2024-10-04 11:35:01] - [WARNING]: missing genotype conflict at chr22:15226565: >45350772>45350778_5 and >45350778>45350782_3
...
[2024-10-04 11:35:18] - [INFO]: Read 439622 variants
[2024-10-04 11:35:18] - [INFO]: 2108 variants are merged
[2024-10-04 11:35:18] - [WARNING]: 28 variants have non-missing genotypes merged with missing genotypes
[2024-10-04 11:35:18] - [INFO]: 440 variants are not merged due to haplotype conflict
[2024-10-04 11:35:18] - [INFO]: Write 437514 variants to hprc.chr22.out.vcf.gz
I am still considering how to handle duplicate variants with conflict genotypes:
bcftools norm
, make sure no duplicate in output)For duplicate variants with 1|0 vs 1|0, do you think these are really redundantly called by cactus or different variants with the same VCF representation due to the limitation of VCF?
Wow, this looks really promising, thanks for sharing! I will take a closer look when I have a free minute (which unfortunately won't be for a few days), but am very interested in incorporating it if possible into cactus.
Just writing to say: I haven't forgotten about this -- still look forward to trying it out and hopefully incorporating it or something like it in the pipeline once I get the chance.
Please take your time. If you have any suggestions on the script, I am happy to revise.
Besides, I added some options (--keep
and --missing
) to allow the script to have different behavior when merging conflicting genotypes. Its default output is still the same as the previous version. I will try to do some evaluation with different parameters and will let you know the results.
Hi @glennhickey ,
I just did some analysis to evaluate different methods for merging duplicated variants.
I merged duplicated variants in the biallelic HPRC v1.1 VCF (decomposed by vcfwave
) and compared its genotypes with that in the 1000G 30X release. All analyses are performed on chr20 using the overlapped samples between HPRC and 1000G (n = 39). Only duplicated variants in the HPRC VCF are included.
The variant merging strategies are:
bcftools norm --rm-dup exact
: only keep the first one and drop all the other duplicatesmerge_duplicates.py --keep all
: merge phased genotypes if no conflict in any samples, keep duplicates that cannot be merged into the first onemerge_duplicates.py --keep all --missing conflict
: similar as the above one, but consider missing genotype and non-missing genotype as conflict (i.e., 1|. vs 0|0)merge_duplicates.py --keep first
: merge phased genotypes if no conflict in any samples, drop duplicates that cannot be merged into the first onemerge_duplicates.py --keep first--missing conflict
: similar as the above one, but consider missing genotype and non-missing genotype as conflictThe weighted genotype concordance (as defined in the PanGenie paper) of duplicated variants are calculated for different methods:
Method | All | SNP | INDEL |
---|---|---|---|
no merge | 0.6517 | 0.5251 | 0.6574 |
bcftools norm --rm-dup exact |
0.6733 | 0.5902 | 0.6715 |
--keep all |
0.8180 | 0.8681 | 0.8060 |
--keep all , --missing conflict |
0.8145 | 0.8681 | 0.8023 |
--keep first |
0.8509 | 0.8681 | 0.8409 |
--keep first , --missing conflict |
0.8483 | 0.8681 | 0.8380 |
Based on the above results:
bcftools norm
does not work well for pangenome VCF I also tested whether 1 variant were split into multi VCF records in the pangenome VCF. I calculated the allelic difference of duplicated variants between the pangenome VCF and 1000G VCF (i.e., AC_HPRC - AC_1000G
). Without variant merging (right panel), variants in the pangenome VCF have smaller AC than those in 1000G VCF, and merging the duplicated variants can make the difference more centralized to 0 (left panel), suggesting it recovers some splitted variants in the pangenome VCF. bcftools norm
does not affect the allelic difference a lot as it only drops duplicated variants (middle panel).
Besides, variants in the merged pangenome VCF tends to have higher AC than the 1000G VCF, I am thinking whether this is due to:
bcftools norm --rm-dup
(I didn't confirm how they remove the duplicates)Do you have any comment on this?
You may find the following links for the VCF used in the above analysis:
hprc-v1.1-mc-grch38.vcfwave.chr20.target_var.vcf.gz.txt 1kGP_high_coverage_Illumina.chr20.target_var.vcf.gz.txt
Wow, thank you for this. This looks brilliant, and very convincing. I'm merging in the new vclib that purports to fix your vcfwave issue and would like to finally getting around to pulling this in too.
I'm still uncertain as to the best approach for the conflicts and struggle to understand why it should be an issue -- it seems like if there is a conflict then bcftools norm
maybe shouldn't have left-aligned it to begin with.... That said your best-scoring option, --keep first
, sounds reasonable.
Anyway, I'll make a PR soon with this incorporated and let you know. I'm working on the next HPRC release now and think it'd be great to have this, if possible, used to clean up the released VCFs.
Glad to hear that it can help! Please let me know if you encounter any bugs or if any modifications are needed when incorporating the script into the pipeline. I'm happy to make revisions.
Hi,
In the output VCF of MC pipeline, there are a few duplicated variants after left-algin. Below is one example from the HPRC VCF:
Before left-align (
bcftools view -r chr22:15409352-15409435 hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz
):After
bcftools norm -f ref.fa
, these 2 insertions are exactly the same:Does is mean these 2 are the same insertion?
As the genotypes of these 2 insertions are different, to remove duplicated records from the VCF, should I keep one of them (like
bcftools norm --rm-dup exact
) or merge their genotypes (i.e., 2 AC=1 records merge into 1 AC=2 record)?Thanks a lot!