freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
94 stars 6 forks source link

Assertion error for some variants with irregular GT value #1

Closed emmaver closed 9 months ago

emmaver commented 1 year ago

Hello,

I am having issues running bcftools +liftover for some of our samples, which are exome VCF files from single individuals. I am running bcftools 1.16 / Using htslib 1.16.

I get the following error for 2/3 of our VCFs:

Assertion failed: (len == n_als * (n_als + 1) / 2), function update_AGR_record, file liftover.c, line 1471.

I was able to identify some of the variants that were causing this error, and it seemed to be cases with irregular genotype GT=1 where one would actually expect a diploid call (not on chromosome X, Y or M). After some more testing, the problem seemed to be with the value of the PL key and only for some cases.

In the VCF header, PL is defined as "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification", and should have 3 values for biallelic sites corresponding to AA,AB,BB. The problematic variants only have 2 values, but only some of the described cases are problematic and others are not, depending on the genomic position.

For example (only showing VCF columns CHROM, POS, INFO, FORMAT, SAMPLE):

## this variant is causing the error
chr14   106055787  SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1:99:19:17:0:0,17:0,6:0,11:-28.3:PASS:297,0

## this variant is not
chr14  76959815  SNVHPOL=3;MQ=60  GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL     1:99:16:16:0:0,16:0,10:0,6:-24.5:PASS:255,0

Thanks in advance for any help!

freeseek commented 1 year ago

Hmmmmmmmmmmmmmmmm. Both variants should be causing an error.

It is not correct for PL to have two values if the genotype is haploid. How did you generate the VCF? Also, the development version at https://software.broadinstitute.org/software/score/ should give better error messages now.

Also notice that the ADF/ADR/SB fields will not be lifted over correctly. You could remove the problematic fields with bcftools annotate -x. But for sequencing VCFs my recommendation is to realign the sequencing data rather than performing a liftover.

Giulio

On Tue, May 9, 2023, 5:36 AM emmaver @.***> wrote:

Hello,

I am having issues running bcftools +liftover for some of our samples, which are exome VCF files from single individuals. I am running bcftools 1.16 / Using htslib 1.16.

I get the following error for 2/3 of our VCFs:

Assertion failed: (len == n_als * (n_als + 1) / 2), function update_AGR_record, file liftover.c, line 1471.

I was able to identify some of the variants that were causing this error, and it seemed to be cases with irregular genotype GT=1 where one would actually expect a diploid call (not on chromosome X, Y or M). After some more testing, the problem seemed to be with the value of the PL key and only for some cases.

In the VCF header, PL is defined as "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification", and should have 3 values for biallelic sites corresponding to AA,AB,BB. The problematic variants only have 2 values, but only some of the described cases are problematic and others are not, depending on the genomic position.

For example (only showing VCF columns CHROM, POS, INFO, FORMAT, SAMPLE):

this variant is causing the error

chr14 106055787 SNVHPOL=4;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1:99:19:17:0:0,17:0,6:0,11:-28.3:PASS:297,0

this variant is not

chr14 76959815 SNVHPOL=3;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL 1:99:16:16:0:0,16:0,10:0,6:-24.5:PASS:255,0

Thanks in advance for any help!

— Reply to this email directly, view it on GitHub https://github.com/freeseek/score/issues/1, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6TUTEJP6KU6EE4NOUY4DXFIF3VANCNFSM6AAAAAAX3BD4JI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

emmaver commented 1 year ago

Hi Giulio,

Thanks for the quick answer! The VCF was generated by the Strelka variant calling pipeline, run by the company that did our sequencing. We are currently discussing with them to clarify these weird occurrences. In the meantime I can do liftover after removing the PL field, thank you for pointing out bcftools annotate -x.

The development version indeed tells me immediately that the problem was with PL. It doesn't show which variant is causing the error though. This would be a nice addition to the plugin if feasible :)

Best, Emma