freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License

Systematic IFFY tag when indels are not trimmed #3

Open sounkou-bioinfo opened 7 months ago

sounkou-bioinfo commented 7 months ago

Hi @freeseek, thanks a lot for the tools provided in this repo. I have a question regarding the munging plugin: indel variants appear to be systematically flagged as IFFY when the alleles are not trimmed (examples attached). Is this the expected behavior of the tool? Thanks.

freeseek commented 7 months ago

The problem here is that for those indels (and this is true for the majority of indels), it is not possible to tell which allele is the reference and which is the alternate by comparing the two alleles to the reference, as both alleles match the reference. This is one of the main reasons many summary statistics formats are flawed and should be replaced by something like the GWAS-VCF standard. In the case of your summary statistics file, do you think the file encodes any information that would allow recognizing which allele is the reference allele and which is the alternate allele? If so, show me what it looks like and I will update the BCFtools/munge plugin to handle this correctly.

sounkou-bioinfo commented 7 months ago

Thank you for your reply. As you may already know, one strategy here, as used in variant-calling benchmarking pipelines, would be to left-align indels, since different variant callers and datasets represent indels differently. BCFtools implements such a method for dealing with indels. I don't have any particular information on the summary statistics I tried to convert here, other than the fact that they were imputed on GRCh37 (I cannot determine which exact flavor of that build).

freeseek commented 6 months ago

This problem is not related to left aligning. Those three IFFY variants in your example could correspond to multiple variants encoded by the VCF specification as follows:

1   779310  .   TGA T   .   .   .
1   779310  .   T   TGA .   .   .
1   779797  .   GCTCC   G   .   .   .
1   779797  .   G   GCTCC   .   .   .
1   780347  .   TTTAA   T   .   .   .
1   780347  .   T   TTTAA   .   .   .

These are all different variants. How do you figure out which ones are those matching your summary statistics file?
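To make the ambiguity concrete, here is a minimal Python sketch, not the logic of the BCFtools/munge plugin. It assumes a made-up toy reference string (`REFERENCE`) and hypothetical helper names (`matches_reference`, `candidate_records`), and lists every REF/ALT orientation of an allele pair that is consistent with the reference. For an untrimmed indel pair both orientations survive, while for a SNP only one does.

```python
# Toy illustration only: REFERENCE is a made-up sequence, not GRCh37.
REFERENCE = "ACGTGAGATTC"  # hypothetical reference; positions below are 1-based


def matches_reference(reference: str, pos: int, allele: str) -> bool:
    """Return True if `allele` equals the reference bases starting at 1-based `pos`."""
    start = pos - 1
    return reference[start:start + len(allele)] == allele


def candidate_records(reference: str, chrom: str, pos: int, a1: str, a2: str):
    """List every (CHROM, POS, REF, ALT) orientation whose REF matches the reference."""
    records = []
    for ref, alt in ((a1, a2), (a2, a1)):
        if matches_reference(reference, pos, ref):
            records.append((chrom, pos, ref, alt))
    return records


# Untrimmed indel alleles TGA / T at position 4 of the toy reference:
# both "TGA" and "T" match the reference there, so both orientations remain.
indel = candidate_records(REFERENCE, "1", 4, "TGA", "T")

# A SNP at the same position for comparison: only "T" matches the reference.
snp = candidate_records(REFERENCE, "1", 4, "T", "C")

print(indel)
print(snp)
```

In this toy case the indel yields two candidate records (a deletion and an insertion) and the SNP yields one, which is why the allele pair alone cannot settle the orientation of such indels.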

sounkou-bioinfo commented 4 months ago

@freeseek, thank you for the reply. You are right that there is no obvious way to resolve this ambiguity as such. Only external information, such as reference allele frequency data, could potentially resolve it (provided the variants in the reference VCF file in that region are not themselves ambiguous). But this could be a very expensive solution, even if only 10% of the variants (mostly imputed indels) are affected.
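A rough Python sketch of that idea, under loud assumptions: `PANEL` is a hypothetical in-memory stand-in for a reference panel, its contents and frequencies are invented, and `resolve` and `both_orientations` are illustrative names rather than anything provided by BCFtools. An allele pair is resolved only when exactly one of its two orientations is present in the panel.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)

# Hypothetical reference panel keyed by record, with ALT allele frequency.
# Which records exist and their frequencies are invented for illustration.
PANEL: Dict[Record, float] = {
    ("1", 779310, "TGA", "T"): 0.12,
    ("1", 780347, "TTTAA", "T"): 0.30,
    ("1", 780347, "T", "TTTAA"): 0.05,
}


def both_orientations(chrom: str, pos: int, a1: str, a2: str) -> List[Record]:
    """Return the two possible REF/ALT assignments for an allele pair."""
    return [(chrom, pos, a1, a2), (chrom, pos, a2, a1)]


def resolve(candidates: List[Record], panel: Dict[Record, float]) -> Optional[Record]:
    """Return the single candidate found in the panel, or None if zero or several match."""
    present = [rec for rec in candidates if rec in panel]
    return present[0] if len(present) == 1 else None


resolved = resolve(both_orientations("1", 779310, "TGA", "T"), PANEL)
ambiguous = resolve(both_orientations("1", 780347, "TTTAA", "T"), PANEL)
missing = resolve(both_orientations("1", 779797, "GCTCC", "G"), PANEL)

print(resolved, ambiguous, missing)
```

When both orientations are present in the panel, comparing the panel frequencies with an allele frequency column from the summary statistics (if the file has one) might serve as a tie-breaker, but that is not shown here and would need a real panel lookup per variant, which is where the cost comes from.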