Closed (bschilder closed this 2 years ago)
Just tried an alternative function instead of read_vcf, and noticed some things:
path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
vcf <- VariantAnnotation::readVcf(file = path)
print(vcf)
The vcf object contains multiple fields in geno(vcf). One of them is "AF" (alternate allele frequency), and another is "SI" (imputation accuracy).
Given that our original method is rather messy and prone to missing a lot of these pieces of info, I think we should move towards using VariantAnnotation::readVcf. I'll try to write a new vcf function that does this, and then converts to data.table format.
class: CollapsedVCF
dim: 11734353 1
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: AF
info(header(vcf)):
      Number Type  Description
   AF A      Float Allele Frequency
geno(vcf):
  List of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID
geno(header(vcf)):
      Number Type   Description
   ES A      Float  Effect size estimate relative to the alternative allele
   SE A      Float  Standard error of effect size estimate
   LP A      Float  -log10 p-value for effect estimate
   AF A      Float  Alternate allele frequency in the association study
   SS A      Float  Sample size used to estimate genetic effect
   EZ A      Float  Z-score provided if it was used to derive the EFFECT and SE fields
   SI A      Float  Accuracy score of summary data imputation
   NC A      Float  Number of cases used to estimate genetic effect
   ID 1      String Study variant identifier
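Since LP stores -log10(p) rather than the p-value itself, turning these geno fields into conventional summary-statistics columns takes a small transformation. A minimal base-R sketch of that arithmetic, using made-up values rather than rows from this VCF; the output column names P and Z are illustrative assumptions, not names taken from read_vcf:

```r
# Toy values standing in for three rows of the geno fields (not real data).
gwas <- data.frame(
  ES = c(0.02, -0.05, 0.10),  # effect size relative to the alternative allele
  SE = c(0.01, 0.02, 0.05),   # standard error of the effect size
  LP = c(1.30103, 2, 8)       # -log10 p-value
)

# LP is -log10(p), so invert it to recover the p-value.
gwas$P <- 10^(-gwas$LP)

# A Z-score can be derived from the effect size and its standard error.
gwas$Z <- gwas$ES / gwas$SE

print(gwas)
```

For very large LP values, 10^(-LP) can underflow to 0 in double precision, so it may be worth keeping LP alongside P.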
Definitely makes sense to move over to a more robust read_vcf version. On allele frequency (AF): how does AF differ from minor allele frequency? I assumed they were the same, which is why I picked up AF as INFO.
We discussed this in person, but just to document here: INFO can contain different fields that vary across files and within files (across rows). This means that using a simple parsing strategy is likely to make a lot of mistakes. For this reason, I rewrote read_vcf (and various other support functions) to take advantage of VariantAnnotation, which robustly parses all fields of any given VCF. read_vcf can write/return as VCF or data.table (via the vcf2df internal function).
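To illustrate why INFO fields that vary across rows trip up simple parsing, here is a minimal base-R sketch that parses INFO by key rather than by position. It is not how VariantAnnotation or vcf2df work internally; parse_info is a hypothetical helper and the INFO strings are toy values:

```r
# Toy INFO strings: keys differ across rows, and one entry is a valueless flag.
info <- c("AF=0.31;SI=0.98", "SI=0.85", "AF=0.05;DB", ".")

# Hypothetical helper: split one INFO string into a named character vector.
parse_info <- function(x) {
  if (is.na(x) || x == ".") return(character(0))
  parts <- strsplit(x, ";", fixed = TRUE)[[1]]
  has_value <- grepl("=", parts, fixed = TRUE)
  keys <- sub("=.*$", "", parts)
  # Flags such as "DB" carry no value, so record them as "TRUE".
  vals <- ifelse(has_value, sub("^[^=]*=", "", parts), "TRUE")
  stats::setNames(vals, keys)
}

parsed <- lapply(info, parse_info)
all_keys <- unique(unlist(lapply(parsed, names)))

# One column per key; rows lacking a key get NA instead of a shifted value.
info_df <- as.data.frame(
  lapply(stats::setNames(all_keys, all_keys), function(k) {
    vapply(parsed, function(p) if (k %in% names(p)) p[[k]] else NA_character_,
           character(1))
  }),
  stringsAsFactors = FALSE
)
print(info_df)
```

A positional split on ";" would have placed the SI value of the second row into the AF column. Real VCFs add further complications (comma-separated multi-allelic values, typed fields declared in the header), which is the argument for delegating to a dedicated parser.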
The downside is that read_vcf is now much slower (9 min, compared with <1 min previously, for 11M variants). I've been urging the VariantAnnotation maintainers to improve the efficiency of these functions so we can reduce this compute time.
https://github.com/Bioconductor/VariantAnnotation/issues/57
1. Bug description
I've applied these fixes to MungeSumstats v1.3.20
check_info_score step.
check_info_score: log_files$info_filter in these instances.
Expected behaviour
Don't apply INFO score filtering when the INFO col actually contains allele frequency (AF).
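The expected behaviour can be sketched as a guard around the filter. This is a minimal illustration, not the MungeSumstats implementation of check_info_score: filter_info, the info_is_af flag, and the 0.9 threshold are all assumptions made for the example, and the data are toy values:

```r
# Hypothetical helper, not the MungeSumstats implementation of check_info_score.
# `info_is_af` is an assumed flag that the caller sets when the INFO column is
# known to hold allele frequency rather than an imputation INFO score.
filter_info <- function(dat, info_is_af = FALSE, min_info = 0.9) {
  if (!"INFO" %in% names(dat) || info_is_af) {
    # Nothing to filter on, or INFO is really AF: keep every SNP.
    return(dat)
  }
  dat[!is.na(dat$INFO) & dat$INFO >= min_info, , drop = FALSE]
}

# Toy data in which INFO actually holds allele frequencies.
dat <- data.frame(SNP = c("rs1", "rs2", "rs3"), INFO = c(0.05, 0.45, 0.95))

n_naive   <- nrow(filter_info(dat))                     # AF treated as INFO score
n_guarded <- nrow(filter_info(dat, info_is_af = TRUE))  # filter skipped
print(c(naive = n_naive, guarded = n_guarded))
```

In the toy data the unguarded call keeps only one of three SNPs, because most allele frequencies fall below an imputation-quality threshold; the guarded call keeps all of them.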
2. Reproducible example
Code
Console output: MungeSumstats 1.3.19
% SNPs kept by the end was very low:
Console output: MungeSumstats 1.3.20
After applying some fixes, this is the result. Drastically more SNPs are kept:
3. Session info