For diagnostics, population frequency filters come into play at two levels. At the variant level, most variants are rare, so a smaller table restricted to gnomAD AF < 5% etc. would barely shrink and isn't that useful.
Variant level (latest annotation, GRCh38): total 4,450,802

Sample level:

| gnomAD AF filter | Records | % of total |
|------------------|---------|------------|
| Total            | 931,858 | 100%       |
| ≤ 5%             | 107,957 | 11.5%      |
| ≤ 1%             | 86,427  | 9.27%      |
| ≤ 0.1%           | 71,195  | 7.6%       |
So, splitting the sample-level data into buckets by gnomAD frequency may be the biggest win - we can then request only the rare records at the sample level.
Population node filtering is obviously much quicker if the input size is smaller.

Seems worth doing at a 5% gnomad_af level, I think. Since overall gnomAD AF is always ≤ popmax AF, any variant passing a popmax filter at ≤ 5% must also have AF < 5% and so already sits in the rare bucket - meaning we can always apply a popmax filter afterwards.
We won't be reloading all VCFs, so we need a flag on the VCF model indicating whether its data has been split into rare/other buckets. We can then reload VCFs as needed.

On the node, we need to be able to request that the sample be filtered to rare variants only. That request can then be passed down to the sample/VCF, which handles it based on the flag above - see the sketch below.
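As a rough sketch of the moving parts (all field and method names below are hypothetical, for illustration only - not VariantGrid's actual models):

```python
# Hypothetical sketch - field/method names are assumptions, not the real schema.
from django.db import models

class VCF(models.Model):
    # Set once this VCF's genotypes have been split into rare/common partitions,
    # so un-split legacy imports can be detected and reloaded as needed.
    split_by_gnomad_af = models.BooleanField(default=False)

class PopulationNode(models.Model):
    vcf = models.ForeignKey(VCF, on_delete=models.CASCADE)
    max_gnomad_af = models.FloatField(null=True, blank=True)

    @property
    def can_use_rare_partition(self) -> bool:
        # Only restrict to the rare partition when the VCF was imported with
        # the split, and the requested threshold fits inside the AF < 5% bucket.
        return self.vcf.split_by_gnomad_af and \
            self.max_gnomad_af is not None and self.max_gnomad_af <= 0.05
```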
Write files that contain common gnomAD variants:

```sh
bcftools view -i "AF>0.05" gnomad_GRCh37_combined_af.vcf.bgz --output-type z --output-file gnomad_GRCh37_af_greater_than_5.vcf.bgz
tabix gnomad_GRCh37_af_greater_than_5.vcf.bgz

bcftools view -i "AF>0.05" gnomad_GRCh38_combined_af.vcf.bgz --output-type z --output-file gnomad_GRCh38_af_greater_than_5.vcf.bgz
tabix gnomad_GRCh38_af_greater_than_5.vcf.bgz
```
Our FASTA uses RefSeq accessions (e.g. NC_000001.10), so we convert all our VCFs into that contig naming to normalize with vt. So we need our gnomAD files in that format too.
Create contig map files:

```python
from snpdb.models import GenomeBuild

for genome_build in GenomeBuild.builds_with_annotation():
    filename = f"chr_contig_{genome_build.name}.map"
    with open(filename, "w") as f:
        for c in genome_build.contigs:
            f.write(f"{c.name} {c.refseq_accession} {c.name}\n")
    print(f"Wrote {filename}")
```
```sh
bcftools annotate -x ^INFO/AF --rename-chrs chr_contig_GRCh37.map --output-type z --output gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz gnomad_GRCh37_af_greater_than_5.vcf.bgz
bcftools annotate -x ^INFO/AF --rename-chrs chr_contig_GRCh38.map --output-type z --output gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz gnomad_GRCh38_af_greater_than_5.vcf.bgz
```
To make this quicker, the bcftools view step at the top of this comment could be piped straight into the bcftools annotate step above, skipping the intermediate file.
There are two ways to separate rare from common: annotate the VCF (then decide while parsing it), or physically split it into two VCFs. I used annotation, but bcftools isec is worth trying - it might make things smaller and faster. See https://samtools.github.io/bcftools/bcftools.html#isec
OK, I have added AF (for variants with AF > 0.05) onto variants during import.
DECISIONS:

- How to store it: add a column, or split rare/common into separate cohort genotype collection partitions. Decision: splitting into partitions (see the sketch below).
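For reference, a hypothetical sketch of what the split could look like (the table names, the per-collection inheritance layout, and the helper below are all assumptions, not the actual VariantGrid code):

```python
# Hypothetical sketch - table names are assumptions, not the real schema.
# Instead of one big genotype table per collection, create two child tables;
# rows are then routed into them at import time based on gnomAD AF < 0.05.
from django.db import connection

def create_rare_common_partitions(cgc_pk: int):
    """Create 'rare' and 'common' child tables for a cohort genotype collection."""
    with connection.cursor() as cursor:
        for bucket in ("rare", "common"):
            cursor.execute(
                f"CREATE TABLE snpdb_cohortgenotype_{cgc_pk}_{bucket} () "
                f"INHERITS (snpdb_cohortgenotype)"
            )
```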
TODO:
- Working in branch cohort_genotype_rare_common
Benchmark of 5 runs (after 2 warm-up runs) using the hss2008 trio: old - 12.85s, rare/common - 8.96s, i.e. ~30% faster. This seems worth the extra complexity. It'll also be much faster on really big samples and at the lower nodes.
Something I didn't think of: the population node has an option that lets variants classified as pathogenic through regardless of frequency - this means we can't use the rare-partition trick in that case...

You could check whether the common partition contains any classified-pathogenic variant that's above 5% frequency (surely not many):
```python
# Import paths are assumptions - adjust to wherever these models actually live.
from classification.models import Classification
from snpdb.models import CohortGenotypeCollection, GenomeBuild, Variant

pathogenic_qs = Classification.objects.filter(clinical_significance__in=['4', '5'])
variant_q = Classification.get_variant_q_from_classification_qs(pathogenic_qs, GenomeBuild.grch37())
# TODO: Could filter pathogenics down to those that are above 5%
cgc = CohortGenotypeCollection.objects.get(pk=330)  # The common one
qs = Variant.objects.all()
qs.annotate(**cgc.get_annotation_kwargs()).filter(variant_q).exists()
```
Depending on how long this takes (it's probably very fast), you may want to cache it. Since not many classifications are withdrawn, you could store a flag on the common variant collection recording whether it contains a pathogenic variant, so you don't need to check again. You could also store the date checked, or maybe the highest classification ID seen - see the sketch below.
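A minimal sketch of that caching idea (the `contains_pathogenic*` fields are assumed additions to CohortGenotypeCollection; the query is the one above, with the same assumed import paths):

```python
# Sketch only - contains_pathogenic / contains_pathogenic_checked are assumed fields.
from django.utils import timezone
from classification.models import Classification
from snpdb.models import CohortGenotypeCollection, GenomeBuild, Variant

def common_partition_contains_pathogenic(cgc: CohortGenotypeCollection) -> bool:
    """Cache whether this common (AF > 5%) partition contains any variant
    classified as pathogenic, so the check isn't re-run on every query."""
    if cgc.contains_pathogenic is None:  # assumed nullable BooleanField
        pathogenic_qs = Classification.objects.filter(clinical_significance__in=['4', '5'])
        variant_q = Classification.get_variant_q_from_classification_qs(
            pathogenic_qs, GenomeBuild.grch37())
        qs = Variant.objects.annotate(**cgc.get_annotation_kwargs())
        cgc.contains_pathogenic = qs.filter(variant_q).exists()
        cgc.contains_pathogenic_checked = timezone.now()  # assumed DateTimeField
        cgc.save()
    return cgc.contains_pathogenic
```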
Another option is to move pathogenic variants over 5% frequency into the rare bucket for all collections that contain them.
```sh
apt-get install bcftools
```

Either download the pre-built files:

```sh
# Copy to VEP/annotation_data/GRCh37
wget http://variantgrid.com/static/annotation_download/GRCh37/gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz
wget http://variantgrid.com/static/annotation_download/GRCh37/gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz.tbi

# Copy to VEP/annotation_data/GRCh38
wget http://variantgrid.com/static/annotation_download/GRCh38/gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz
wget http://variantgrid.com/static/annotation_download/GRCh38/gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz.tbi
```
Or create them yourself via annotation/annotation_data/generate_gnomad_upload_preprocess.sh
Andreas suggested people may want to look at the "low-hanging fruit", e.g. restricting to exome regions, below X% in gnomAD, etc.
At the moment, all genotypes for a VCF are stored in one big partition. If we pre-processed the VCF into exome regions, gnomAD-threshold buckets, etc., we could make much more efficient queries by only asking for the relevant partitions - see the sketch below.
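A hypothetical sketch of how a node might pick partitions under that scheme (the `bucket` and `region` fields, and the vcf relationship, are illustrative assumptions only):

```python
# Hypothetical sketch - "bucket"/"region" fields don't exist; illustrative only.
from snpdb.models import CohortGenotypeCollection  # assumed import path

def partitions_to_query(vcf, max_gnomad_af=None, exome_only=False):
    """Return only the genotype partitions a node actually needs to scan."""
    partitions = CohortGenotypeCollection.objects.filter(vcf=vcf)
    if max_gnomad_af is not None and max_gnomad_af <= 0.05:
        # Everything with gnomAD AF < 5% lives in the rare bucket
        partitions = partitions.filter(bucket="rare")
    if exome_only:
        partitions = partitions.filter(region="exome")
    return partitions
```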