SACGF / variantgrid

VariantGrid public repo

VCF - support genomes (WGS) by pre-processing by gnomad population frequency #521

Closed davmlaw closed 2 years ago

davmlaw commented 3 years ago

Andreas suggested people may want to look at the "low hanging fruit", e.g. exomes, variants below X% in gnomAD, etc.

At the moment all genotypes for a VCF are stored in 1 big partition.

If we pre-processed the VCF into exome regions, or into below/above gnomAD thresholds etc, we could make much more efficient queries by only asking for those partitions
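A minimal sketch of the partition idea (hypothetical names, not the actual variantgrid schema): genotypes are bucketed by gnomAD AF at load time, and a query only scans the buckets it needs.

```python
from collections import defaultdict

RARE, COMMON = "rare", "common"

def bucket_for(gnomad_af, threshold=0.05):
    """Assign a genotype record to a partition by population frequency.
    Variants missing from gnomAD are treated as rare."""
    return COMMON if gnomad_af is not None and gnomad_af > threshold else RARE

class PartitionedGenotypes:
    """Hypothetical in-memory stand-in for partitioned genotype tables."""
    def __init__(self):
        self.partitions = defaultdict(list)

    def add(self, variant_id, gnomad_af):
        self.partitions[bucket_for(gnomad_af)].append((variant_id, gnomad_af))

    def query(self, buckets):
        # Only touch the requested partitions - the whole point of the split
        for b in buckets:
            yield from self.partitions[b]

store = PartitionedGenotypes()
store.add("v1", 0.30)   # common
store.add("v2", 0.001)  # rare
store.add("v3", None)   # not in gnomAD -> rare
rare_only = list(store.query([RARE]))
```

A rare-only query never pays the cost of scanning the common genotypes, which is where the speedup below comes from.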

davmlaw commented 2 years ago

For diagnostics, their use of population frequency filters is:

At the variant level, most variants are rare - so a smaller table including only gnomAD < 5% etc. isn't that useful

Variant Level (Annotation latest 38)

total: 4,450,802

Sample level

Total: 931,858
<= 5%: 107,957 (11.5%)
<= 1%: 86,427 (9.27%)
<= 0.1%: 71,195 (7.6%)

So, splitting the sample level into buckets according to gnomAD may be the biggest win - then we just request the rare ones from the sample level

Population node filtering is obviously much quicker if the input size is smaller

Seems worth doing at a 5% gnomad_af level I think (gnomad_af is <= popmax, so we can always apply a popmax filter afterwards)

davmlaw commented 2 years ago

We won't be reloading all VCFs - so we need a flag on the VCF model that indicates whether the data has been split into rare/other etc.

We can then reload VCFs as needed

On the node, we need to be able to request that the sample be filtered to rare only. We can then pass that onto the sample/VCF, which can handle it based on the flag above
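The flag/fallback logic above might look something like this (hypothetical field and function names; the real model will differ):

```python
class VCF:
    """Hypothetical stand-in for the VCF model."""
    def __init__(self, split_into_rare_common=False):
        # Flag set once the VCF has been (re)loaded with the rare/common split
        self.split_into_rare_common = split_into_rare_common

def genotype_partitions(vcf, rare_only=False):
    """Which partitions a node query should touch. A rare-only request can
    use just the small partition only if the VCF was actually loaded with
    the split; otherwise fall back to the legacy single partition (and the
    node filters by AF itself)."""
    if vcf.split_into_rare_common:
        return ["rare"] if rare_only else ["rare", "common"]
    return ["all"]  # legacy un-split VCF
```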

davmlaw commented 2 years ago

Write files that contain common gnomAD variants

bcftools view -i "AF>0.05" gnomad_GRCh37_combined_af.vcf.bgz --output-file gnomad_GRCh37_af_greater_than_5.vcf.bgz
tabix gnomad_GRCh37_af_greater_than_5.vcf.bgz
bcftools view -i "AF>0.05" gnomad_GRCh38_combined_af.vcf.bgz --output-file gnomad_GRCh38_af_greater_than_5.vcf.bgz
tabix gnomad_GRCh38_af_greater_than_5.vcf.bgz

Our fasta uses RefSeq accessions (NC_000001.10), so we convert all our VCFs into that format so we can normalize with vt

So, we need our gnomAD in that format too.

Create contig map files:

from snpdb.models import GenomeBuild

for genome_build in GenomeBuild.builds_with_annotation():
    filename = f"chr_contig_{genome_build.name}.map"
    with open(filename, "w") as f:
        for c in genome_build.contigs:
            f.write(f"{c.name} {c.refseq_accession} {c.name}\n")
    print(f"Wrote {filename}")

bcftools annotate -x ^INFO/AF --rename-chrs chr_contig_GRCh37.map --output-type z --output gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz gnomad_GRCh37_af_greater_than_5.vcf.bgz
bcftools annotate -x ^INFO/AF --rename-chrs chr_contig_GRCh38.map --output-type z --output gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz gnomad_GRCh38_af_greater_than_5.vcf.bgz

You could pipe the bcftools commands at the top and bottom of this comment together to make it quicker
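The --rename-chrs step is just a substitution on the CHROM column; a rough pure-Python equivalent (hypothetical helpers, not the actual pipeline code) for the map file format written above:

```python
def load_contig_map(lines):
    """Parse map lines of the form 'name refseq_accession name' (as written
    by the loop above); only the first two columns are used for renaming."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping

def rename_vcf_chrom(vcf_lines, mapping):
    """Rewrite the CHROM column of VCF body lines; header lines pass through.
    Unknown contigs are left unchanged."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
        else:
            chrom, rest = line.split("\t", 1)
            yield mapping.get(chrom, chrom) + "\t" + rest
```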

davmlaw commented 2 years ago

There are two ways to separate them: either annotate (then split while parsing the VCF), or split the VCF into two files up front. I used annotation, but isec is worth trying - it might make it smaller and faster. See https://samtools.github.io/bcftools/bcftools.html#isec

OK, have added AF (for > 0.05) onto variants during import.

DECISIONS:

Need to work out how to store it: adding a column, or splitting rare/common into separate cohort genotype collection partitions.

Decision: splitting into partitions

TODO:

davmlaw commented 2 years ago

Working in branch cohort_genotype_rare_common

Benchmark of 5 runs (after 2 warm-up runs) using the hss2008 trio - old: 12.85s, rare/common: 8.96s. So ~30% faster. This seems worth the extra complexity. It'll also be much faster on really big samples and at the lower nodes
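The ~30% figure checks out against the two timings:

```python
old, new = 12.85, 8.96  # seconds, from the benchmark above
speedup = (old - new) / old
print(f"{speedup:.1%}")  # prints 30.3%
```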

Something I didn't think of: we have an option on the population node which allows variants classified as pathogenic to come through regardless of frequency - this means we can't use our trick in that case...

You could check whether the common variants contain a pathogenic that's above 5% frequency (surely not many)

# Import paths assumed from the variantgrid layout
from classification.models import Classification
from snpdb.models import CohortGenotypeCollection, GenomeBuild, Variant

pathogenic_qs = Classification.objects.filter(clinical_significance__in=['4', '5'])
variant_q = Classification.get_variant_q_from_classification_qs(pathogenic_qs, GenomeBuild.grch37())
# TODO: Could filter pathogenics down to those that are above 5%

cgc = CohortGenotypeCollection.objects.get(pk=330)  # The common one
qs = Variant.objects.all()
qs.annotate(**cgc.get_annotation_kwargs()).filter(variant_q).exists()

Depending on how long this takes (it's probably super fast), you may want to cache it. Seeing as not many classifications are withdrawn, you could store a flag on the common collection saying whether it contains a pathogenic, so you don't need to check again. Could also store the date checked, or maybe the max classification pk.
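A minimal sketch of that caching idea (hypothetical fields, not real model columns): store the check result on the common collection keyed by the highest classification pk seen, and only re-run the check when newer classifications exist.

```python
from dataclasses import dataclass

@dataclass
class CommonCollectionCache:
    """Hypothetical cached state stored on the common genotype collection."""
    contains_pathogenic: bool = False
    checked_up_to_classification_pk: int = 0

def needs_recheck(cache, max_classification_pk):
    # Only re-run the pathogenic check if classifications newer than the
    # last check exist
    return max_classification_pk > cache.checked_up_to_classification_pk

def update_cache(cache, contains_pathogenic, max_classification_pk):
    cache.contains_pathogenic = contains_pathogenic
    cache.checked_up_to_classification_pk = max_classification_pk
```

This deliberately ignores withdrawals (a withdrawn classification won't clear the flag until the next check), which matches the "not many are withdrawn" assumption above.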

Another option is to move pathogenic variants over 5% into the rare partition, for all collections that contain them

davmlaw commented 2 years ago

apt-get install bcftools

Either download from:

# Copy to VEP/annotation_data/GRCh37
wget http://variantgrid.com/static/annotation_download/GRCh37/gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz
wget http://variantgrid.com/static/annotation_download/GRCh37/gnomad_GRCh37_af_greater_than_5.contigs.vcf.bgz.tbi

# Copy to VEP/annotation_data/GRCh38
wget http://variantgrid.com/static/annotation_download/GRCh38/gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz
wget http://variantgrid.com/static/annotation_download/GRCh38/gnomad_GRCh38_af_greater_than_5.contigs.vcf.bgz.tbi

Or create yourself via annotation/annotation_data/generate_gnomad_upload_preprocess.sh