SACGF / variantgrid

VariantGrid public repo
Other
23 stars 2 forks source link

VCF pre-processing - change chrom/contig headers #811

Open davmlaw opened 1 year ago

davmlaw commented 1 year ago

In vcf_clean_and_filter.py we rename the chroms / contigs etc.

We should also rename the header as that is giving us lots of errrors/warnings downstream

davmlaw commented 1 year ago

Tried generating mapping files:

from snpdb.models import GenomeBuild, Contig
builds = {"GRCh37": GenomeBuild.grch37(), "GRCh38": GenomeBuild.grch38()}

for name, genome_build in builds.items():
    filename = f"snpdb/genome/reference/chrom_contig_mapping_{name.lower()}.txt"
    with open(filename, "w") as f:
        for c in genome_build.contigs:
            if c.refseq_accession != 'na':
                f.write(f"{c.name} {c.refseq_accession}\n")
                if c.ucsc_name not in (None, 'na'):
                    f.write(f"{c.ucsc_name} {c.refseq_accession}\n")

Then added to pipeline:

bcftools annotate --rename-chrs /home/dlawrence/localwork/variantgrid/snpdb/genome/reference/chrom_contig_mapping_grch38.txt 

This still caused issues though. I also switched out vt for bcftools:

cat /tmp/tso500_cnv_decomposed.vcf | bcftools norm --check-ref=w --fasta-ref /data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz 

And that had same issue

bcftools: bgzf.c:2556: bgzf_useek: Assertion `fp->block_offset <= fp->block_length' failed.
davmlaw commented 1 year ago

Old issue:

VT is currently dying with CNV VCFs (see isssue #54), due to

[W::vcf_parse] Contig 'NC_000005.9' is not defined in the header. (Quick workaround: index the file with tabix.)
vt: bgzf.c:2404: bgzf_useek: Assertion `fp->block_offset <= fp->block_length' failed.

I found that if I add the contig to the header, it works fine.

The VCF header is not replaced as we are doing it manually in vcf_clean_and_filter.py

We should instead switch to using bcftools annotate as that also does the header (and is more standard etc)

So the crash was due to my fasta. I unzipped then bgzipped it again.

But I think I'll keep this change in, as it reduces the warnings by about 26 lines per VCF import