Open davmlaw opened 1 year ago
Tried generating mapping files:
from snpdb.models import GenomeBuild, Contig
builds = {"GRCh37": GenomeBuild.grch37(), "GRCh38": GenomeBuild.grch38()}
for name, genome_build in builds.items():
filename = f"snpdb/genome/reference/chrom_contig_mapping_{name.lower()}.txt"
with open(filename, "w") as f:
for c in genome_build.contigs:
if c.refseq_accession != 'na':
f.write(f"{c.name} {c.refseq_accession}\n")
if c.ucsc_name not in (None, 'na'):
f.write(f"{c.ucsc_name} {c.refseq_accession}\n")
Then added to pipeline:
bcftools annotate --rename-chrs /home/dlawrence/localwork/variantgrid/snpdb/genome/reference/chrom_contig_mapping_grch38.txt
This still caused issues though. I also switched out vt for bcftools:
cat /tmp/tso500_cnv_decomposed.vcf | bcftools norm --check-ref=w --fasta-ref /data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
And that had same issue
bcftools: bgzf.c:2556: bgzf_useek: Assertion `fp->block_offset <= fp->block_length' failed.
Old issue:
VT is currently dying with CNV VCFs (see isssue #54), due to
[W::vcf_parse] Contig 'NC_000005.9' is not defined in the header. (Quick workaround: index the file with tabix.)
vt: bgzf.c:2404: bgzf_useek: Assertion `fp->block_offset <= fp->block_length' failed.
I found that if I add the contig to the header, it works fine.
The VCF header is not replaced as we are doing it manually in vcf_clean_and_filter.py
We should instead switch to using bcftools annotate
as that also does the header (and is more standard etc)
So the crash was due to my fasta. I unzipped then bgzipped it again.
But I think I'll keep this change in, as it reduces the warnings by about 26 lines per VCF import
In vcf_clean_and_filter.py we rename the chroms / contigs etc.
We should also rename the header as that is giving us lots of errrors/warnings downstream