Closed husamia closed 8 months ago
Hi, @husamia,
It seems either BAM or the reference file is a corrupted BGZF file. Could you try to decompress the BAM and reference file and re-index them to see if the issue persists?
The issue is also probably an issue in WhatsHap reported by other users before or an issue with the reference reported here.
I had a similar issue, but after installing everything from source (including whatshap v2.3-dev with py3.11 instead of v1.7 with py3.9), the error went away. The whatshap changelog doesn't indicate this would be a fixed issue, but could try update the clair3 docker with whatshap v2.2 and see?
The warning message seems to be ignored and I can still get the results. I wasn't sure if it affects the results. I suspect some reads in the BAM are not being encoded properly! I tested that the reference archive was not corrupted by extracting it. I suspect this is BAM related. Those are long reads from NAnopore.
So the error still appears seemingly at random (maybe 60% of the time using different bam but the same reference) even with whatshap v2.2. When it does appear, clair3 fails because there are no phased variants to use in later steps.
However, it appears that using pysam.FastaFile
instead of pyfaidx.Fasta
within whatshap fixes this, as pysam appears to have correct bgzf support for fetching sequence. Sort of a painful solution if you are installing whatshap through conda (you can edit the installed files directly), but it is doable.
I tested this directly with pyfaidx, so I am confident this is an issue with pyfaidx loading from gzi indices rather than corrupted references or bam files, as the following code works with pysam.
import pyfaidx
f = pyfaidx.Faidx('<my bgzf reference>')
f.fetch('1',100,200)
>1:100-200
atcacatgactgatcatgcactgatcacgtgcctgatcatgcactgatcccgtggcagatcatgcactgatcacgtgcagatcatgcactcatcatgtggc
f.fetch('2',100,200)
Traceback (most recent call last):
... exact same error I get during clair3 ...
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xb0\x90\xee\xe5'; handle.tell() now says 37674
I reran a sample that finished successfully but had whatshap errors with the pysam replacement, and in the end it changed from 16.4 million variants to 16.6 million (1.3% increase). So not terrible, but substantial enough to care about.
I met the same error when running whatshap 2.3. It is also reported by the pyfaidx when fetching the reference sequence. I decompressed my gzipped fasta and the error disappeared.
command
using docker
docker run --rm -it -v /mnt/e/E:/EEE hkubal/clair3:latest bash
/opt/bin/run_clair3.sh --bam_fn=/data/41195.minimap2.bam --ref_fn=/data/Homo_sapiens_assembly38.fasta.gz --output=/data/Clair3_41195 --remove_intermediate_dir --enable_long_indel --threads=40 --platform=ont --model_path=/opt/models/r941_prom_sup_g5014 --sample_name=41195
Error