HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Value Error: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xf12\xdeB' #276

Closed husamia closed 8 months ago

husamia commented 9 months ago

command

Using Docker:

docker run --rm -it -v /mnt/e/E:/EEE hkubal/clair3:latest bash

/opt/bin/run_clair3.sh --bam_fn=/data/41195.minimap2.bam --ref_fn=/data/Homo_sapiens_assembly38.fasta.gz --output=/data/Clair3_41195 --remove_intermediate_dir --enable_long_indel --threads=40 --platform=ont --model_path=/opt/models/r941_prom_sup_g5014 --sample_name=41195

Error


# Working on contig chr2 in individual 41195
Found 176014 usable heterozygous variants (0 skipped due to missing genotypes)
Traceback (most recent call last):
  File "/opt/conda/envs/clair3/bin/whatshap", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/__main__.py", line 64, in main
    module.main(args)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/cli/phase.py", line 1169, in main
    run_whatshap(**vars(args))
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/cli/phase.py", line 493, in run_whatshap
    readset, vcf_source_ids = phased_input_reader.read(
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/cli/__init__.py", line 152, in read
    readset = readset_reader.read(chromosome, variants, bam_sample, reference, regions)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/variants.py", line 98, in read
    readset = self._make_readset_from_grouped_reads(grouped_reads)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/variants.py", line 104, in _make_readset_from_grouped_reads
    for group in groups:
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/variants.py", line 119, in _group_paired_reads
    for read in reads:
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/whatshap/variants.py", line 166, in _alignments_to_reads
    reference = reference[:]
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/pyfaidx/__init__.py", line 920, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/pyfaidx/__init__.py", line 1149, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/pyfaidx/__init__.py", line 727, in fetch
    seq = self.from_file(name, start, end)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/pyfaidx/__init__.py", line 769, in from_file
    self.file.seek(i.offset)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/Bio/bgzf.py", line 682, in seek
    self._load_block(start_offset)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/Bio/bgzf.py", line 643, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/opt/conda/envs/clair3/lib/python3.9/site-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
    raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xf12\xdeB'; handle.tell() now says 3840

zhengzhenxian commented 9 months ago

Hi, @husamia,

It looks like either the BAM or the reference file is a corrupted BGZF file. Could you decompress the BAM and the reference file, re-index them, and check whether the issue persists?

It may also be a WhatsHap issue that other users have reported before, or a problem with the reference like the one reported here.
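
One way to test the corruption hypothesis without extra tools is to walk the compressed file block by block. The following is a minimal sketch using only the Python standard library. The function name `first_bad_bgzf_block` is made up for illustration, and the check covers only block headers and declared block sizes, not the compressed payload or its CRC.

```python
import os
import struct

# Every BGZF block is a gzip member whose header starts with these 4 bytes.
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def first_bad_bgzf_block(path):
    """Return the byte offset of the first malformed block, or None if all blocks parse."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        offset = 0
        while True:
            header = handle.read(12)
            if not header:
                return None  # reached end of file on a block boundary
            if len(header) < 12 or header[:4] != BGZF_MAGIC:
                return offset
            # Bytes 10-11 hold XLEN, the length of the gzip "extra" field.
            xlen = struct.unpack("<H", header[10:12])[0]
            extra = handle.read(xlen)
            if len(extra) < xlen:
                return offset
            # Look for the 'BC' subfield, which stores (block size - 1).
            block_size = None
            pos = 0
            while pos + 4 <= xlen:
                si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
                if si1 == 66 and si2 == 67 and slen == 2 and pos + 6 <= xlen:
                    block_size = struct.unpack_from("<H", extra, pos + 4)[0] + 1
                    break
                pos += 4 + slen
            if block_size is None or offset + block_size > size:
                return offset  # no size subfield, or the block is truncated
            offset += block_size
            handle.seek(offset)


# Hypothetical usage; the path is a placeholder:
# print(first_bad_bgzf_block("/data/reference.fasta.gz"))
```

A result of `None` only says that the block structure is intact. A file produced by plain `gzip` rather than `bgzip` is reported as bad at offset 0, which is a separate cause of the same error message.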

ASLeonard commented 9 months ago

I had a similar issue, but after installing everything from source (including whatshap v2.3-dev with py3.11 instead of v1.7 with py3.9), the error went away. The whatshap changelog doesn't indicate that this was fixed, but you could try updating whatshap in the clair3 docker image to v2.2 and see?

husamia commented 8 months ago

The warning message seems to be ignored and I still get results, but I wasn't sure whether it affects them. I tested that the reference archive was not corrupted by extracting it, so I suspect this is BAM related: perhaps some reads in the BAM are not encoded properly. These are long reads from Nanopore.

ASLeonard commented 8 months ago

So the error still appears, seemingly at random (maybe 60% of the time, using different BAMs but the same reference), even with whatshap v2.2. When it does appear, clair3 fails because there are no phased variants to use in later steps.

However, it appears that using pysam.FastaFile instead of pyfaidx.Fasta within whatshap fixes this, as pysam appears to have correct bgzf support for fetching sequence. Sort of a painful solution if you are installing whatshap through conda (you can edit the installed files directly), but it is doable.

https://github.com/whatshap/whatshap/blob/cefdaececfbb1aa63176da301a0c13ff368aceb4/whatshap/utils.py#L59

I tested this directly with pyfaidx, so I am confident this is an issue with pyfaidx loading from gzi indices rather than with corrupted references or BAM files: the second fetch below fails with pyfaidx, while the equivalent calls work with pysam.

import pyfaidx
f = pyfaidx.Faidx('<my bgzf reference>')
f.fetch('1',100,200)
   >1:100-200
   atcacatgactgatcatgcactgatcacgtgcctgatcatgcactgatcccgtggcagatcatgcactgatcacgtgcagatcatgcactcatcatgtggc

f.fetch('2',100,200)
Traceback (most recent call last):
... exact same error I get during clair3 ...
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xb0\x90\xee\xe5'; handle.tell() now says 37674
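
To help separate an index problem from a reader problem, one could check that every offset stored in the `.gzi` index lands on the start of a BGZF block. This is a rough sketch, assuming the `.gzi` layout described in the bgzip documentation (a little-endian uint64 entry count, followed by pairs of uint64 compressed and uncompressed offsets). The function names `read_gzi` and `bad_gzi_entries` are hypothetical.

```python
import struct

# Every BGZF block starts with these 4 bytes.
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def read_gzi(gzi_path):
    """Return the (compressed_offset, uncompressed_offset) pairs stored in a .gzi file."""
    with open(gzi_path, "rb") as handle:
        data = handle.read()
    # Assumed layout: uint64 count, then `count` pairs of uint64, all little-endian.
    (count,) = struct.unpack_from("<Q", data, 0)
    return [struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(count)]


def bad_gzi_entries(bgzf_path, gzi_path):
    """Return index entries whose compressed offset is not the start of a BGZF block."""
    bad = []
    with open(bgzf_path, "rb") as handle:
        for compressed, uncompressed in read_gzi(gzi_path):
            handle.seek(compressed)
            if handle.read(4) != BGZF_MAGIC:
                bad.append((compressed, uncompressed))
    return bad


# Hypothetical usage; both paths are placeholders:
# print(bad_gzi_entries("reference.fasta.gz", "reference.fasta.gz.gzi"))
```

An empty list would mean the index and the compressed file agree with each other, which would be consistent with the problem lying in how the offsets are used rather than in the files themselves.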

ASLeonard commented 8 months ago

I reran a sample that had finished successfully despite whatshap errors, this time with the pysam replacement, and the final result changed from 16.4 million variants to 16.6 million (1.3% increase). So not terrible, but substantial enough to care about.

dingyigithub commented 2 weeks ago

I hit the same error when running whatshap 2.3. It was also raised by pyfaidx when fetching the reference sequence. After I decompressed my gzipped FASTA, the error disappeared.
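
For anyone who wants to try the same workaround from Python, here is a minimal sketch of the decompression step. The function name `decompress_fasta` and the paths are placeholders. It relies on the standard `gzip` module, which reads multi-member gzip files and therefore BGZF files as well.

```python
import gzip
import shutil


def decompress_fasta(gz_path, out_path):
    """Stream a gzip- or BGZF-compressed FASTA into an uncompressed file."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj works in chunks, so a whole genome never sits in memory.
        shutil.copyfileobj(src, dst)
    return out_path


# Hypothetical usage; the paths are placeholders:
# decompress_fasta("/data/reference.fasta.gz", "/data/reference.fasta")
```

The uncompressed file needs its own index (for example via `samtools faidx`), and `--ref_fn` would then point at the uncompressed path.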