freeseek / mocha

MOsaic CHromosomal Alterations (MoChA) caller
MIT License
81 stars 23 forks source link

The sequence "hs37d5" not found and "No BGZF EOF marker" errors #36

Closed AkshajD closed 1 year ago

AkshajD commented 1 year ago

I have a single test VCF file from WGS that I am running through your data preparation pipeline. The original BAM was aligned on hs37d5 (an alternate form/version of GRCh37 from what I understand), and hence the VCF is also aligned to that version.

When running this first command with the VCF and the input files you indicated/provided for GRCh37 in the following command to "Create the minimal binary VCF":

bcftools view --no-version -h $vcf | sed 's/^\(##FORMAT=<ID=AD,Number=\)\./\1R/' | \
        bcftools reheader -h /dev/stdin $vcf | \
        bcftools filter --no-version -Ou -e "FMT/DP<10 | FMT/GQ<20" --set-GT . | \
        bcftools annotate --no-version -Ou -x ID,QUAL,^INFO/GC,^FMT/GT,^FMT/AD | \
        bcftools norm --no-version -Ou -m -any --keep-sum AD | \
        bcftools norm --no-version -Ob -f $ref | \
        tee $dir/$pfx.unphased.bcf | \
        bcftools index --force --output $dir/$pfx.unphased.bcf.csi

gives the errors:

[E::faidx_adjust_position] The sequence "hs37d5" was not found
faidx_fetch_seq failed at hs37d5:91
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
index: failed to create index for "-"

Subsequently, the following command to "generate list of variants to be excluded from modelling by both eagle and mocha":

awk -F"\t" '$2<.97 {print $1}' $crt > samples_xcl_list.txt; \
echo '##INFO=<ID=JK,Number=1,Type=Float,Description="Jukes Cantor">' | \
        bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf | \
        bcftools view --no-version -Ou -S ^samples_xcl_list.txt | \
        bcftools +fill-tags --no-version -Ou -t ^Y,MT,chrY,chrM -- -t ExcHet,F_MISSING | \
        bcftools view --no-version -Ou -G | \
        bcftools annotate --no-version -Ob \
                -i 'FILTER!="." && FILTER!="PASS" || INFO/JK<.02 || INFO/ExcHet<1e-6 || INFO/F_MISSING>1-.97' \
                -x ^INFO/JK,^INFO/ExcHet,^INFO/F_MISSING | \
        tee $dir/$pfx.xcl.bcf | \
        bcftools index --force --output $dir/$pfx.xcl.bcf.csi; \
/bin/rm samples_xcl_list.txt

gives the error: [W::bcf_sr_add_reader] No BGZF EOF marker; file '/home/adarbar/projects/rrg-rauhm/adarbar/MoChA_test/MoChA_Run/MoChA_output/008ca9e31add6629e40396fdf930a2b7.MoChA.unphased.bcf' may be truncated

It seems that this also stops it from generating the index files, preventing Phasing from being able to run.

Can you shed some light on what the problem here might be, and what I can do to fix it?

freeseek commented 1 year ago

There are two solutions:

The second error is just the result of the fact that the first command failed and the output VCF was not correctly generated