artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
112 stars 68 forks source link

Error when running with too few reads/very low coverage #91

Open Coppini opened 3 years ago

Coppini commented 3 years ago

When running Artic on a sample with too few reads, resulting in too small coverage, Longshot automatically calculates the maximum coverage as 0, and fails. We're having issues with that in an automatic pipeline, where some of our samples keep failing due to that, since they have too few reads (as expected, since they were negative samples).

Running: longshot -P 0 -F -A --no_haps --bam barcode91.primertrimmed.rg.sorted.bam --ref primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.reference.fasta --out barcode91.merged.vcf --potential_variants barcode91.merged.vcf.gz
Command failed:longshot -P 0 -F -A --no_haps --bam barcode91.primertrimmed.rg.sorted.bam --ref primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.reference.fasta --out barcode91.merged.vcf --potential_variants barcode91.merged.vcf.gz

Already reported it to them at https://github.com/pjedge/longshot/issues/72, but I agree with them that the best solution would be to simply not use the -A flag (or, alternatively, to ignore the error in the Artic pipeline when it happens, and create an empty file instead).


I tried running the branch in 1.3.0-dev as well, but end up with an error in the artic-tools check_vcf step:

Command failed:artic-tools check_vcf --summaryOut barcode91.vcfreport.txt --vcfOut barcode91.merged.filtered.vcf barcode91.merged.vcf.gz primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.scheme.bed 2> barcode91.vcfcheck.log

It seems the problem appears when the VCF file is empty:

bcftools view barcode91.merged.vcf.gz
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##medaka_version=1.2.3
##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth of reads at pos">
##INFO=<ID=DPS,Number=2,Type=Integer,Description="Depth of reads at pos by strand (fwd, rev)">
##INFO=<ID=DPSP,Number=1,Type=Integer,Description="Depth of reads spanning pos +-25">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Depth of spanning reads by strand which best align to each allele (ref fwd, ref rev, alt1 fwd, alt1 rev, etc.)">
##INFO=<ID=AR,Number=2,Type=Integer,Description="Depth of ambiguous spanning reads by strand which align equally well to all alleles (fwd, rev)">
##INFO=<ID=SC,Number=.,Type=Integer,Description="Total alignment score to each allele of spanning reads by strand (ref fwd, ref rev, alt1 fwd, alt1 rev, etc.) aligned with parasail match 5, mismatch -4, open 5, extend 3">
##INFO=<ID=Pool,Number=1,Type=String,Description="The pool name">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Medaka genotype.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Medaka genotype quality score">
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view barcode91.merged.vcf.gz; Date=Mon Aug 23 12:36:40 2021
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE

Running artic-tools check_vcf ends up on a Segmentation fault (core dumped) when reading the empty VCF:

$ artic-tools check_vcf --summaryOut barcode91.vcfreport.txt --vcfOut barcode91.merged.filtered.vcf barcode91.merged.vcf.gz primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.scheme.bed
[12:35:34] [artic-tools::check_vcf] starting VCF checker
[12:35:34] [artic-tools::check_vcf] reading scheme
[12:35:34] [artic-tools::check_vcf] collecting scheme stats
[12:35:34] [artic-tools::check_vcf]     primer scheme file: primer_schemes/SARS-CoV-2/V1200/SARS-CoV-2.scheme.bed
[12:35:34] [artic-tools::check_vcf]     reference sequence: MN908947.3
[12:35:34] [artic-tools::check_vcf]     number of pools:    2
[12:35:34] [artic-tools::check_vcf]     number of primers:  58 (includes 0 alts)
[12:35:34] [artic-tools::check_vcf]     minimum primer size:    22
[12:35:34] [artic-tools::check_vcf]     maximum primer size:    30
[12:35:34] [artic-tools::check_vcf]     number of amplicons:    29
[12:35:34] [artic-tools::check_vcf]     mean amplicon size: 1091
[12:35:34] [artic-tools::check_vcf]     maximum amplicon size:  1177
[12:35:34] [artic-tools::check_vcf]     scheme ref. span:   30-29790
[12:35:34] [artic-tools::check_vcf]     scheme overlaps:    6.5591393%
[12:35:34] [artic-tools::check_vcf] setting parameters
[12:35:34] [artic-tools::check_vcf]     output report: barcode91.vcfreport.txt
[12:35:34] [artic-tools::check_vcf]     filtering variants: true
[12:35:34] [artic-tools::check_vcf]     output VCF: barcode91.merged.filtered.vcf
[12:35:34] [artic-tools::check_vcf]     minimum quality threshold: 10.0
[12:35:34] [artic-tools::check_vcf] reading VCF file
Segmentation fault (core dumped)
Coppini commented 3 years ago

Running with artic-tools==0.3.1 fixes the artic-tools check_vcf error on artic==1.3.0-dev. It might be just a matter of updating the artic-tools dependency to the newest version.

Coppini commented 3 years ago

On v1.2.1, the problem is on Longshot

On v1.3.0, this is actually a duplicate of https://github.com/artic-network/fieldbioinformatics/issues/88

knewl commented 7 months ago

The above error is occurring regularly using ARTIC v1.2.4 in samples with very low coverage (roughly 0 to 5 aligned reads).

Command failed:longshot -P 0 -F -A --no_haps --bam sample_name.primertrimmed.rg.sorted.bam --ref primer-schemes/SARS-CoV-2/V5.3.2/SARS-CoV-2.reference.fasta --out sample_name.merged.vcf --potential_variants sample_name.merged.vcf.gz

As mentioned above, a solution would be to remove the '-A' (automatic max coverage) flag from the longshot command or to make it optional.