ARUP-NGS / BMFtools

Barcoded Molecular Families
MIT License

bmftools err main produces a segmentation fault #91

Closed BrettKennedy closed 7 years ago

BrettKennedy commented 8 years ago

When calculating the error rate for non-barcoded undetermined reads aligned to a phiX genome, bmftools err main produces a segmentation fault. It appears to complete and writes output to the expected files, but segfaults instead of exiting cleanly. Below are the command lines used to produce the data and to run bmftools err:

bwa mem -Mt 40 -R "@RG\tID:phix" <phix.fa> <read1.fq.gz> <read2.fq.gz> | samtools sort -o phix.bam -T temp -O bam -m 1G -@ 40

bmftools err main -o temp.txt -3 3.txt -f f.txt -n n.txt -c c.txt -g g.txt -P -a 1 <phix.fa> phix.bam

The stdout produced by this command line is:

[err_main_core] Records read: 1000000.
[err_main_core] Records read: 2000000.
Segmentation fault

When run through valgrind I see the same error repeated thousands of times:

==188112== Invalid read of size 8
==188112==    at 0x435B1E: bmf::err_main_core(char*, __faidx_t*, bmf::fullerr_t*, htsFormat*) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==188112==    by 0x446B23: bmf::err_main_main(int, char**) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==188112==    by 0x5C42D5C: (below main) (in /lib64/libc-2.12.so)
==188112==  Address 0x627c058 is 8 bytes after a block of size 544 alloc'd
==188112==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==188112==    by 0x4334CB: bmf::readerr_init(unsigned long) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==188112==    by 0x446A52: bmf::err_main_main(int, char**) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==188112==    by 0x5C42D5C: (below main) (in /lib64/libc-2.12.so)

dnbaker commented 8 years ago

Thanks for the report!

Do you have a commit or release tag to refer to, or is this on dev?

dnbaker commented 8 years ago

I think the problem is that higher quality scores are being observed than expected. lib/rescaler.h has:

#define NQSCORES 39uL

This is under the assumption that quality scores 2 through 40 (39 distinct values) are being assigned.

Modifying NQSCORES to be 50 (with a max expected of 51) solves my problem on a Platinum Genomes dataset (where the highest quality score I observe is 41), though 40uL would also have worked.

This either needs to be configurable or bootstrapped off the data. I propose simply setting a liberal threshold and always allocating more memory than needed, e.g., changing lib/rescaler.h to instead say:

#define NQSCORES 45uL

Otherwise, we'd have to stream through the whole dataset to know how much space to allocate.
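
To make the failure mode concrete, here's a minimal standalone sketch (not the actual BMFtools source; the counts array and the guard are made up for illustration) of how a per-quality table sized by NQSCORES gets overrun once a read carries a score above the assumed maximum:

// Illustration only: a table of NQSCORES counters covers scores 2..(NQSCORES + 1),
// so 39 slots span 2..40 and a Q41 read indexes one slot past the calloc'd block,
// matching the "Invalid read ... bytes after a block" errors valgrind reports above.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

#define NQSCORES 39uL  // 39 slots: quality scores 2 through 40 inclusive

int main() {
    // One counter per expected quality score, calloc'd much like readerr_init()'s tables.
    std::uint64_t *counts = (std::uint64_t *)std::calloc(NQSCORES, sizeof(std::uint64_t));
    int observed_quality = 41;                 // e.g., an unbinned score above the assumed max
    std::size_t index = observed_quality - 2;  // 39, but valid indices are 0..38
    if (index >= NQSCORES) {
        // Without this guard (or a larger NQSCORES) the increment below would
        // touch memory just past the allocation.
        std::fprintf(stderr, "Quality %d exceeds the assumed maximum\n", observed_quality);
        std::free(counts);
        return 1;
    }
    ++counts[index];
    std::free(counts);
    return 0;
}

Bumping NQSCORES just makes the table large enough that such a guard is never needed in practice.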

I'll update my fork and send it your way.

BrettKennedy commented 8 years ago

Thanks for the fast reply! Unfortunately, the issue persists. I re-ran on a smaller subset after pulling from the dev branch of your fork. Below is the complete command line and valgrind output. I'm just running these on some pretty normal undetermined reads from a MiSeq run.

valgrind bmftools err main -o temp.txt -3 3.txt -f f.txt -n n.txt -c c.txt -g g.txt -P -a 1 ~/projects/Resources/PhiX/PhiX.fa phix.bam

==41825== Memcheck, a memory error detector
==41825== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==41825== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==41825== Command: bmftools err main -o temp.txt -3 3.txt -f f.txt -n n.txt -c c.txt -g g.txt -P -a 1 /uufs/chpc.utah.edu/common/home/u0521163/projects/Resources/PhiX/PhiX.fa phix.bam
==41825==
==41825== Invalid read of size 8
==41825==    at 0x435751: bmf::err_main_core(char*, __faidx_t*, bmf::fullerr_t*, htsFormat*) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x44D713: bmf::err_main_main(int, char**) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x5C42D5C: (below main) (in /lib64/libc-2.12.so)
==41825==  Address 0x626edc0 is 0 bytes after a block of size 368 alloc'd
==41825==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==41825==    by 0x4331FB: bmf::readerr_init(unsigned long) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x44D642: bmf::err_main_main(int, char**) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x5C42D5C: (below main) (in /lib64/libc-2.12.so)
==41825==
==41825==
==41825== Process terminating with default action of signal 11 (SIGSEGV)
==41825==  Access not within mapped region at address 0x7273158
==41825==    at 0x435751: bmf::err_main_core(char*, __faidx_t*, bmf::fullerr_t*, htsFormat*) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x44D713: bmf::err_main_main(int, char**) (in /uufs/chpc.utah.edu/common/home/u0521163/bin/BMFtools/bmftools)
==41825==    by 0x5C42D5C: (below main) (in /lib64/libc-2.12.so)
==41825==  If you believe this happened as a result of a stack
==41825==  overflow in your program's main thread (unlikely but
==41825==  possible), you can try to increase the size of the
==41825==  main thread stack using the --main-stacksize= flag.
==41825==  The main thread stack size used in this run was 16777216.
==41825==
==41825== HEAP SUMMARY:
==41825==     in use at exit: 644,502 bytes in 1,045 blocks
==41825==   total heap usage: 1,075 allocs, 30 frees, 1,031,488 bytes allocated
==41825==
==41825== LEAK SUMMARY:
==41825==    definitely lost: 19,872 bytes in 54 blocks
==41825==    indirectly lost: 0 bytes in 0 blocks
==41825==      possibly lost: 151,257 bytes in 413 blocks
==41825==    still reachable: 473,373 bytes in 578 blocks
==41825==         suppressed: 0 bytes in 0 blocks
==41825== Rerun with --leak-check=full to see details of leaked memory
==41825==
==41825== For counts of detected and suppressed errors, rerun with: -v
==41825== ERROR SUMMARY: 933 errors from 1 contexts (suppressed: 6 from 6)

Segmentation fault

dnbaker commented 8 years ago

Thanks for the quick response!

Strange, my Platinum Genomes dataset isn't triggering any errors under valgrind. Can you send me a subset of your data for testing?

I'll track it with the debug build, which avoids inlining and makes it easier to hunt these kinds of things down.

Thanks!

BrettKennedy commented 8 years ago

Huh. Okay. Before I take more of your time, I'll run it on some additional data and make sure there isn't anything strange about those FASTQs. Thanks for your help!

dnbaker commented 8 years ago

A simple check you could do is:

import pysam
import sys
import itertools
if __name__ == "__main__":
    if not sys.argv[1:]:
        sys.stderr.write("Usage: python %s <in.bam>\n" % sys.argv[0])
        sys.exit(1)
    # Report the largest base quality seen in any record of the BAM.
    sys.stdout.write("Maximum observed quality score: %i\n" % max(itertools.chain.from_iterable(b.query_qualities for b in pysam.AlignmentFile(sys.argv[1]))))
    sys.exit(0)

This outputs "Maximum observed quality score: 41" on my 2.5-million-read subset of NA12878.

dnbaker commented 8 years ago

Any update on this?

BrettKennedy commented 8 years ago

The problem seems inconsistent across runs. It may be related to whether the sequencer "bins" the quality scores or not. I was gone at ASHG for about a week, so I should have time now to dig into it a bit.

I just need to test some FASTQs from each machine to see whether it's consistent.
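
If it helps, here's a quick standalone sketch (my own, not part of BMFtools, and the qualhist name in the usage comment is just whatever you compile it to) that tallies the distinct Phred+33 quality characters in a FASTQ streamed on stdin. A binned instrument shows only a handful of values, an unbinned one shows many; pipe gzipped files through zcat first.

// Standalone check, assumes a plain (uncompressed) FASTQ on stdin.
// Usage: zcat undetermined_R1.fastq.gz | ./qualhist
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<unsigned long long> hist(128, 0);  // one bin per ASCII quality character
    std::string line;
    unsigned long long lineno = 0;
    while (std::getline(std::cin, line)) {
        // The fourth line of every FASTQ record is the quality string.
        if (lineno++ % 4 == 3)
            for (unsigned char c : line) ++hist[c];
    }
    for (int c = 33; c < 128; ++c)  // report the Phred+33 scores actually observed
        if (hist[c])
            std::printf("Q%d ('%c'): %llu\n", c - 33, c, hist[c]);
    return 0;
}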

dnbaker commented 7 years ago

If this has been addressed, can this be closed?

BrettKennedy commented 7 years ago

@dnbh Issue has been resolved by #92, thanks Daniel!