broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

BWA and FML bindings should check return value of system calls #3209

Open droazen opened 7 years ago

droazen commented 7 years ago

We are currently plagued with cryptic intermittent failures coming from the BWA and FML bindings in Travis CI. These usually manifest as a simple "exited with code 137" (ie., killed by signal 9) error, but sometimes we get an explicit segfault or out-of-memory error. Examples:

�[31mFAILURE: �[39m�[31mBuild failed with an exception.�[39m
* What went wrong:
Execution failed for task ':test'.
�[33m> �[39mProcess 'Gradle Test Executor 1' finished with non-zero exit value 137
:test[M::bwa_idx_load_from_disk] read 0 ALT contigs
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000715180000, 719847424, 0) failed; error='Cannot allocate memory' (errno=12)

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 719847424 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/travis/build/broadinstitute/gatk/hs_err_pid11513.log
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f27ebfe7d9a, pid=11455, tid=0x00007f27e87e5700
#
# JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libfml.6198146539708364717.jnilib+0xed9a]  rld_itr_init+0x4a
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd2680a350c, pid=11685, tid=0x00007fd2b02bf700
#
# JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libbwa.5694772191018335324.jnilib+0x850c]  bwa_mem2idx+0xcc

The underlying issue in these cases is likely either "out of memory" or, perhaps in the case of the seg faults, "file not found" or "malformed file", but we could greatly improve our ability to interpret Travis failures if we were more careful about checking return values from system calls. Eg., in the function below from the BWA bindings we could check the return values of the mmap() and calloc() calls, and die with an appropriate error message if they fail:

bwaidx_t* jnibwa_openIndex( int fd ) {
    struct stat statBuf;
    if ( fstat(fd, &statBuf) == -1 ) return 0;
    uint8_t* mem = mmap(0, statBuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    bwaidx_t* pIdx = calloc(1, sizeof(bwaidx_t));
    bwa_mem2idx(statBuf.st_size, mem, pIdx);
    pIdx->is_shm = 1;
    mem_fmt_fnc = &fmt_BAMish;
    bwa_verbose = 0;
    return pIdx;
}
droazen commented 7 years ago

For @tedsharpe

tedsharpe commented 7 years ago

@droazen I have a PR for gatk-bwa-mem that adds a footer to the image file so that we can test integrity. I also added code to test every (I think) call that returns an error indication, and pass this info up the chain. Could you review the PR or delegate, please?

tedsharpe commented 7 years ago

@droazen Ditto for gatk-fermi-lite: there's a new PR that adds some error handling. However, that artifact is hooked up to Travis, which fails. So I'll probably need some help from @lbergelson to get that sorted out -- gradle is missing an assemble verb. Can you review the PR or assign a reviewer, please?

droazen commented 7 years ago

@tedsharpe I'm going to nominate @TedBrookings to do a first-pass review to check the system calls, as based on my conversation with him the other day I think he might be a good person to review. I'll do a review pass as well to have a look at the CRC stuff, etc.