benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Segmentation fault in learnErrors using bioconda dada2 1.10 #684

Closed apcamargo closed 5 years ago

apcamargo commented 5 years ago

I'm bumping into segmentation faults when trying to use learnErrors. This happened with version 1.8.0 and 1.10.0. I don't have this problem in version 1.6.0.

Here's the traceback when I was using version 1.10.0.

116200805 total bases in 474289 reads from 3 samples will be used for learning the error rates.
111156660 total bases in 617537 reads from 4 samples will be used for learning the error rates.


 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: dada_uniques(names(derep[[i]]$uniques), unname(derep[[i]]$uniques),     names(derep[[i]]$uniques) %in% c(priors, pseudo_priors),     err, qi, opts[["MATCH"]], opts[["MISMATCH"]], opts[["GAP_PENALTY"]],     opts[["USE_KMERS"]], opts[["KDIST_CUTOFF"]], opts[["BAND_SIZE"]],     opts[["OMEGA_A"]], opts[["OMEGA_P"]], opts[["OMEGA_C"]],     if (initializeErr) {        1    } else {        opts[["MAX_CLUST"]]    }, opts[["MIN_FOLD"]], opts[["MIN_HAMMING"]], opts[["MIN_ABUNDANCE"]],     TRUE, FALSE, opts[["VECTORIZED_ALIGNMENT"]], opts[["HOMOPOLYMER_GAP_PENALTY"]],     multithread, (verbose >= 2), opts[["SSE"]], opts[["GAPLESS"]],     opts[["GREEDY"]])
 2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction,     selfConsist = TRUE, multithread = multithread, verbose = verbose,     MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
 3: learnErrors(filtRs.16s, MAX_CONSIST = 40, multithread = snakemake@threads)
An irrecoverable exception occurred. R is aborting now ...
benjjneb commented 5 years ago

Can you provide the full output of sessionInfo() just before this error arises?

apcamargo commented 5 years ago

Sure!

I installed my environment through Bioconda.

[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dada2_1.10.0 Rcpp_1.0.0  

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2          pillar_1.3.1               
 [3] compiler_3.5.1              plyr_1.8.4                 
 [5] GenomeInfoDb_1.18.1         XVector_0.22.0             
 [7] bitops_1.0-6                tools_3.5.1                
 [9] zlibbioc_1.28.0             lattice_0.20-38            
[11] tibble_2.0.1                gtable_0.2.0               
[13] pkgconfig_2.0.2             rlang_0.3.1                
[15] Matrix_1.2-15               DelayedArray_0.8.0         
[17] parallel_3.5.1              GenomeInfoDbData_1.2.0     
[19] stringr_1.4.0               hwriter_1.3.2              
[21] dplyr_0.8.0.1               Biostrings_2.50.1          
[23] S4Vectors_0.20.1            IRanges_2.16.0             
[25] stats4_3.5.1                grid_3.5.1                 
[27] tidyselect_0.2.5            data.table_1.12.0          
[29] glue_1.3.0                  Biobase_2.42.0             
[31] R6_2.4.0                    BiocParallel_1.16.2        
[33] latticeExtra_0.6-28         reshape2_1.4.3             
[35] ggplot2_3.1.0               purrr_0.2.5                
[37] magrittr_1.5                Rsamtools_1.34.0           
[39] scales_1.0.0                matrixStats_0.54.0         
[41] BiocGenerics_0.28.0         GenomicRanges_1.34.0       
[43] GenomicAlignments_1.18.0    ShortRead_1.40.0           
[45] assertthat_0.2.0            SummarizedExperiment_1.12.0
[47] colorspace_1.4-0            stringi_1.2.4              
[49] RCurl_1.95-4.11             lazyeval_0.2.1             
[51] RcppParallel_4.4.1          munsell_0.5.0              
[53] crayon_1.3.4               
benjjneb commented 5 years ago

I installed my environment through Bioconda.

Could you try installing through the recommended channels (i.e. Bioconductor or the GitHub repository) and see if that solves the problem? Our installation page describes how to do both quite easily: https://benjjneb.github.io/dada2/dada-installation.html

apcamargo commented 5 years ago

I got no errors when running DADA2 from inside an R installation that I just compiled (I used version 3.5.2, but I don't think that makes any difference).

What might be causing this problem in Bioconda's build? Since my server runs an old OS, I install pre-compiled packages via conda, so I'm willing to help fix the issue in Bioconda's build.

valscherz commented 5 years ago

Hi,

I have made the following observation: adding the libgcc-ng=7.2.0 conda package to my conda environment definition solved the same issue. Interestingly, libgcc-ng=7.3.0 did not work, so apparently it is a conda definition problem.

Hope it helps

apcamargo commented 5 years ago

@valscherz DADA2 1.10.0 is incompatible with libgcc-ng 7.2.0.

UnsatisfiableError: The following specifications were found to be in conflict:
  - bioconductor-dada2=1.10.0 -> bioconductor-biostrings[version='>=2.50.0,<2.51.0'] -> bioconductor-iranges[version='>=2.16.0,<2.17.0'] -> bioconductor-s4vectors[version='>=0.20.0,<0.21.0'] -> libgcc-ng[version='>=7.3.0']
  - libgcc-ng=7.2.0
valscherz commented 5 years ago

@apcamargo Interesting... Then I don't know! All I know is that I could fix the error that way with the 1.8.0 package, but I have not tested with 1.10.0.

Have you tried turning multithreading off? In my case that prevented the error too. Maybe it could help someone solve the problem?

apcamargo commented 5 years ago

@valscherz I could get DADA2 1.8.0 working by installing libgcc-ng 7.2.0 in the environment, just as you said. In any case, I want to solve the issue for version 1.10.0, as it's being compiled with GCC 7 and I expect this will significantly improve conda's DADA2 build.

benjjneb commented 5 years ago

Moved this to a bug. Right now we don't know what is causing this, but it appears to be an issue with the bioconda builds of 1.10. As @apcamargo noted, there has been a big shift to a new GCC compiler with these new builds, so this may be an issue arising from that.

Any thoughts or help on this much appreciated though! Diagnosing conda build issues gets out of my area of expertise.

mworkentine commented 5 years ago

I've seen this with 1.10 from bioconda as well. Seems to be showing up only when multithreading is turned on.

brendanf commented 5 years ago

I managed to track down that this is actually an issue (presumably a library incompatibility) with RcppParallel in the newer builds of R on conda-forge (and maybe the Anaconda R as well). To verify, try this:

library(RcppParallel)
setThreadOptions(numThreads=2)

This will probably reproduce the error. I solved it by pinning r-base=3.4.1 and installing dada2 using remotes::install_github.
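That pinned-r-base workaround could be captured in a conda environment file; a minimal sketch is below (the environment name and the r-remotes package are my additions for installing from GitHub afterwards, and are not from the thread):

```yaml
# Sketch of an environment file for the workaround above: pin r-base to
# 3.4.1, then install dada2 afterwards with remotes::install_github()
# rather than from the bioconda package.
name: dada2-pinned        # hypothetical name
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=3.4.1          # the pin described above
  - r-remotes             # provides remotes::install_github()
```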

pdcountway commented 5 years ago

I installed my environment through Bioconda.

Could you try installing through the recommended channels (i.e. Bioconductor or the GitHub repository) and see if that solves the problem? Our installation page describes how to do both quite easily: https://benjjneb.github.io/dada2/dada-installation.html

Hi - Did anyone having this problem try installing via the 'recommended channels' instead of Bioconda? I don't see a clear answer to this. We're having the same issues with learnErrors as others have already reported, working on our HPCC. Is it correct that the only work-around is to run learnErrors with multithread=FALSE?

Thanks!

epruesse commented 5 years ago

Can someone provide a minimal example? Attach the conda diagnostics (conda info, conda list, conda env export) as files, as well as a script and any files necessary to trigger the segfault.

benjjneb commented 5 years ago

As a quick comment, I am failing to replicate this error on my local machine. I installed dada2 1.10 as follows:

conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels defaults
conda create -n dada2-1.10 python=3 anaconda
source activate dada2-1.10
conda install -c bioconda bioconductor-dada2=1.10.0

Inside that environment, I get the following R versions:

R -e "R.version$version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')"

[1] "R version 3.5.1 (2018-07-02)" Loading required package: Rcpp [1] ‘1.10.0’ [1] ‘0.12.18’ [1] ‘4.4.1’

The output of conda info is:

Current conda install:

           platform : osx-64
      conda version : 4.3.29
   conda is private : False
  conda-env version : 4.3.29
conda-build version : not installed
     python version : 3.6.0.final.0
   requests version : 2.12.4
   root environment : /Users/bcallah/miniconda3  (writable)
default environment : /Users/bcallah/miniconda3/envs/dada2-1.10
   envs directories : /Users/bcallah/miniconda3/envs
                      /Users/bcallah/.conda/envs
      package cache : /Users/bcallah/miniconda3/pkgs
                      /Users/bcallah/.conda/pkgs
       channel URLs : https://repo.continuum.io/pkgs/main/osx-64
                      https://repo.continuum.io/pkgs/main/noarch
                      https://repo.continuum.io/pkgs/free/osx-64
                      https://repo.continuum.io/pkgs/free/noarch
                      https://repo.continuum.io/pkgs/r/osx-64
                      https://repo.continuum.io/pkgs/r/noarch
                      https://repo.continuum.io/pkgs/pro/osx-64
                      https://repo.continuum.io/pkgs/pro/noarch
                      https://conda.anaconda.org/bioconda/osx-64
                      https://conda.anaconda.org/bioconda/noarch
                      https://conda.anaconda.org/conda-forge/osx-64
                      https://conda.anaconda.org/conda-forge/noarch
        config file : /Users/bcallah/.condarc
         netrc file : None
       offline mode : False
         user-agent : conda/4.3.29 requests/2.12.4 CPython/3.6.0 Darwin/16.7.0 OSX/10.12.6    
            UID:GID : 502:20

So any specifics on how to make this segfault happen would be much appreciated!

mworkentine commented 5 years ago

Here's what's doing it for me:

conda create -n dada2_test bioconductor-dada2=1.10
conda activate dada2_test

Then:

R -e "R.version$version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')"

[1] "R version 3.5.1 (2018-07-02)" Loading required package: Rcpp [1] ‘1.10.0’ [1] ‘1.0.1’ [1] ‘4.4.1’

And conda info:


     active environment : dada2_test
    active env location : /home/mworken/miniconda3/envs/dada2_test
            shell level : 2
       user config file : /home/mworken/.condarc
 populated config files : /home/mworken/.condarc
          conda version : 4.6.8
    conda-build version : not installed
         python version : 3.6.7.final.0
       base environment : /home/mworken/miniconda3  (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/linux-64
                          https://conda.anaconda.org/bioconda/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/mworken/miniconda3/pkgs
                          /home/mworken/.conda/pkgs
       envs directories : /home/mworken/miniconda3/envs
                          /home/mworken/.conda/envs
               platform : linux-64
             user-agent : conda/4.6.8 requests/2.21.0 CPython/3.6.7 Linux/3.10.0-514.26.2.el7.x86_64 centos/7.3.1611 glibc/2.17
                UID:GID : 160032:3453
             netrc file : None
           offline mode : False

This gives the same error as above.

The main differences from @benjjneb are the system (osx vs linux), the conda version, and the Rcpp version. I tried downgrading Rcpp to 0.12.18: same error.

mworkentine commented 5 years ago

One more note on this: a colleague and I have both generated this error on the same system with two different datasets. I'll see if I can put together a minimal example as well.

mworkentine commented 5 years ago

Here's a minimal example that will segfault with the setup above, but only when learnErrors is run on both sets of read files and only with multithreading turned on. Either one by itself finishes fine, and multithread = FALSE also works.

Read files are from the dada2 tutorial.

library(dada2)

r1_files = c("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R1_001.fastq")
r2_files = c("MiSeq_SOP/F3D0_S188_L001_R2_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R2_001.fastq")

filt_r1 = c("filtered/F3D0_R1.fq.gz", "filtered/F3D141_R1.fq.gz")
filt_r2 = c("filtered/F3D0_R2.fq.gz", "filtered/F3D141_R2.fq.gz")

# so learnErrors doesn't choke on the raw read names
out = filterAndTrim(r1_files, filt_r1, r2_files, filt_r2)

err_r1 = learnErrors(filt_r1, multithread = 2)
err_r2 = learnErrors(filt_r2, multithread = 2)
kguay commented 5 years ago

Hi @benjjneb and @epruesse,

Thank you both for looking at this. I am working with @pdcountway to get this working. I followed your instructions to a T with a fresh Anaconda install. My package versions and environment seem to match yours (below). I'm not sure what is going on. We are using CentOS 7.6 and I am installing it as an environment module. I have attached the files that @epruesse requested: dada-diagnostic.txt

As for reproducing the error, I am following the dada2 tutorial (except skipping the plots). Everything works up until the error rates:

errF <- learnErrors(filtFs, multithread=TRUE)

33514080 total bases in 139642 reads from 20 samples will be used for learning the error rates.

 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: RcppParallel::setThreadOptions(numThreads = "auto")
 2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction, selfConsist = TRUE, multithread = multithread, verbose = verbose, MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
 3: learnErrors(filtFs, multithread = TRUE)

R.version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')

[1] "R version 3.5.1 (2018-07-02)" [1] ‘1.10.0’ [1] ‘0.12.18’ [1] ‘4.4.1’

conda info

     active environment : dada2-1.10
    active env location : /mnt/modules/bin/dada-2/1.10/envs/dada2-1.10
            shell level : 1
       user config file : /root/.condarc
 populated config files : /root/.condarc
          conda version : 4.6.8
    conda-build version : 3.17.6
         python version : 3.7.1.final.0
       base environment : /mnt/modules/bin/dada-2/1.10  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://conda.anaconda.org/bioconda/linux-64
                          https://conda.anaconda.org/bioconda/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /mnt/modules/bin/dada-2/1.10/pkgs
                          /root/.conda/pkgs
       envs directories : /mnt/modules/bin/dada-2/1.10/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.6.8 requests/2.21.0 CPython/3.7.1 Linux/3.10.0-514.10.2.el7.x86_64 centos/7.6.1810 glibc/2.17
                UID:GID : 0:0
             netrc file : None
           offline mode : False

Thanks again!

epruesse commented 5 years ago

Just the env from @kguay: dada2-broken.txt. Download it, rename it to something.yaml, then run

conda env create -n dada2-segfaulting --file something.yaml

(Sadly, GH does not allow attaching files ending in .yaml here...)

epruesse commented 5 years ago

Ok, please run this with your broken conda environment activated:

ldd $CONDA_PREFIX/lib/R/library/data.table/libs/datatable.so | grep -vE "$CONDA_PREFIX|(lib(c|rt|dl|m|pthread)|ld-linux-x86-64|linux-vdso).so"

In my case, it says

libgfortran.so.3 => not found

It looks like the package was linked against libgfortran.so.3 while it installs libgfortran.so.4, which in my case leads to no lib being found, and in other cases might lead to a mismatching one being loaded. I'm not sure this is the cause: in theory any matching libgfortran should work. In practice, there might be ODR violations in the library that cause the segfault.

In either case, this is a broken env.
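As a side note, that "not found" check can be scripted; here is a small Python sketch (mine, not from the thread) that parses ldd output for unresolved sonames like the libgfortran.so.3 case above:

```python
# Sketch: find shared libraries that the dynamic loader cannot resolve,
# by parsing the "=> not found" lines in `ldd` output.
import subprocess

def missing_deps(ldd_output: str) -> list:
    """Return the sonames that an `ldd` run reported as 'not found'."""
    return [line.split()[0]
            for line in ldd_output.splitlines()
            if "=> not found" in line]

def check_shared_object(so_path: str) -> list:
    """Run `ldd` on so_path (Linux only) and return unresolved sonames."""
    out = subprocess.run(["ldd", so_path], capture_output=True, text=True)
    return missing_deps(out.stdout)
```

check_shared_object() only works where an ldd binary exists (i.e. Linux); the parsing helper itself is portable.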

epruesse commented 5 years ago

Ok, just realized that it's data.table, not dada2, that was causing the loading problem for me in that env of yours.

@kguay Please set conda config --set channel_priority strict and try installing again to see whether the issue persists.

The problem is that r-data.table was moved to conda-forge because it's not life-science specific. For some reason, in your env, it picked up the legacy bioconda package, which wasn't built with the current dependencies in mind and so didn't require the right things. This is something the conda solver sometimes comes up with, sadly. One way of fixing it is to tell it never to use a package from a lower-priority channel if it is available in a higher-priority one. That's the setting above.
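For anyone applying this, the resulting ~/.condarc would look roughly like the sketch below (the channel order shown is illustrative; the strict priority line is the essential part):

```yaml
# Sketch of a ~/.condarc with strict channel priority, so the solver never
# substitutes a package from a lower-priority channel (e.g. a legacy
# bioconda r-data.table) when a higher-priority channel provides it.
channel_priority: strict
channels:
  - conda-forge
  - bioconda
  - defaults
```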

epruesse commented 5 years ago

Ok. One step closer. After installing r-data.table I got the one from conda-forge and with this env I can reproduce the crash.

R -d gdb -f breaks.R
[...]
(gdb) run
[...]
> err_r2 = learnErrors(filt_r2, multithread = 2)
3420302 total bases in 13659 reads from 2 samples will be used for learning the error rates.
[Thread 0x7fffc5e3a700 (LWP 9382) exited]

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffccecf200, 
    number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0, 
    context=context@entry=0x7fffd1c90c60 <tbb::internal::the_dummy_context>) at ../../src/tbb/scheduler.cpp:320
320 ../../src/tbb/scheduler.cpp: No such file or directory.
(gdb) bt
#0  tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffccecf200, 
    number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0, 
    context=context@entry=0x7fffd1c90c60 <tbb::internal::the_dummy_context>) at ../../src/tbb/scheduler.cpp:320
#1  0x00007fffd1c7862f in tbb::internal::generic_scheduler::generic_scheduler (this=this@entry=0x7fffccecf200, m=...)
    at ../../src/tbb/scheduler.cpp:100
#2  0x00007fffd1c7a8ea in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::custom_scheduler (
    m=..., this=0x7fffccecf200) at ../../src/tbb/custom_scheduler.h:59
#3  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler (m=...)
    at ../../src/tbb/custom_scheduler.h:120
#4  0x00007fffd1c78833 in tbb::internal::allocate_scheduler (m=...) at ../../src/tbb/scheduler.cpp:42
#5  tbb::internal::generic_scheduler::create_master (a=0x7fffccecba00) at ../../src/tbb/scheduler.cpp:1161
#6  0x00007fffd1c7385e in tbb::internal::governor::init_scheduler (num_threads=num_threads@entry=2, 
    stack_size=stack_size@entry=0, auto_init=auto_init@entry=false) at ../../src/tbb/governor.cpp:200
#7  0x00007fffd1c7393e in tbb::task_scheduler_init::initialize (this=0x55555629c490, number_of_threads=2, 
    thread_stack_size=0) at ../../src/tbb/governor.cpp:320
#8  0x00007fffd213e1c2 in setThreadOptions ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/library/RcppParallel/libs/RcppParallel.so
#9  0x00007ffff7c31e46 in do_dotcall ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#10 0x00007ffff7c6a5f4 in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#11 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#12 0x00007ffff7c7c020 in R_execClosure ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#13 0x00007ffff7c719fc in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#14 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#15 0x00007ffff7c7c020 in R_execClosure ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#16 0x00007ffff7c719fc in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#17 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#18 0x00007ffff7c7c020 in R_execClosure ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#19 0x00007ffff7c7a55c in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#20 0x00007ffff7c7ec8a in do_set () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#21 0x00007ffff7c7a7e4 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#22 0x00007ffff7cabd36 in Rf_ReplIteration ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#23 0x00007ffff7cac19a in R_ReplConsole ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#24 0x00007ffff7cac249 in run_Rmainloop ()
   from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#25 0x000055555555506d in main ()
#26 0x00007ffff791b09b in __libc_start_main (main=0x555555555050 <main>, argc=3, argv=0x7fffffffa878, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffa868)
    at ../csu/libc-start.c:308
#27 0x000055555555509d in _start () at ../sysdeps/x86_64/elf/start.S:103
epruesse commented 5 years ago

Continued from above:

(gdb) c
Continuing.

 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: RcppParallel::setThreadOptions(numThreads = multithread)
 2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction,     selfConsist = TRUE, multithread = multithread, verbose = verbose,     MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
 3: learnErrors(filt_r2, multithread = 2)
An irrecoverable exception occurred. R is aborting now ...

So on the C side, we have a crash inside TBB somewhere, with frame #8 (setThreadOptions) the first function outside of R scope. On the R side, that's dada() calling RcppParallel::setThreadOptions(numThreads = multithread).

Don't ask me what's going on there. But if I run the code above with

err_r1 = learnErrors(filt_r1, multithread = 2)
err_r2 = learnErrors(filt_r2) #, multithread = 2)

it does not segfault.

It also does not segfault if I run R -d valgrind -f breaks.R - although it does take forever.

So we've got a concurrency bug here... hard to trace. I think we can assume that TBB and RcppParallel are less likely than dada2 to have bugs in this regard.

epruesse commented 5 years ago

Ok, if I use multithread=TRUE, the first call runs fine again and the second call to learnErrors crashes in dada_uniques. It's reproducible in valgrind, though in GDB it crashes somewhere else.

@benjjneb This is starting to get out of scope for me. Can you reproduce this? It probably needs building dada2 (and R) with -fsanitize=address and friends (see https://cloud.r-project.org/doc/manuals/r-devel/R-exts.html#Debugging) to see where the concurrency bug is.
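For reference, a sanitizer rebuild could be sketched in ~/.R/Makevars roughly like this (an assumption on my part, following the R-exts debugging chapter; exact flags and compiler will vary):

```make
# Hypothetical ~/.R/Makevars fragment for a sanitizer build of dada2's
# C++ code; per R-exts, R itself should also be built with the sanitizer
# for full coverage.
CXX = g++ -fsanitize=address -fno-omit-frame-pointer
CXXFLAGS = -g -O1 -Wall
```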

kguay commented 5 years ago

Thanks for working on this @epruesse. I just read through all of your replies. I don't think there are any action items for me, but I'm glad you were able to reproduce the problem. Let me know if you need anything else from me.

benjjneb commented 5 years ago

I was finally able to replicate this error. Hooray! (sort of) It required moving to a Linux system (Ubuntu), as it wasn't happening on OSX.

epruesse commented 5 years ago

@benjjneb Ping me if I can help. The next challenge will be getting the crash with debug options so that you can track a little better where it's breaking.

FYI: below are the options that Bioconda builds its binaries with. They are set by the compiler packages provided by Anaconda itself. If you install gxx_linux-64 or clangxx_osx-64 into an environment, they are set when you conda activate it. That might be the best way for you to reproduce the build as closely as possible.

I think the Anaconda folks started with the Red Hat defaults and added some conda-specific flags. I'm not sure they are optimal, but we haven't even tried second-guessing them because we've got enough things to worry about.

LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro \
-Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections
CXXFLAGS=-fvisibility-inlines-hidden -std=c++17 \
-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize \
 -fPIC -fstack-protector-strong -fno-plt -O2 \
 -ffunction-sections -pipe

Linker options explained:

kguay commented 5 years ago

We just discovered that the learnErrors function works with anything other than multithread=2 (or TRUE). So, 1, 3, 4, and 100 all work.

apcamargo commented 5 years ago

@kguay I've been getting this error using multithread = 80.

kguay commented 5 years ago

@apcamargo How strange. 80 works for me, but 2 still crashes.

pdcountway commented 5 years ago

@kguay @apcamargo It just crashed for me with multithread=16, but not until I ran the learnErrors command for the second time (on the reverse reads). It worked fine the first time I ran it (on the forward reads). I believe that @epruesse reported similar behavior.

thermokarst commented 5 years ago

Hey @benjjneb, just wanted to let you know @ebolyen and I have blocked some time out to dive into this issue a bit more, will keep you posted. Thanks!

epruesse commented 5 years ago

@benjjneb @thermokarst @ebolyen I've got a goodie for you that might help debugging and testing:

Our infrastructure now makes package previews available for all PRs opened at bioconda/bioconda-recipes. You get direct download links for the packages and a conda install -c bioconda -c conda-forge -c https://.... snippet that allows you to install the package locally. It's effectively a one-PR micro-channel, so you can even test the effects on whole environments.

You are welcome to use this feature for debugging dada2 - e.g. by creating a PR with just a build number bump and debug flags set. If you need to change the source URLs to point to non-release tarballs, you may have to also add a flag to override the linter to get through to the build stages.

Hope that helps a little. I'm really tight on time and have to force myself to not start digging into this. Ping me if you need help with the Bioconda side of things though!

thermokarst commented 5 years ago

Thanks @epruesse, that is super awesome!

Recreating what @epruesse did above, with valgrind (just dumping here so that I don't lose this):

R -d valgrind
==23903== Memcheck, a memory error detector
==23903== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==23903== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==23903== Command: /home/matthew/.conda/envs/dada2_debug/lib/R/bin/exec/R
==23903==

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(dada2)
Loading required package: Rcpp

>
> r1_files = c("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R1_001.fastq")
> r2_files = c("MiSeq_SOP/F3D0_S188_L001_R2_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R2_001.fastq")
>
> filt_r1 = c("filtered/F3D0_R1.fq.gz", "filtered/F3D141_R1.fq.gz")
> filt_r2 = c("filtered/F3D0_R2.fq.gz", "filtered/F3D141_R2.fq.gz")
>
> # so learnErrors doesn't choke on the raw read names
> out = filterAndTrim(r1_files, filt_r1, r2_files, filt_r2)
> err_r1 = learnErrors(filt_r1, multithread = 2)
3426981 total bases in 13659 reads from 2 samples will be used for learning the error rates.
> err_r2 = learnErrors(filt_r2, multithread = 2)
3420302 total bases in 13659 reads from 2 samples will be used for learning the error rates.
==23903== Invalid read of size 8
==23903==    at 0x23C9A548: tbb::internal::generic_scheduler::allocate_task(unsigned long, tbb::task*, tbb::task_group_context*) (scheduler.cpp:320)
==23903==    by 0x23C9A62E: tbb::internal::generic_scheduler::generic_scheduler(tbb::internal::market&) (scheduler.cpp:100)
==23903==    by 0x23C9C8E9: custom_scheduler (custom_scheduler.h:59)
==23903==    by 0x23C9C8E9: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler(tbb::internal::market&) (custom_scheduler.h:120)
==23903==    by 0x23C9A832: allocate_scheduler (scheduler.cpp:42)
==23903==    by 0x23C9A832: tbb::internal::generic_scheduler::create_master(tbb::internal::arena*) (scheduler.cpp:1161)
==23903==    by 0x23C9585D: tbb::internal::governor::init_scheduler(int, unsigned long, bool) (governor.cpp:200)
==23903==    by 0x23C9593D: tbb::task_scheduler_init::initialize(int, unsigned long) (governor.cpp:320)
==23903==    by 0x23CF51C1: setThreadOptions (in /home/matthew/.conda/envs/dada2_debug/lib/R/library/RcppParallel/libs/RcppParallel.so)
==23903==    by 0x4941E45: do_dotcall (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903==    by 0x497A5F3: bcEval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903==    by 0x498A38F: Rf_eval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903==    by 0x498C01F: R_execClosure (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903==    by 0x49819FB: bcEval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903==  Address 0xfffffffffffffff7 is not stack'd, malloc'd or (recently) free'd
==23903==

 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: RcppParallel::setThreadOptions(numThreads = multithread)
 2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction,     selfConsist = TRUE, multithread = multithread, verbose = verbose,     MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
 3: learnErrors(filt_r2, multithread = 2)
benjjneb commented 5 years ago

> Hey @benjjneb, just wanted to let you know @ebolyen and I have blocked some time out to dive into this issue a bit more, will keep you posted. Thanks!

If you can let me know in advance at all, I will try to block out the same time as well and be available online to try out code changes. I admit I'm struggling with this one myself.

epruesse commented 5 years ago

The catch with the error above is that it's the symptom, not the cause. 0xff....7 is -9, not a plausible memory address. It's curious that everyone sees that same address; it sounds like something is writing that number into places it doesn't belong. That itself might be yet another symptom though, so it might not be the easiest "thread" (no pun intended) to pull on.

I saw that you have written some custom C-style data structures and allocation functions. If you share and access those from multiple threads, you have to be really careful about concurrency. Writing thread-safe data structures is a seriously complicated art. Consider that something as unassuming as ++i is not thread safe (unless i is std::atomic<int>, which might just use a mutex internally to make sure you're OK). I don't have the time to review the code completely, but start by looking at the places shared by threads: the writes, allocations, and deallocations. You will probably have to throw in a few mutexes - if done right, they won't cost that much time.

I might be wrong. I have been before. This type of bug is hell. All I know for certain is that there are similar ones lurking in SINA, waiting to be "expressed".

ebolyen commented 5 years ago

Googling about, I found some highly relevant links I think:

https://github.com/cms-sw/cmssw/issues/14409
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/636519

It looks like TBB fails with "invalid read of size 8" fairly regularly, and at first glance it looks like an out-of-order initialization problem.

ebolyen commented 5 years ago

It is also possible that TBB is linked wrong, but that seems less likely, as I would expect half of Bioconda to have already exploded.

epruesse commented 5 years ago

TBB is from Conda-Forge, so not even just Bioconda. Moreover, IIRC the recipe was actually provided by Intel engineers. Still, just trying a few different versions of TBB might be worth exploring. Their ABI is very stable, so probably not too complicated to do.

ebolyen commented 5 years ago

Turns out RcppParallel vendors TBB, so it's not a linking issue. Double init still seems like a good candidate, so I'm going to look through the RcppParallel code to see if I can find a place for that to happen. @thermokarst is testing whether compiling a new RcppParallel helps at all.

thermokarst commented 5 years ago

RcppParallel 4.4.2 does not segfault for me when running the MWE above.

ebolyen commented 5 years ago

I wasn't able to find the GitHub commit that corresponds to 4.3.20, but I pulled the source down and compared it to master. (EDIT: as usual, I'm looking at an old version masked by a different channel :-/) Nothing is different with respect to TBB init (and they are doing it the right way), so I don't think DADA2 can cause a double init here. Also, after walking through DADA2's C++, I don't see how that could even have happened: the RcppParallel::setThreadOptions call occurs well before anything interesting happens...

@thermokarst, if you compile the old RcppParallel, what do we see?

thermokarst commented 5 years ago

Steps to reproduce (https://github.com/benjjneb/dada2/issues/684#issuecomment-482685395):

conda create -n dada2_debug \
  -c conda-forge \
  -c bioconda \
  -c defaults \
  --override-channels \
  bioconductor-dada2=1.10
conda activate dada2_debug
# Using RcppParallel 4.4.1 (from conda-forge)
> library(dada2)
> filterAndTrim("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "filtered/F3D0_R1.fq.gz")
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
# segfaults
# upgrade RcppParallel
conda uninstall --force r-rcppparallel
R -e 'install.packages("RcppParallel", repos="http://cran.us.r-project.org")'
# Using RcppParallel 4.4.2 (from CRAN)
> library(dada2)
> filterAndTrim("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "filtered/F3D0_R1.fq.gz")
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
# does not segfault
thermokarst commented 5 years ago

> @thermokarst, if you compile the old RcppParallel, what do we see?

Compiling RcppParallel 4.4.1 from source:

conda uninstall --force r-rcppparallel
R -e 'install.packages("https://cran.r-project.org/src/contrib/Archive/RcppParallel/RcppParallel_4.4.1.tar.gz", repos=NULL, type="source")'

The MWE above completes without segfaulting. This looks like a conda build-environment issue, since the same version of RcppParallel works just fine when built locally. Getting closer!

ebolyen commented 5 years ago

Extremely relevant: https://github.com/01org/tbb/issues/83

Since RcppParallel vendors TBB, it is probably being compiled without the workaround needed for GCC7.

ebolyen commented 5 years ago

OK, we're pretty convinced it's a missing compile flag on Linux (-flifetime-dse=1).

We're going to test this theory by building the conda-forge recipe, via the bioconda tools (since that's really easy). We're expecting to see the same issue with the local build.

Then we'll update the recipe to use:

 $R CMD INSTALL --configure-args="CXXFLAGS='-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize  -fPIC -fstack-protector-strong -fno-plt -O2  -ffunction-sections -pipe -flifetime-dse=1'" --build .

Which uses the same CXXFLAGS, but adds -flifetime-dse=1.

@epruesse, assuming this works, there's probably a better way to set that compiler flag. Do you have any thoughts on that?

thermokarst commented 5 years ago

Update: we are actually building and testing locally using the conda-forge tooling, since the package recipe comes from conda-forge (plus, the tooling already exists there).

ebolyen commented 5 years ago

We have a build that fails locally!

ebolyen commented 5 years ago

OK, so we are feeling pretty confident we know what's going on.

Here's what we know:

The -flifetime-dse flag is part of TBB's Makefile; however, it is only added after probing gcc.

We see during the conda build:

make -e -C "../build/lib_release"  -r -f ../../build/Makefile.tbb cfg=release
make[2]: Entering directory '$SRC_DIR/src/tbb/build/lib_release'
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found

This looks a lot like those feature-flag probes failing to produce anything useful, meaning we don't end up with the -flifetime-dse=1 flag (and probably other important flags, either).

This is why a compile from source (outside of conda) seems to work fine: it has the gcc binary available and detects the appropriate flags to use.


What we're struggling with is convincing make to use the conda compiler defined in $CC instead of a binary named gcc. This seems like a tractable problem, however, so we're pretty optimistic.
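One hypothetical shape for that fix (a sketch, not tested against the recipe here): put a gcc shim on PATH that forwards to $CC, so TBB's Makefile probes, which hardcode the name gcc, find a working compiler. The shim location and the fallback to cc are illustrative assumptions:

```shell
# Sketch: create a fake `gcc` that execs the conda compiler from $CC.
CC="${CC:-cc}"                 # conda-build exports $CC; fall back to cc
shim_dir="$(mktemp -d)"
printf '#!/bin/sh\nexec "%s" "$@"\n' "$CC" > "$shim_dir/gcc"
chmod +x "$shim_dir/gcc"
export PATH="$shim_dir:$PATH"  # TBB's `gcc` probes now hit the shim
command -v gcc                 # resolves to "$shim_dir/gcc"
```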

That said, we're stepping away for the weekend, if anyone wants to try fixing the RcppParallel build in the interim, please let us know how it goes! We'll see if we can pick this back up on Monday.

epruesse commented 5 years ago

Couldn't you just add -flifetime-dse=1 to the CXXFLAGS for Rcpp and dada2 to see if this fixes things?
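For a quick local test of that idea, R's user-level Makevars file is one place to inject the flag for source installs of RcppParallel or dada2 (a sketch; -flifetime-dse is GCC-only, so this assumes a GCC toolchain):

```make
# ~/.R/Makevars -- sketch: append the GCC workaround flag for any
# package installed from source (RcppParallel, dada2, ...).
# Remove these lines for clang toolchains, which reject the flag.
CXXFLAGS += -flifetime-dse=1
CXX11FLAGS += -flifetime-dse=1
```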