apcamargo closed this issue 5 years ago.
Can you provide the full output of sessionInfo() just before this error arises?
Sure!
I installed my environment through Bioconda.
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dada2_1.10.0 Rcpp_1.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 pillar_1.3.1
[3] compiler_3.5.1 plyr_1.8.4
[5] GenomeInfoDb_1.18.1 XVector_0.22.0
[7] bitops_1.0-6 tools_3.5.1
[9] zlibbioc_1.28.0 lattice_0.20-38
[11] tibble_2.0.1 gtable_0.2.0
[13] pkgconfig_2.0.2 rlang_0.3.1
[15] Matrix_1.2-15 DelayedArray_0.8.0
[17] parallel_3.5.1 GenomeInfoDbData_1.2.0
[19] stringr_1.4.0 hwriter_1.3.2
[21] dplyr_0.8.0.1 Biostrings_2.50.1
[23] S4Vectors_0.20.1 IRanges_2.16.0
[25] stats4_3.5.1 grid_3.5.1
[27] tidyselect_0.2.5 data.table_1.12.0
[29] glue_1.3.0 Biobase_2.42.0
[31] R6_2.4.0 BiocParallel_1.16.2
[33] latticeExtra_0.6-28 reshape2_1.4.3
[35] ggplot2_3.1.0 purrr_0.2.5
[37] magrittr_1.5 Rsamtools_1.34.0
[39] scales_1.0.0 matrixStats_0.54.0
[41] BiocGenerics_0.28.0 GenomicRanges_1.34.0
[43] GenomicAlignments_1.18.0 ShortRead_1.40.0
[45] assertthat_0.2.0 SummarizedExperiment_1.12.0
[47] colorspace_1.4-0 stringi_1.2.4
[49] RCurl_1.95-4.11 lazyeval_0.2.1
[51] RcppParallel_4.4.1 munsell_0.5.0
[53] crayon_1.3.4
I installed my environment through Bioconda.
Could you try installing through the recommended channels (i.e. Bioconductor or the GitHub repository) and see if that solves the problem? Our installation page describes how to do both quite easily: https://benjjneb.github.io/dada2/dada-installation.html
I got no errors when running DADA2 from inside an R installation that I just compiled (I used version 3.5.2, but I don't think that makes any difference).
What may be causing this problem in Bioconda's build? As my server runs an old OS, I install pre-compiled packages via conda, so I'm willing to try to fix the issue in Bioconda's build.
Hi,
I have made the following observation: adding the libgcc-ng=7.2.0 conda package to my conda environment definition solved the same issue. Interestingly, libgcc-ng=7.3.0 did not work, so yes, apparently it is a conda definition problem.
Hope it helps
@valscherz DADA2 1.10.0 is incompatible with libgcc-ng 7.2.0.
UnsatisfiableError: The following specifications were found to be in conflict:
- bioconductor-dada2=1.10.0 -> bioconductor-biostrings[version='>=2.50.0,<2.51.0'] -> bioconductor-iranges[version='>=2.16.0,<2.17.0'] -> bioconductor-s4vectors[version='>=0.20.0,<0.21.0'] -> libgcc-ng[version='>=7.3.0']
- libgcc-ng=7.2.0
@apcamargo Interesting... Then I don't know! All I know is that I could fix the error that way with the 1.8.0 package, but I have not tested with 1.10.0.
Have you tried turning multithreading off? In my case that prevented the error too... Maybe it could help someone solve the problem?
@valscherz I could get DADA2 1.8.0 working by installing libgcc-ng 7.2.0 in the environment, just as you said. In any case, I want to solve the issue for version 1.10.0, as it's being compiled with GCC 7 and I expect this will significantly improve conda's DADA2 build.
Moved this to a bug. Right now we don't know what is causing this, but it appears to be an issue with the Bioconda builds of 1.10. As @apcamargo noted, there has been a big shift to a new GCC compiler with these new builds, so this may be an issue arising from that.
Any thoughts or help on this much appreciated though! Diagnosing conda build issues gets out of my area of expertise.
I've seen this with 1.10 from bioconda as well. Seems to be showing up only when multithreading is turned on.
I managed to track down that this is actually an issue (presumably a library incompatibility) with RcppParallel in the newer builds of R on conda-forge (and maybe the Anaconda R as well). To verify, try this:
library(RcppParallel)
setThreadOptions(numThreads=2)
This will probably reproduce the error.
I solved it by pinning r-base=3.4.1 and installing dada2 using remotes::install_github.
I installed my environment through Bioconda.
Could you try installing through the recommended channels (i.e. Bioconductor or the GitHub repository) and see if that solves the problem? Our installation page describes how to do both quite easily: https://benjjneb.github.io/dada2/dada-installation.html
Hi - Did anyone having this problem try installing via the 'recommended channels' instead of Bioconda? I don't see a clear answer to this. We're having the same issues with learnErrors that others have already reported, working on our HPCC. Is it correct that the only workaround is to run learnErrors with multithread=FALSE?
Thanks!
Can someone provide a minimal example? Attach the conda diagnostics (conda info, conda list, conda env export) as files, as well as some script and the necessary files to trigger the segfault.
As a quick comment, I am failing to replicate this error on my local machine. I installed dada2-1.10 thusly:
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels defaults
conda create -n dada2-1.10 python=3 anaconda
source activate dada2-1.10
conda install -c bioconda bioconductor-dada2=1.10.0
Inside that environment, I get the following R versions:
R -e "R.version$version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')"
[1] "R version 3.5.1 (2018-07-02)" Loading required package: Rcpp [1] ‘1.10.0’ [1] ‘0.12.18’ [1] ‘4.4.1’
The output of conda info is:
Current conda install:

platform : osx-64
conda version : 4.3.29
conda is private : False
conda-env version : 4.3.29
conda-build version : not installed
python version : 3.6.0.final.0
requests version : 2.12.4
root environment : /Users/bcallah/miniconda3 (writable)
default environment : /Users/bcallah/miniconda3/envs/dada2-1.10
envs directories : /Users/bcallah/miniconda3/envs
                   /Users/bcallah/.conda/envs
package cache : /Users/bcallah/miniconda3/pkgs
                /Users/bcallah/.conda/pkgs
channel URLs : https://repo.continuum.io/pkgs/main/osx-64
               https://repo.continuum.io/pkgs/main/noarch
               https://repo.continuum.io/pkgs/free/osx-64
               https://repo.continuum.io/pkgs/free/noarch
               https://repo.continuum.io/pkgs/r/osx-64
               https://repo.continuum.io/pkgs/r/noarch
               https://repo.continuum.io/pkgs/pro/osx-64
               https://repo.continuum.io/pkgs/pro/noarch
               https://conda.anaconda.org/bioconda/osx-64
               https://conda.anaconda.org/bioconda/noarch
               https://conda.anaconda.org/conda-forge/osx-64
               https://conda.anaconda.org/conda-forge/noarch
config file : /Users/bcallah/.condarc
netrc file : None
offline mode : False
user-agent : conda/4.3.29 requests/2.12.4 CPython/3.6.0 Darwin/16.7.0 OSX/10.12.6
UID:GID : 502:20
So any specifics on how to make this segfault happen would be much appreciated!
Here's what's doing it for me:
conda create -n dada2_test bioconductor-dada2=1.10
conda activate dada2_test
Then:
R -e "R.version$version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')"
[1] "R version 3.5.1 (2018-07-02)" Loading required package: Rcpp [1] ‘1.10.0’ [1] ‘1.0.1’ [1] ‘4.4.1’
And conda info:
active environment : dada2_test
active env location : /home/mworken/miniconda3/envs/dada2_test
shell level : 2
user config file : /home/mworken/.condarc
populated config files : /home/mworken/.condarc
conda version : 4.6.8
conda-build version : not installed
python version : 3.6.7.final.0
base environment : /home/mworken/miniconda3 (writable)
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/mworken/miniconda3/pkgs
/home/mworken/.conda/pkgs
envs directories : /home/mworken/miniconda3/envs
/home/mworken/.conda/envs
platform : linux-64
user-agent : conda/4.6.8 requests/2.21.0 CPython/3.6.7 Linux/3.10.0-514.26.2.el7.x86_64 centos/7.3.1611 glibc/2.17
UID:GID : 160032:3453
netrc file : None
offline mode : False
This gives the same error as above.
Main differences from @benjjneb are the system (OSX vs Linux), the conda version, and the Rcpp version. I tried downgrading Rcpp to 0.12.18: same error.
One more note on this: someone else I'm working with and I have both generated this error on the same system, with two different datasets. I'll see if I can put together a minimal example as well.
Here's a minimal example that will segfault with the setup above. But only when learnErrors is run on both sets of read files, and only with multithreading turned on. Either one by itself finishes fine, and multithread = FALSE also works.
Read files are from the dada2 tutorial.
library(dada2)
r1_files = c("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R1_001.fastq")
r2_files = c("MiSeq_SOP/F3D0_S188_L001_R2_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R2_001.fastq")
filt_r1 = c("filtered/F3D0_R1.fq.gz", "filtered/F3D141_R1.fq.gz")
filt_r2 = c("filtered/F3D0_R2.fq.gz", "filtered/F3D141_R2.fq.gz")
# so learnErrors doesn't choke on the raw read names
out = filterAndTrim(r1_files, filt_r1, r2_files, filt_r2)
err_r1 = learnErrors(filt_r1, multithread = 2)
err_r2 = learnErrors(filt_r2, multithread = 2)
Hi @benjjneb and @epruesse,
Thank you both for looking at this. I am working with @pdcountway to get this working. I followed your instructions to a T with a new Anaconda install. My package versions and environment seem to match up with yours (below). I'm not sure what is going on. We are using CentOS 7.6 and I am installing it as an environment module. I have attached the files that @epruesse requested: dada-diagnostic.txt
As for reproducing the error, I am following the dada2 tutorial (except skipping the plots). Everything works up until the error rates:
errF <- learnErrors(filtFs, multithread=TRUE)
33514080 total bases in 139642 reads from 20 samples will be used for learning the error rates.
caught segfault address 0xfffffffffffffff7, cause 'memory not mapped'
Traceback:
1: RcppParallel::setThreadOptions(numThreads = "auto")
2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction, selfConsist = TRUE, multithread = multithread, verbose = verbose, MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
3: learnErrors(filtFs, multithread = TRUE)
R.version.string;library(dada2);packageVersion('dada2');packageVersion('Rcpp');packageVersion('RcppParallel')
[1] "R version 3.5.1 (2018-07-02)" [1] ‘1.10.0’ [1] ‘0.12.18’ [1] ‘4.4.1’
conda info
active environment : dada2-1.10
active env location : /mnt/modules/bin/dada-2/1.10/envs/dada2-1.10
shell level : 1
user config file : /root/.condarc
populated config files : /root/.condarc
conda version : 4.6.8
conda-build version : 3.17.6
python version : 3.7.1.final.0
base environment : /mnt/modules/bin/dada-2/1.10 (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /mnt/modules/bin/dada-2/1.10/pkgs
/root/.conda/pkgs
envs directories : /mnt/modules/bin/dada-2/1.10/envs
/root/.conda/envs
platform : linux-64
user-agent : conda/4.6.8 requests/2.21.0 CPython/3.7.1 Linux/3.10.0-514.10.2.el7.x86_64 centos/7.6.1810 glibc/2.17
UID:GID : 0:0
netrc file : None
offline mode : False
Thanks again!
Just the env from @kguay: dada2-broken.txt
Download, rename to something.yaml, then run conda env create -n dada2-segfaulting --file something.yaml
(Sadly, GH does not allow attaching files ending in .yaml here...)
Ok, please run this with your broken conda environment activated:
ldd $CONDA_PREFIX/lib/R/library/data.table/libs/datatable.so | grep -vE "$CONDA_PREFIX|(lib(c|rt|dl|m|pthread)|ld-linux-x86-64|linux-vdso).so"
In my case, it says:
libgfortran.so.3 => not found
It looks like the package was linked to libgfortran.so.3 while it installs libgfortran.so.4, which in my case leads to no lib being found; in other cases it might lead to a mismatching one being loaded. Not sure if this is the cause - in theory, any matching libgfortran should work. In practice, there might be ODR violations in the library that cause the segfault.
In either case, this is a broken env.
Ok, just realized that it's data.table, not dada2, that was causing the loading problem for me in that env of yours.
@kguay Please set conda config --set channel_priority strict, try installing again, and see whether the issue persists.
The problem is that r-data.table was moved to conda-forge because it's not life-science specific. For some reason, in your env, it picked up the legacy bioconda package, which wasn't built with the current dependencies in mind and so didn't require the right things. This is something the conda solver sometimes comes up with, sadly. One way of fixing it is to tell it to never use a package from a lower-priority channel if it is available in a higher-priority one. That's the setting above.
Ok. One step closer. After installing r-data.table I got the one from conda-forge, and with this env I can reproduce the crash.
R -d gdb -f breaks.R
[...]
(gdb) run
[...]
> err_r2 = learnErrors(filt_r2, multithread = 2)
3420302 total bases in 13659 reads from 2 samples will be used for learning the error rates.
[Thread 0x7fffc5e3a700 (LWP 9382) exited]
Thread 1 "R" received signal SIGSEGV, Segmentation fault.
tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffccecf200,
number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0,
context=context@entry=0x7fffd1c90c60 <tbb::internal::the_dummy_context>) at ../../src/tbb/scheduler.cpp:320
320 ../../src/tbb/scheduler.cpp: No such file or directory.
(gdb) bt
#0 tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffccecf200,
number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0,
context=context@entry=0x7fffd1c90c60 <tbb::internal::the_dummy_context>) at ../../src/tbb/scheduler.cpp:320
#1 0x00007fffd1c7862f in tbb::internal::generic_scheduler::generic_scheduler (this=this@entry=0x7fffccecf200, m=...)
at ../../src/tbb/scheduler.cpp:100
#2 0x00007fffd1c7a8ea in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::custom_scheduler (
m=..., this=0x7fffccecf200) at ../../src/tbb/custom_scheduler.h:59
#3 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler (m=...)
at ../../src/tbb/custom_scheduler.h:120
#4 0x00007fffd1c78833 in tbb::internal::allocate_scheduler (m=...) at ../../src/tbb/scheduler.cpp:42
#5 tbb::internal::generic_scheduler::create_master (a=0x7fffccecba00) at ../../src/tbb/scheduler.cpp:1161
#6 0x00007fffd1c7385e in tbb::internal::governor::init_scheduler (num_threads=num_threads@entry=2,
stack_size=stack_size@entry=0, auto_init=auto_init@entry=false) at ../../src/tbb/governor.cpp:200
#7 0x00007fffd1c7393e in tbb::task_scheduler_init::initialize (this=0x55555629c490, number_of_threads=2,
thread_stack_size=0) at ../../src/tbb/governor.cpp:320
#8 0x00007fffd213e1c2 in setThreadOptions ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/library/RcppParallel/libs/RcppParallel.so
#9 0x00007ffff7c31e46 in do_dotcall ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#10 0x00007ffff7c6a5f4 in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#11 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#12 0x00007ffff7c7c020 in R_execClosure ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#13 0x00007ffff7c719fc in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#14 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#15 0x00007ffff7c7c020 in R_execClosure ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#16 0x00007ffff7c719fc in bcEval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#17 0x00007ffff7c7a390 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#18 0x00007ffff7c7c020 in R_execClosure ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#19 0x00007ffff7c7a55c in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#20 0x00007ffff7c7ec8a in do_set () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#21 0x00007ffff7c7a7e4 in Rf_eval () from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#22 0x00007ffff7cabd36 in Rf_ReplIteration ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#23 0x00007ffff7cac19a in R_ReplConsole ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#24 0x00007ffff7cac249 in run_Rmainloop ()
from /home/epruesse/miniconda3/envs/dada2_test/lib/R/bin/exec/../../lib/libR.so
#25 0x000055555555506d in main ()
#26 0x00007ffff791b09b in __libc_start_main (main=0x555555555050 <main>, argc=3, argv=0x7fffffffa878,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffa868)
at ../csu/libc-start.c:308
#27 0x000055555555509d in _start () at ../sysdeps/x86_64/elf/start.S:103
Continued from above:
(gdb) c
Continuing.
*** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'
Traceback:
1: RcppParallel::setThreadOptions(numThreads = multithread)
2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction, selfConsist = TRUE, multithread = multithread, verbose = verbose, MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
3: learnErrors(filt_r2, multithread = 2)
An irrecoverable exception occurred. R is aborting now ...
So on the C side, we have a crash inside of TBB somewhere, with #8 setThreadOptions the first function outside of R scope. On the R side, that's dada() calling RcppParallel::setThreadOptions(numThreads = multithread).
Don't ask me what's going on there. But if I run the code above with
err_r1 = learnErrors(filt_r1, multithread = 2)
err_r2 = learnErrors(filt_r2) #, multithread = 2)
it does not segfault.
It also does not segfault if I run R -d valgrind -f breaks.R, although it does take forever.
So we've got a concurrency bug here... hard to trace. I think we can assume that TBB and RcppParallel are less likely than dada2 to have bugs in this regard.
Ok, if I use multithread=TRUE, the first run finishes fine again and the second call to learnErrors crashes in dada2_uniques. It's reproducible in valgrind, though in GDB it crashes someplace else.
@benjjneb This is starting to get out of scope for me. Can you reproduce this? It probably needs building dada2 (and R) with -fsanitize=asan & friends (see https://cloud.r-project.org/doc/manuals/r-devel/R-exts.html#Debugging) to see where you've got the concurrency bug.
Thanks for working on this @epruesse. I just read through all of your replies. I don't think there are any action items for me, but I'm glad you were able to reproduce the problem. Let me know if you need anything else from me.
I was finally able to replicate this error. Hooray! (sort of) It required moving to a Linux system (Ubuntu), as it wasn't happening on OSX.
@benjjneb Ping me if I can help. The next challenge will be getting the crash with debug options so that you can track a little better where it's breaking.
FYI: Below are the options that Bioconda builds its binaries with. They are set by the compiler packages provided by Anaconda itself. If you install gxx_linux-64 or clangxx_osx-64 into an environment, they are set when you conda activate it. That might be the best way for you to reproduce the build as closely as possible.
I think the anaconda guys started with the RedHat defaults and added some conda specific flags. Not sure they are optimal, but we haven't even tried second guessing them because we've got enough things to worry about.
LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro \
-Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections
CXXFLAGS=-fvisibility-inlines-hidden -std=c++17 \
-fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize \
-fPIC -fstack-protector-strong -fno-plt -O2 \
-ffunction-sections -pipe
- -Wl,-O2 sends -O2 to the linker.
- -Wl,--sort-common tells the linker to sort common symbols by size (saves space because of alignment requirements).
- -Wl,--as-needed tells the linker to only add those -lXYZ libs to the list of requirements if they actually contribute symbols that are needed. Normally, everything is added. This reduces spurious dependencies.
- -Wl,-z,relro tells the linker to tell ld.so to mark relocatable code segments as read-only. Security thing.
- -Wl,-z,now tells the linker to tell ld.so to resolve all symbols on program start, not on demand. That way, things fail early if the symbols aren't there.
- -Wl,--disable-new-dtags essentially makes it so LD_LIBRARY_PATH doesn't affect the binary. The "old" dtag DT_RPATH, specifying where the dynamic linker should look for the requisite libraries, is tried before LD_LIBRARY_PATH; the "new" dtag DT_RUNPATH is tried after. In conda, LD_LIBRARY_PATH is turned off to avoid funky tools or cluster setups messing with which libraries should be loaded.
- -Wl,--gc-sections enables garbage collection on input sections. Basically reduces the DSO size a bit by tossing stuff that's never used.
- -fvisibility-inlines-hidden Phew. Basically, if you have an inline function, this marks it as "hidden", so it doesn't get exported as a symbol. The drawback is that you can't do pointer compares across objects, but that's rarely done. Makes things a little faster because you don't need to go through the PLT when calling the inline function from inside a single DSO.
- -std=c++17 sets the C++ standard to C++17. The reason this is set globally is that some tools (Boost System, e.g.) violate ODR in that they emit different symbols for different standards. There are probably other culprits out there. It's safer to compile everything with the same C++ version.
- -fmessage-length=0 trivial: don't word-wrap compiler messages.
- -march=nocona use CPU features available on nocona and up.
- -mtune=haswell tune for haswell CPUs.
- -ftree-vectorize enable the auto-vectorizer.
- -fPIC generate position-independent code (since this is global, it makes executables "look" like libraries to older versions of file; newer versions recognize the difference). This is needed by modern exploit mitigation strategies: randomizing address layout and so on.
- -fstack-protector-strong another security feature. Basically wraps functions with some code that checks for a broken stack. strong makes it apply to a few more functions; essentially, if you have local pointers or buffers, it gets turned on.
- -fno-plt another complicated one. Makes things faster by using the GOT instead of the PLT to get the address for external function calls. If I understand correctly, this is allowed because of -Wl,-z,now above.
- -O2 obvious, the classic optimize-everything-but-the-crazy-risky-stuff.
- -ffunction-sections creates individual sections for each function. Good for static, bad for dynamic libs. Not sure why they added this one, actually.
- -pipe speeds up compiling a tiny little bit by piping data between processes rather than using temp files.

We just discovered that the learnErrors function works with anything other than multithread=2 (or TRUE). So, 1, 3, 4, and 100 all work.
@kguay I've been getting this error using multithread = 80.
@apcamargo How strange. 80 works for me, but 2 still crashes.
@kguay @apcamargo It just crashed for me with multithread=16, but not until I ran the learnErrors command for the second time (on the reverse reads). It worked fine the first time I ran it (on the forward reads). I believe that @epruesse reported similar behavior.
Hey @benjjneb, just wanted to let you know @ebolyen and I have blocked some time out to dive into this issue a bit more, will keep you posted. Thanks!
@benjjneb @thermokarst @ebolyen I've got a goodie for you that might help debugging and testing:
Our infrastructure now makes package previews available for all PRs opened at bioconda/bioconda-recipes. You get direct download links for the packages and a conda install -c bioconda -c conda-forge -c https://.... snippet that allows you to install the package locally. It's effectively a one-PR micro-channel, so you can even test the effects on whole environments.
You are welcome to use this feature for debugging dada2, e.g. by creating a PR with just a build-number bump and debug flags set. If you need to change the source URLs to point to non-release tarballs, you may have to also add a flag to override the linter to get through to the build stages.
Hope that helps a little. I'm really tight on time and have to force myself to not start digging into this. Ping me if you need help with the Bioconda side of things though!
Thanks @epruesse, that is super awesome!
Recreating what @epruesse did above, with valgrind (just dumping here so that I don't lose this):
R -d valgrind
==23903== Memcheck, a memory error detector
==23903== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==23903== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==23903== Command: /home/matthew/.conda/envs/dada2_debug/lib/R/bin/exec/R
==23903==
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(dada2)
Loading required package: Rcpp
>
> r1_files = c("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R1_001.fastq")
> r2_files = c("MiSeq_SOP/F3D0_S188_L001_R2_001.fastq", "MiSeq_SOP/F3D141_S207_L001_R2_001.fastq")
>
> filt_r1 = c("filtered/F3D0_R1.fq.gz", "filtered/F3D141_R1.fq.gz")
> filt_r2 = c("filtered/F3D0_R2.fq.gz", "filtered/F3D141_R2.fq.gz")
>
> # so learnErrors doesn't choke on the raw read names
> out = filterAndTrim(r1_files, filt_r1, r2_files, filt_r2)
> err_r1 = learnErrors(filt_r1, multithread = 2)
3426981 total bases in 13659 reads from 2 samples will be used for learning the error rates.
> err_r2 = learnErrors(filt_r2, multithread = 2)
3420302 total bases in 13659 reads from 2 samples will be used for learning the error rates.
==23903== Invalid read of size 8
==23903== at 0x23C9A548: tbb::internal::generic_scheduler::allocate_task(unsigned long, tbb::task*, tbb::task_group_context*) (scheduler.cpp:320)
==23903== by 0x23C9A62E: tbb::internal::generic_scheduler::generic_scheduler(tbb::internal::market&) (scheduler.cpp:100)
==23903== by 0x23C9C8E9: custom_scheduler (custom_scheduler.h:59)
==23903== by 0x23C9C8E9: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler(tbb::internal::market&) (custom_scheduler.h:120)
==23903== by 0x23C9A832: allocate_scheduler (scheduler.cpp:42)
==23903== by 0x23C9A832: tbb::internal::generic_scheduler::create_master(tbb::internal::arena*) (scheduler.cpp:1161)
==23903== by 0x23C9585D: tbb::internal::governor::init_scheduler(int, unsigned long, bool) (governor.cpp:200)
==23903== by 0x23C9593D: tbb::task_scheduler_init::initialize(int, unsigned long) (governor.cpp:320)
==23903== by 0x23CF51C1: setThreadOptions (in /home/matthew/.conda/envs/dada2_debug/lib/R/library/RcppParallel/libs/RcppParallel.so)
==23903== by 0x4941E45: do_dotcall (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903== by 0x497A5F3: bcEval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903== by 0x498A38F: Rf_eval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903== by 0x498C01F: R_execClosure (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903== by 0x49819FB: bcEval (in /home/matthew/.conda/envs/dada2_debug/lib/R/lib/libR.so)
==23903== Address 0xfffffffffffffff7 is not stack'd, malloc'd or (recently) free'd
==23903==
*** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'
Traceback:
1: RcppParallel::setThreadOptions(numThreads = multithread)
2: dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction, selfConsist = TRUE, multithread = multithread, verbose = verbose, MAX_CONSIST = MAX_CONSIST, OMEGA_C = OMEGA_C, ...)
3: learnErrors(filt_r2, multithread = 2)
Hey @benjjneb, just wanted to let you know @ebolyen and I have blocked some time out to dive into this issue a bit more, will keep you posted. Thanks!
If you can let me know in advance at all I will try to block out the same time as well and be available online and to try out code changes. I admit I'm struggling with this one myself.
The catch with the error above is that it's the symptom, not the cause. 0xfffffffffffffff7 is -9, not a likely memory address. It's curious that everyone sees that same address. Sounds like something writes that number into places it doesn't belong. That itself might yet be another symptom, though, so it might not be the easiest "thread" (no pun) to pull on.
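As a quick check on the arithmetic (a hypothetical snippet, not from the thread), interpreting that bit pattern as a signed 64-bit integer does give -9:

```python
# The faulting address seen in every traceback above.
addr = 0xFFFFFFFFFFFFFFF7

# Interpret the unsigned 64-bit pattern as a two's-complement signed value.
signed = addr - (1 << 64) if addr >= (1 << 63) else addr
print(signed)  # -9

# And going the other way: -9 masked to 64 bits is the faulting address.
print(hex((-9) & ((1 << 64) - 1)))  # 0xfffffffffffffff7
```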
I saw that you have written some custom C-style data structures and allocation functions. If you share and access those from multiple threads, you have to be really careful about concurrency. Writing thread-safe data structures is a seriously complicated art. Consider that something as unassuming as ++i is not thread safe (unless i is a std::atomic<int>, which might just use a mutex internally to make sure you're ok). I don't have the time to review the code completely, but start looking at the places shared by threads: the writes, allocations, and deallocations. You probably have to throw in a few mutexes; if done right, they won't cost that much time.
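To illustrate the point about unsynchronized increments, here is a minimal sketch of the mutex approach (in Python rather than the C++ under discussion; all names are made up for the example): every read-modify-write on the shared counter happens under a lock, so no increment is lost.

```python
import threading

N_THREADS = 4
N_INCREMENTS = 100_000

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(N_INCREMENTS):
        # Without the lock, two threads can read the same old value and
        # one of the two increments is silently lost -- the ++i problem.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: exact, because every increment is serialized
```

Dropping the `with lock:` line turns this into a data race; the final count then depends on thread scheduling and is usually below 400000.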
I might be wrong. I've been before. This type of bug is hell. All I know for certain is that there are similar ones lurking in SINA, waiting to be "expressed".
Googling about, I found some highly relevant links I think:
https://github.com/cms-sw/cmssw/issues/14409 https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/636519
It looks like TBB fails with "invalid read of size 8" relatively regularly, and it looks at first glance like an initialization out-of-order problem.
It is also possible that TBB is linked wrong, but that seems less likely as I would expect half of bioconda to have already exploded.
TBB is from Conda-Forge, so not even just Bioconda. Moreover, IIRC the recipe was actually provided by Intel engineers. Still, just trying a few different versions of TBB might be worth exploring. Their ABI is very stable, so probably not too complicated to do.
Turns out RcppParallel vendors TBB, so it's not a link issue. Double init still seems like a good option, so I'm going to look through the RcppParallel code to see if I can find a place for that to happen. @thermokarst is testing to see if compiling a new RcppParallel helps at all.
RcppParallel 4.4.2 does not segfault for me when running the MWE above.
I wasn't able to find the GitHub commit that relates to `4.3.20`, but I pulled the source down and compared it to master. (EDIT: as usual, I'm looking at an old version masked by a different channel :-/) Nothing is different with respect to TBB init (and they are doing it the right way), so I don't think DADA2 can cause a double init here. Also, after walking through DADA2's C++, I don't see how it would have even happened. The `rcpp::setThreadOptions` call occurs well before anything interesting even happens...
@thermokarst, if you compile the old RcppParallel, what do we see?
Steps to reproduce https://github.com/benjjneb/dada2/issues/684#issuecomment-482685395
conda create -n dada2_debug \
-c conda-forge \
-c bioconda \
-c defaults \
--override-channels \
bioconductor-dada2=1.10
conda activate dada2_debug
# Using RcppParallel 4.4.1 (from conda-forge)
> library(dada2)
> filterAndTrim("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "filtered/F3D0_R1.fq.gz")
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
# segfaults
# upgrade RcppParallel
conda uninstall --force r-rcppparallel
R -e 'install.packages("RcppParallel", repos="http://cran.us.r-project.org")'
# Using RcppParallel 4.4.2 (from CRAN)
> library(dada2)
> filterAndTrim("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq", "filtered/F3D0_R1.fq.gz")
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
> learnErrors("filtered/F3D0_R1.fq.gz", multithread = 2)
# does not segfault
@thermokarst, if you compile the old RcppParallel, what do we see?
Compiling RcppParallel 4.4.1 from source:
conda uninstall --force r-rcppparallel
R -e 'install.packages("https://cran.r-project.org/src/contrib/Archive/RcppParallel/RcppParallel_4.4.1.tar.gz", repos=NULL, type="source")'
The MWE above completes without segfaulting. It looks like this is a conda build-environment issue, since the same version of RcppParallel worked just fine when built locally. Getting closer!
Extremely relevant: https://github.com/01org/tbb/issues/83
Since RcppParallel vendors TBB, it is probably being compiled without the workaround needed for GCC7.
OK, we're pretty convinced it's a missing compile flag for Linux (`-flifetime-dse=1`).
We're going to test this theory by building the conda-forge recipe, via the bioconda tools (since that's really easy). We're expecting to see the same issue with the local build.
Then we'll update the recipe to use:
$R CMD INSTALL --configure-args="CXXFLAGS='-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -flifetime-dse=1'" --build .
This uses the same `CXXFLAGS`, but adds `-flifetime-dse=1`.
@epruesse, assuming this works, there's probably a better way to set that compiler flag. Do you have any thoughts on that?
Update, we are actually building and testing locally using the conda-forge tooling, since the package recipe is coming from conda-forge (plus, tooling exists here).
We have a build that fails locally!
OK, so we are feeling pretty confident we know what's going on.
Here's what we know:
The `-flifetime-dse` flag is part of the Makefile; however, it is detected by probing `gcc`.
We see during the conda build:
make -e -C "../build/lib_release" -r -f ../../build/Makefile.tbb cfg=release
make[2]: Entering directory '$SRC_DIR/src/tbb/build/lib_release'
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
/bin/sh: gcc: command not found
Which looks a lot like those feature-flag tests failing to produce anything useful, meaning we don't end up with the `-flifetime-dse=1` flag (and probably other important flags).
This is why a compile from source (outside of conda) seems to work fine: it has the `gcc` binary and detects the appropriate flags to use.
What we're struggling to do is convince `make` to use the conda compiler defined in `$CC` instead of a binary named `gcc`. This seems like a tractable problem, however, so we're pretty optimistic.
That said, we're stepping away for the weekend, if anyone wants to try fixing the RcppParallel build in the interim, please let us know how it goes! We'll see if we can pick this back up on Monday.
Couldn't you just add `-flifetime-dse=1` to the `CXXFLAGS` for Rcpp and dada2 to see if this fixes things?
I'm bumping into segmentation faults when trying to use `learnErrors`. This happened with versions 1.8.0 and 1.10.0; I don't have this problem in version 1.6.0. Here's the traceback when I was using version 1.10.0.