NaluCFD / Nalu

Nalu: a generalized unstructured massively parallel low Mach flow code designed to support a variety of open applications of interest built on the Sierra Toolkit and Trilinos solver Tpetra solver stack. The open source BSD, clause 3 license model has been chosen for the code base. See LICENSE for more information.
https://github.com/NaluCFD/Nalu
Other
142 stars 66 forks source link

Changes to Trilinos causing diffs #294

Closed jrood-nrel closed 6 years ago

jrood-nrel commented 6 years ago

There are 17 tests that are resulting in diffs. After investigation using the latest Trilinos develop branch and comparing against the develop branch from a few weeks ago, the diffs do not happen in the older Trilinos version. This is using the latest Nalu as of Jan 5. Therefore, something must have changed in Trilinos January 3rd that is causing the issue.

spdomin commented 6 years ago

@jrood-nrel, we need to let the Trillions team know this and perform a bisect. Have you done one of those before? It generally takes a couple of hours with the build/bisect iteration - most of which can be done in the background during a normal days work. @mhoemmen may also have perspective from the Sierra side of using Trilinos.

Let me know if you can submit the Trilinos issue and if you need help on the bisect. Alternatively, we can pick someone at the stand-up to do this work.

Do we have both master and develop tested nightly in Nalu?

Finally, did the clang set-fault start at the same time these diffs started?

jrood-nrel commented 6 years ago

Ok I will find the commit causing the diffs and post the issue to Trilinos. I can probably just modify my testing script to inject the bisect before Trilinos builds.

We have been testing against the Trilinos master branch since October on the Mac. Waiting for the testing machine when I will have enough resources to test it on Linux.

Actually there aren't clang segfaults. They are with GCC+OpenMP on OSX and they started the same day, but I am not yet certain they were the same commit due to a separate error that stopped Nalu from building on the Mac that day.

jrood-nrel commented 6 years ago

Here are the results of the bisect with 4 neighboring commits from Jan 3rd that are related:

Trilinos develop commits:
bd5e7e36346b88dc7984acd660f2e6138672a96 bad
841e89f5cb58e7d034ab6c68fb3887b2d39c5f5 good
bb82440caacd3b154a8bc087f1d081ca39a108f good
1565f43d2a42c7fe92f0f8bd74367a26f0664d1 good

@spdomin Do we want to submit an issue to Trilinos regarding this, or shall we accept the diffs?

mhoemmen commented 6 years ago

@jrood-nrel Thanks for digging into this! It looks like commit https://github.com/trilinos/Trilinos/commit/bd5e7e36346b88dc7984acd660f2e6138672a96a enabled thread-parallel sparse matrix-matrix multiply by default, when using the OpenMP execution space with Tpetra. This may give a slightly different multigrid hierarchy, and thus slightly different linear solver results in exact arithmetic. Iterative linear solvers only promise accuracy with respect to an L2 norm; they don't promise bitwise identical results.

I can't promise that there is no bug, but @csiefer2, @jhux2, and @mndevec worked hard on this, and it's been tested pretty well. The code was there for a long time and was tested separately, but was not enabled by default.

Do the diffs go away if you set OMP_NUM_THREADS=1? Are the diffs large enough to be suspicious?

mhoemmen commented 6 years ago

Also, the commit in question (or its issue) should have been marked "Results Impacting" per Trilinos policy. I'm grateful to see this code finally enabled, though. I know those folks worked really hard on it, and it will help make multigrid setup more thread scalable.

csiefer2 commented 6 years ago

@mhoemmen is correct. If you weren't configuring with -DKokkosKernels_ENABLE_Experimental=ON before, Trilinos' default behavior did change.

If you can hand me something to reproduce the problem (a particular SHA1 for Nalu, a list of failing tests and a link to Nalu build instructions) I can investigate.

jrood-nrel commented 6 years ago

@mhoemmen Sounds good. Thanks for replying to this. I'm wondering if these commits are causing the segfaults on OSX with GCC and OMP? I will try to test that as well. The tests are currently run with OMP_NUM_THREADS=1; OMP_PROC_BIND=FALSE. It looks like the results are deterministic with OMP_NUM_THREADS=1. The largest diff I'm seeing is 1e-5 and we diff against norms so they can get fairly large in my opinion and still be legit. Based on this, I am fine updating our gold files.

@spdomin Based on this information, shall we update the gold files?

mhoemmen commented 6 years ago

@jrood-nrel, @csiefer2 tells me that he would like to investigate a bit before y'all update your gold files. Thanks!

jrood-nrel commented 6 years ago

@mhoemmen No problem.

@csiefer2 The documentation for building Nalu is http://nalu.readthedocs.io/en/latest/source/user/building.html . You should be fine using the latest Nalu master branch. You can see the test results at https://my.cdash.org/index.php?project=Nalu . If you look at the notes for each build, you can see more about how the testing is happening. The peregrine machine with GCC is the testing that uses 1e-15 constraints, so pay attention to the 17 failing tests there regarding this. Let me know if you have any questions.

jhux2 commented 6 years ago

@csiefer2 Depending on what platform you decide to use, I can help you with the Nalu build. @spdomin has prebuilt TPLs on skybridge and ghost.

jrood-nrel commented 6 years ago

So what I feel are the same set of commits seem to also be causing the segfaults I am seeing on OSX with GCC 7.2.0 and OpenMP on. Therefore I feel like even if we disregard the diffs, there seems to be a bug in this set of commits. The results of the segfault bisect are:

Neighboring Trilinos develop commits regarding segfaults on OSX with GCC and OpenMP:
841e89f5cb58e7d034ab6c68fb3887b2d39c5f5 bad
bb82440caacd3b154a8bc087f1d081ca39a108f good
1565f43d2a42c7fe92f0f8bd74367a26f0664d1 good
mhoemmen commented 6 years ago

FYI @jrood-nrel is referring to https://github.com/trilinos/Trilinos/issues/2130 ; see e.g., https://github.com/trilinos/Trilinos/issues/2130#issuecomment-356744081

spdomin commented 6 years ago

Thank you for all of the discussion and work on this front. Let me know if I can help.

The pre-built Ghost and Skybridge TPLS, as Jonathan notes, are for intel-based builds. However, these diffs are on our linux-flavor gold development platform. Following the SPACK build process is probably easiest to follow.

Of coarse, if you are simply evaluating the new and old behavior, any platform might work and, as such, the Ghost and Skybridge pre-built locations are ready for your usage. I may need to add Chris to the naku-users group - something that I will do now.

jhux2 commented 6 years ago

@spdomin Have you or anyone else done a SPACK build for the CEE lan?

jhux2 commented 6 years ago

@csiefer2 I've built nalu against Trilinos develop in RELWITHDEBINFO mode. Have a look on cee-compute011, /scratch/jhu/nalu.

csiefer2 commented 6 years ago

I can reproduce the tolerance diffs. (Thanks for the assist @jhux2 )

jrood-nrel commented 6 years ago

I disabled openmp on all the test machines for the moment and the diffs aren't happening.

csiefer2 commented 6 years ago

As for the OSX errors, I would highly recommend setting OMP_NUM_THREADS in your testing script so that the below warning goes away. From my experience at least, oversubscribing codes (especially if OMP_PROC_BIND is set) can cause segfaults.

Kokkos::OpenMP::initialize WARNING: You are likely oversubscribing your CPU cores. Detected: 4 cores per node. Detected: 8 MPI_ranks per node. Requested: 1 threads per process.

csiefer2 commented 6 years ago

As for the diffs on peregrine I want to do a bit more digging on uqSlidingMeshDG, because those look pretty big

jrood-nrel commented 6 years ago

@csiefer2 If you look at the notes on the OSX builds, OMP_NUM_THREADS=1 and OMP_PROC_BIND=false are set before the tests are run. The OSX machine is dual core and many of the tests run with mpiexec -np >2 which is why those warnings appear. According to that output it states that the test machine has 4 logical cores and it is running with 8 ranks and 1 thread for that test, which is all true. My issue regarding this warning is listed here https://github.com/kokkos/kokkos/issues/1142 and has been resolved but I haven't implemented explicitly turning the warning off yet. The elemClosedDomain test runs with np = 2 with no KOKKOS warnings and that test still segfaults. From my experience, segfaults are usually a bug in the application and not the machine configuration. These segfaults have never happened before Trilinos commit 841e89f5cb58e7d034ab6c68fb3887b2d39c5f5.

csiefer2 commented 6 years ago

Drat. I was hoping for an easy answer.

@jhux2 Do you have a OSX machine you'd care to let us try Nalu builds on?

jhux2 commented 6 years ago

@csiefer2 I don't. @jrood-nrel Is it possible to get accounts on the NREL mac test machines?

csiefer2 commented 6 years ago

@jrood-nrel What would also be useful is a valgrind of one of the failing tests...

jhux2 commented 6 years ago

What would also be useful is a valgrind of one of the failing tests...

@jrood-nrel Would it be possible for you to run one of the small tests by hand with Valgrind on the test machine, and send us the results?

jrood-nrel commented 6 years ago

Sorry but the Mac test machine is essentially categorized as my personal machine, so I can't give access to it. I can give you an approximate list of steps to install if you have access to OSX as such:

#/bin/bash
# brew install gcc #7.2.0
# brew install pkg-config; brew install cmake #as well if on OSX Sierra
cd ${HOME} && git clone https://github.com/spack/spack.git
cd ${HOME} && git clone https://github.com/NaluCFD/NaluSpack.git
# Then comment out the line that disables OpenMP in NaluSpack/install_scripts/install_nalu_gcc_mac.sh
SPACK_ROOT=${HOME}/spack
source ${SPACK_ROOT}/share/spack/setup-env.sh
cd ${HOME}/NaluSpack/spack_config && ./setup_spack.sh
cd ${HOME}/NaluSpack/install_scripts && ./install_nalu_gcc_mac.sh

Or you can browse http://nalu.readthedocs.io/en/latest/source/user/build_spack.html#mac-os-x-or-linux for slightly more info.

I can't work on this again until next week, but I can try to help with valgrind output then.

jrood-nrel commented 6 years ago

Maybe GCC 7.2.0 exhibits this same behavior on Linux? I haven't tried that. I could also try that next week.

jhux2 commented 6 years ago

@jrood-nrel We don't have access to a Mac, otherwise we'd try reproducing this locally. The Trilinos framework team is in the process of adding gcc 7.2.0 coverage on Linux to the Trilinos dashboard (issue trilinos/trilinos#2028).

jrood-nrel commented 6 years ago

@jhux2 Ok, no problem. I will try to go in depth with the segfaults and valgrind and try GCC 7.2.0 on Linux and relay the information I find.

jrood-nrel commented 6 years ago

I put together some valgrind output while waiting on a job to run. When building Trilinos with build_type=debug, I wasn't able to reproduce the segfault. Here is the output from the release build (-np 2 so twice the output):

==36379== Command: /Users/user/Nalu/build/naluX -i /Users/user/Nalu/reg_tests/test_files/elemClosedDomain/elemClosedDomain.i -o elemClosedDomain.log
26: ==36379== 
26: ==36380== Syscall param msg->desc.port.name points to uninitialised byte(s)
26: ==36380==    at 0x105D9734A: mach_msg_trap (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36380==    by 0x105D96796: mach_msg (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36380==    by 0x105D90485: task_set_special_port (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36380==    by 0x105F2C10E: _os_trace_create_debug_control_port (in /usr/lib/system/libsystem_trace.dylib)
26: ==36380==    by 0x105F2C458: _libtrace_init (in /usr/lib/system/libsystem_trace.dylib)
26: ==36380==    by 0x10502F9DF: libSystem_initializer (in /usr/lib/libSystem.B.dylib)
26: ==36380==    by 0x104217A1A: ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
26: ==36380==    by 0x104217C1D: ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
26: ==36380==    by 0x1042134A9: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36380==    by 0x104213440: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36380==    by 0x104212523: ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36380==    by 0x1042125B8: ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) (in /usr/lib/dyld)
26: ==36380==  Address 0x10c28891c is on thread 1's stack
26: ==36380==  in frame #2, created by task_set_special_port (???:)
26: ==36380== 
26: ==36379== Syscall param msg->desc.port.name points to uninitialised byte(s)
26: ==36379==    at 0x105D9734A: mach_msg_trap (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36379==    by 0x105D96796: mach_msg (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36379==    by 0x105D90485: task_set_special_port (in /usr/lib/system/libsystem_kernel.dylib)
26: ==36379==    by 0x105F2C10E: _os_trace_create_debug_control_port (in /usr/lib/system/libsystem_trace.dylib)
26: ==36379==    by 0x105F2C458: _libtrace_init (in /usr/lib/system/libsystem_trace.dylib)
26: ==36379==    by 0x10502F9DF: libSystem_initializer (in /usr/lib/libSystem.B.dylib)
26: ==36379==    by 0x104217A1A: ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
26: ==36379==    by 0x104217C1D: ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) (in /usr/lib/dyld)
26: ==36379==    by 0x1042134A9: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36379==    by 0x104213440: ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36379==    by 0x104212523: ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) (in /usr/lib/dyld)
26: ==36379==    by 0x1042125B8: ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) (in /usr/lib/dyld)
26: ==36379==  Address 0x10c28891c is on thread 1's stack
26: ==36379==  in frame #2, created by task_set_special_port (???:)
26: ==36379== 
26: --36379-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
26: --36380-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
26: --36379-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
26: --36380-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
26: --36379-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
26: --36380-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
26: --36379-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
26: --36380-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 8 times)
26: ==36380== Thread 2:
26: ==36380== Invalid read of size 4
26: ==36380==    at 0x105EF4899: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
26: ==36380== 
26: ==36380== Invalid read of size 8
26: ==36380==    at 0x105EF29D6: pthread_getspecific (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x10667541F: show_stackframe (in /Users/user/spack/opt/spack/darwin-sierra-x86_64/gcc-7.2.0/openmpi-1.10.4-fs62tl3cts5jmac6xsqlcy6cnal4ro2w/lib/libopen-pal.13.dylib)
26: ==36380==    by 0x25805BBB1: ???
26: ==36380==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==  Address 0x50 is not stack'd, malloc'd or (recently) free'd
26: ==36380== 
26: ==36380== 
26: ==36380== Process terminating with default action of signal 11 (SIGSEGV)
26: ==36380==  Access not within mapped region at address 0x50
26: ==36380==    at 0x105EF29D6: pthread_getspecific (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x10667541F: show_stackframe (in /Users/user/spack/opt/spack/darwin-sierra-x86_64/gcc-7.2.0/openmpi-1.10.4-fs62tl3cts5jmac6xsqlcy6cnal4ro2w/lib/libopen-pal.13.dylib)
26: ==36380==    by 0x25805BBB1: ???
26: ==36380==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36380==  If you believe this happened as a result of a stack
26: ==36380==  overflow in your program's main thread (unlikely but
26: ==36380==  possible), you can try to increase the size of the
26: ==36380==  main thread stack using the --main-stacksize= flag.
26: ==36380==  The main thread stack size used in this run was 67104768.
26: --36380:0:schedule VG_(sema_down): read returned -4
26: ==36379== Thread 2:
26: ==36379== Invalid read of size 4
26: ==36379==    at 0x105EF4899: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
26: ==36379== 
26: ==36380== 
26: ==36380== HEAP SUMMARY:
26: ==36380==     in use at exit: 670,873 bytes in 6,778 blocks
26: ==36380==   total heap usage: 10,340 allocs, 3,562 frees, 978,105 bytes allocated
26: ==36380== 
26: ==36379== Invalid read of size 8
26: ==36379==    at 0x105EF29D6: pthread_getspecific (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x10667541F: show_stackframe (in /Users/user/spack/opt/spack/darwin-sierra-x86_64/gcc-7.2.0/openmpi-1.10.4-fs62tl3cts5jmac6xsqlcy6cnal4ro2w/lib/libopen-pal.13.dylib)
26: ==36379==    by 0x25805BBB1: ???
26: ==36379==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==  Address 0x50 is not stack'd, malloc'd or (recently) free'd
26: ==36379== 
26: ==36379== 
26: ==36379== Process terminating with default action of signal 11 (SIGSEGV)
26: ==36379==  Access not within mapped region at address 0x50
26: ==36379==    at 0x105EF29D6: pthread_getspecific (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x10667541F: show_stackframe (in /Users/user/spack/opt/spack/darwin-sierra-x86_64/gcc-7.2.0/openmpi-1.10.4-fs62tl3cts5jmac6xsqlcy6cnal4ro2w/lib/libopen-pal.13.dylib)
26: ==36379==    by 0x25805BBB1: ???
26: ==36379==    by 0x105EF4886: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==    by 0x105EF408C: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
26: ==36379==  If you believe this happened as a result of a stack
26: ==36379==  overflow in your program's main thread (unlikely but
26: ==36379==  possible), you can try to increase the size of the
26: ==36379==  main thread stack using the --main-stacksize= flag.
26: ==36379==  The main thread stack size used in this run was 67104768.
26: --36379:0:schedule VG_(sema_down): read returned -4
26: ==36379== 
26: ==36379== HEAP SUMMARY:
26: ==36379==     in use at exit: 670,873 bytes in 6,778 blocks
26: ==36379==   total heap usage: 10,340 allocs, 3,562 frees, 978,105 bytes allocated
26: ==36379== 
26: ==36380== LEAK SUMMARY:
26: ==36380==    definitely lost: 80,205 bytes in 159 blocks
26: ==36380==    indirectly lost: 106,245 bytes in 492 blocks
26: ==36380==      possibly lost: 840 bytes in 10 blocks
26: ==36380==    still reachable: 476,175 bytes in 6,029 blocks
26: ==36380==         suppressed: 7,408 bytes in 88 blocks
26: ==36380== Rerun with --leak-check=full to see details of leaked memory
26: ==36380== 
26: ==36380== For counts of detected and suppressed errors, rerun with: -v
26: ==36380== Use --track-origins=yes to see where uninitialised values come from
26: ==36380== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 1 from 1)
mndevec commented 6 years ago

@jrood-nrel What are the Kokkos execution spaces that are enabled here? Does it enable Kokkos threads execution space? It is interesting to see pthread related errors. --Sorry if I am missing a Trilinos configure file that was already provided.

jrood-nrel commented 6 years ago

@mndevec This is what I'm seeing for a summary of the Kokkos configuration. OpenMP is explicitly turned on during configure with -DKokkos_ENABLE_OpenMP:BOOL=ON. There are some pthreads related things that appear to be turned on such as Setting KokkosCore_ENABLE_Pthread=ON since TPL_ENABLE_Pthread=ON, though I am not enabling it explicitly.

-- ****************** Kokkos Settings ******************
-- Execution Spaces
--   Device Parallel: None
--     Host Parallel: OpenMP
--       Host Serial: Serial
--
-- Architectures:
--     None
--
-- Enabled options
--   KOKKOS_ENABLE_PROFILING
--
-- Final kokkos settings variable:
--   env;KOKKOS_SRC_PATH=/Users/user/spack/var/spack/stage/trilinos-develop-cdyz2nenor6k26yshxhrgbrx3aa4vwc7/Trilinos/packages/kokkos;KOKKOS_PATH=/Users/user/spack/var/spack/stage/trilinos-develop-cdyz2nenor6k26yshxhrgbrx3aa4vwc7/Trilinos/packages/kokkos;KOKKOS_INSTALL_PATH=/Users/user/spack/opt/spack/darwin-sierra-x86_64/gcc-7.2.0/trilinos-develop-cdyz2nenor6k26yshxhrgbrx3aa4vwc7;KOKKOS_ARCH=None;KOKKOS_DEVICES=OpenMP,Serial;KOKKOS_DEBUG=no;KOKKOS_OPTIONS=disable_dualview_modify_check
-- *****************************************************
mndevec commented 6 years ago

Kokkos does not support enabling of both OpenMP and Pthreads at the same time but it looks like it is not the case. I think KokkosCore_ENABLE_Pthread is an unused cmake variable.

Is it possible to add below to the executable to see the all kokkos macros? Kokkos::print_configuration(cout);

jrood-nrel commented 6 years ago

I can reproduce the segfaults with GCC 7.2.0 on Linux with OpenMP on. So it turns out the segfaults are not an OSX issue.

mhoemmen commented 6 years ago

FYI this might be relevant: https://github.com/trilinos/Trilinos/issues/2161#issuecomment-358151082

jrood-nrel commented 6 years ago

@mhoemmen I just tried the develop branch which is 5135ad8 [Wed Jan 17 10:22:22 2018 -0800] and I still get segfaults on Linux with GCC 7.2.0 with OpenMP enabled so it doesn't appear commit d86ccdc made a difference.

mhoemmen commented 6 years ago

@csiefer2 is working on it. Thanks!

csiefer2 commented 6 years ago

A brief update: Heisenbug! DEBUG builds work fine and adding output statements changes the code behavior.

jrood-nrel commented 6 years ago

@csiefer2 Thanks. Both @jhux2 and I have seen the Debug build silence the errors as well https://github.com/trilinos/Trilinos/issues/2130#issuecomment-358474923

mhoemmen commented 6 years ago

Correction: @csiefer2 is working hard on it and @jhux2 is also helping out :-)

csiefer2 commented 6 years ago

I just pushed a GCC 7.2.0-specific workaround into the code, switching the code path to make the troublesome error disappear (at least in the Trilinos tests).

We will leave the associated Trilinos issue open, but I didn't want to continue blocking Nalu for something this hard to track down.

jrood-nrel commented 6 years ago

@csiefer2 Can you confirm that this implemented workaround will fix the diffs we were seeing and not just the segfault error?

jhux2 commented 6 years ago

@jrood-nrel Remind me -- what version of gcc were you seeing the diffs with?

jrood-nrel commented 6 years ago

@jhux2 It was both GCC 5.2.0 and 4.9.2 showing diffs, each on different machines.

jrood-nrel commented 6 years ago

I'm still seeing diffs when I enable OpenMP. I think we may just have to accept it. I think I would be fine including another build line to test with OpenMP both enabled and disabled, and turning down the tolerance on the OpenMP enabled configuration, or we can update the golds to either reflect OpenMP enabled or disabled, though I haven't checked if the OpenMP enabled results are deterministic.

spdomin commented 6 years ago

@jrood-nrel, before we accept diffs, which I agree may be what we do in the end, it would be nice to hear from the Trilinos team on whether or not activating openMP with num_threads = 1 results in a different Trilinos code path? This statement was hypothesized in our standup today.

mhoemmen commented 6 years ago

@spdomin The "OpenMP with OMP_NUM_THREADS=1 path may invoke a different sparse matrix-matrix multiply routine, so yes. Lots of MueLu developers are away (abroad, in fact) at a conference this week, so it would be better to wait and ask next week once they get back.

jrood-nrel commented 6 years ago

Ok great. Thanks for picking this conversation back up.

jrood-nrel commented 6 years ago

I think enough time has passed that it makes sense to accept these changes to OpenMP behavior and I will keep the OMP disabled behavior as the gold standard to machine precision and relax the tolerance where necessary on the Linux/GCC cases with OMP enabled. I'm sure when I start to test with OMP_NUM_THREADS>1 at some point in the future, the results in which we check for diffs will become non-deterministic anyway and will likely require this approach regardless. I will close this issue this week and proceed as planned, unless anyone wants to request that we keep it open.

csiefer2 commented 6 years ago

The errors that we believe were associated with the original GCC 7.2.0 segfault error have been corrected. Are you still diffing on 5.2.0 and 4.9.2?

jrood-nrel commented 6 years ago

@csiefer2 Yes, the GCC 7.2.0 problems have went away. However, if I enable OpenMP and use OMP_NUM_THREADS=1, then I will still get different results than when OpenMP is fully disabled in Trilinos. This used to not be the case. If a different code path is being used when OpenMP is enabled now, it's fine. There just hasn't been much confirmation that such behavior is expected. I'm just planning on treating OMP and non-OMP differently when testing now.