geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

floating point exception (clang6) #2211

Closed tjhei closed 6 years ago

tjhei commented 6 years ago

I am getting

Thread 1 "aspect" received signal SIGFPE, Arithmetic exception.
0x0000000001a1f19d in aspect::Assemblers::StokesIncompressibleTerms<2>::execute (this=0x603000556c38, scratch_base=..., data_base=...)
    at ../source/simulator/assemblers/stokes.cc:205
205           for (unsigned int i=0; i<stokes_dofs_per_cell; ++i)

when running solcx in the second nonlinear solve. I am not sure what is going on here.

gassmoeller commented 6 years ago

I hit an assertion in sol_cx and other places with deal.II 9.0, might be related, but is slightly different:

--------------------------------------------------------
An error occurred in line <7264> of file </home/rengas/Software/dealii/include/deal.II/numerics/vector_tools.templates.h> in function
    void dealii::VectorTools::internal::do_integrate_difference(const dealii::hp::MappingCollection<dim, spacedim>&, const DoFHandlerType&, const InVector&, const dealii::Function<spacedim>&, OutVector&, const dealii::hp::QCollection<dim>&, const dealii::VectorTools::NormType&, const dealii::Function<spacedim>*, double) [with int dim = 2; InVector = dealii::TrilinosWrappers::MPI::BlockVector; OutVector = dealii::Vector<float>; DoFHandlerType = dealii::DoFHandler<2, 2>; int spacedim = 2]
The violated condition was: 
    exact_solution.n_components==n_components
Additional information: 
    Dimension 1 not equal to 4.

Stacktrace:
-----------
#0  /home/rengas/Software/deal.II-dev/lib/libdeal_II.g.so.9.0.0-rc0: 
#1  /home/rengas/Software/deal.II-dev/lib/libdeal_II.g.so.9.0.0-rc0: void dealii::VectorTools::integrate_difference<2, dealii::TrilinosWrappers::MPI::BlockVector, dealii::Vector<float>, 2>(dealii::Mapping<2, 2> const&, dealii::DoFHandler<2, 2> const&, dealii::TrilinosWrappers::MPI::BlockVector const&, dealii::Function<2, double> const&, dealii::Vector<float>&, dealii::Quadrature<2> const&, dealii::VectorTools::NormType const&, dealii::Function<2, double> const*, double)
#2  ./libsol_cx_2.so: aspect::InclusionBenchmark::SolCxPostprocessor<2>::execute(dealii::TableHandler&)
#3  ../aspect: aspect::Postprocess::Manager<2>::execute(dealii::TableHandler&)
#4  ../aspect: aspect::Simulator<2>::postprocess()
#5  ../aspect: aspect::Simulator<2>::run()
#6  ../aspect: void run_simulator<2>(std::string const&, bool, bool)
#7  ../aspect: main
--------------------------------------------------------

I have an idea where it is coming from and will fix it; let's see if that solves your issue as well.

gassmoeller commented 6 years ago

The fix for my problem is in #2214, does your error still occur after that fix?

gassmoeller commented 6 years ago

I can reproduce exactly your error message on Ubuntu 14.04 (clang 6 manually installed) and 18.04 (clang 6 installed from the repository). I cannot see what is happening, though. Is there a tester for deal.II with clang 6 and this particular setup? Then we could at least narrow down whether the problem is in ASPECT or deal.II.

gassmoeller commented 6 years ago

This is the full callstack:

[cb45769cb27a:06347] *** Process received signal ***
[cb45769cb27a:06347] Signal: Floating point exception (8)
[cb45769cb27a:06347] Signal code: Invalid floating point operation (7)
[cb45769cb27a:06347] Failing at address: 0x135f71c
[cb45769cb27a:06347] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f0098b0a890]
[cb45769cb27a:06347] [ 1] ../aspect(_ZNK6aspect10Assemblers25StokesIncompressibleTermsILi2EE7executeERNS_8internal8Assembly7Scratch11ScratchBaseILi2EEERNS4_8CopyData12CopyDataBaseILi2EEE+0x3ec)[0x135f71c]
[cb45769cb27a:06347] [ 2] ../aspect(_ZN6aspect9SimulatorILi2EE28local_assemble_stokes_systemERKN6dealii18TriaActiveIteratorINS2_15DoFCellAccessorINS2_10DoFHandlerILi2ELi2EEELb0EEEEERNS_8internal8Assembly7Scratch12StokesSystemILi2EEERNSC_8CopyData12StokesSystemILi2EEE+0x396)[0x128c3c6]
[cb45769cb27a:06347] [ 3] ../aspect(_ZNSt5_BindIFMN6aspect9SimulatorILi2EEEFvRKN6dealii18TriaActiveIteratorINS3_15DoFCellAccessorINS3_10DoFHandlerILi2ELi2EEELb0EEEEERNS0_8internal8Assembly7Scratch12StokesSystemILi2EEERNSD_8CopyData12StokesSystemILi2EEEEPS2_St12_PlaceholderILi1EESP_ILi2EESP_ILi3EEEE6__callIvJRNS3_16FilteredIteratorIS9_EESH_SL_EJLm0ELm1ELm2ELm3EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE+0x95)[0x129e425]
[cb45769cb27a:06347] [ 4] ../aspect(_ZNSt5_BindIFMN6aspect9SimulatorILi2EEEFvRKN6dealii18TriaActiveIteratorINS3_15DoFCellAccessorINS3_10DoFHandlerILi2ELi2EEELb0EEEEERNS0_8internal8Assembly7Scratch12StokesSystemILi2EEERNSD_8CopyData12StokesSystemILi2EEEEPS2_St12_PlaceholderILi1EESP_ILi2EESP_ILi3EEEEclIJRNS3_16FilteredIteratorIS9_EESH_SL_EvEET0_DpOT_+0x51)[0x129dc21]
[cb45769cb27a:06347] [ 5] ../aspect(_ZN6dealii10WorkStream3runISt5_BindIFMN6aspect9SimulatorILi2EEEFvRKNS_18TriaActiveIteratorINS_15DoFCellAccessorINS_10DoFHandlerILi2ELi2EEELb0EEEEERNS3_8internal8Assembly7Scratch12StokesSystemILi2EEERNSF_8CopyData12StokesSystemILi2EEEEPS5_St12_PlaceholderILi1EESR_ILi2EESR_ILi3EEEES2_IFMS5_FvRKSM_ESQ_SS_EENS_16FilteredIteratorISB_EESI_SM_EEvRKT1_RKNS_8identityIS15_E4typeET_T0_RKT2_RKT3_jj+0x107)[0x128cea7]
[cb45769cb27a:06347] [ 6] ../aspect(_ZN6aspect9SimulatorILi2EE22assemble_stokes_systemEv+0x43f)[0x128cbef]
[cb45769cb27a:06347] [ 7] ../aspect(_ZN6aspect9SimulatorILi2EE25assemble_and_solve_stokesEbPd+0xaa)[0x13a80ea]
[cb45769cb27a:06347] [ 8] ../aspect(_ZN6aspect9SimulatorILi2EE34solve_no_advection_iterated_stokesEv+0x7a)[0x13a842a]
[cb45769cb27a:06347] [ 9] ../aspect(_ZN6aspect9SimulatorILi2EE14solve_timestepEv+0x14e)[0x12d7d5e]
[cb45769cb27a:06347] [10] ../aspect(_ZN6aspect9SimulatorILi2EE3runEv+0x300)[0x12d70d0]
[cb45769cb27a:06347] [11] ../aspect(_Z13run_simulatorILi2EEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbb+0xca)[0x10d364a]
[cb45769cb27a:06347] [12] ../aspect(main+0x33a)[0x10d2e3a]
[cb45769cb27a:06347] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f0098728b97]
[cb45769cb27a:06347] [14] ../aspect(_start+0x2a)[0x10627aa]
[cb45769cb27a:06347] *** End of error message ***

Does it tell us anything that the exception is raised from libpthread?

bangerth commented 6 years ago

Demangled, this looks as follows:

[cb45769cb27a:06347] *** Process received signal ***
[cb45769cb27a:06347] Signal: Floating point exception (8)
[cb45769cb27a:06347] Signal code: Invalid floating point operation (7)
[cb45769cb27a:06347] Failing at address: 0x135f71c
[cb45769cb27a:06347] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f0098b0a890]
[cb45769cb27a:06347] [ 1] ../aspect(aspect::Assemblers::StokesIncompressibleTerms<2>::execute(aspect::internal::Assembly::Scratch::ScratchBase<2>&, aspect::internal::Assembly::CopyData::CopyDataBase<2>&) const+0x3ec)[0x135f71c]
[cb45769cb27a:06347] [ 2] ../aspect(aspect::Simulator<2>::local_assemble_stokes_system(dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > const&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)+0x396)[0x128c3c6]
[cb45769cb27a:06347] [ 3] ../aspect(void std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > const&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)>::__call<void, dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > >&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&, 0ul, 1ul, 2ul, 3ul>(std::tuple<dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > >&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul>)+0x95)[0x129e425]
[cb45769cb27a:06347] [ 4] ../aspect(void std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > const&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)>::operator()<dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > >&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&, void>(dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > >&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)+0x51)[0x129dc21]
[cb45769cb27a:06347] [ 5] ../aspect(void dealii::WorkStream::run<std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > const&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)>, std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>))(aspect::internal::Assembly::CopyData::StokesSystem<2> const&)>, dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > >, aspect::internal::Assembly::Scratch::StokesSystem<2>, aspect::internal::Assembly::CopyData::StokesSystem<2> >(dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > > const&, dealii::identity<dealii::FilteredIterator<dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > > >::type const&, std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(dealii::TriaActiveIterator<dealii::DoFCellAccessor<dealii::DoFHandler<2, 2>, false> > const&, aspect::internal::Assembly::Scratch::StokesSystem<2>&, aspect::internal::Assembly::CopyData::StokesSystem<2>&)>, std::_Bind<void (aspect::Simulator<2>::*(aspect::Simulator<2>*, std::_Placeholder<1>))(aspect::internal::Assembly::CopyData::StokesSystem<2> const&)>, aspect::internal::Assembly::Scratch::StokesSystem<2> const&, aspect::internal::Assembly::CopyData::StokesSystem<2> const&, unsigned int, unsigned int)+0x107)[0x128cea7]
[cb45769cb27a:06347] [ 6] ../aspect(aspect::Simulator<2>::assemble_stokes_system()+0x43f)[0x128cbef]
[cb45769cb27a:06347] [ 7] ../aspect(aspect::Simulator<2>::assemble_and_solve_stokes(bool, double*)+0xaa)[0x13a80ea]
[cb45769cb27a:06347] [ 8] ../aspect(aspect::Simulator<2>::solve_no_advection_iterated_stokes()+0x7a)[0x13a842a]
[cb45769cb27a:06347] [ 9] ../aspect(aspect::Simulator<2>::solve_timestep()+0x14e)[0x12d7d5e]
[cb45769cb27a:06347] [10] ../aspect(aspect::Simulator<2>::run()+0x300)[0x12d70d0]
[cb45769cb27a:06347] [11] ../aspect(void run_simulator<2>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool)+0xca)[0x10d364a]
[cb45769cb27a:06347] [12] ../aspect(main+0x33a)[0x10d2e3a]

I don't think that the problem is inside libpthread, but only that the pthread library has installed a signal handler that cleans up the thread before it aborts the program.
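The thread does not say how the trace above was demangled; piping it through `c++filt` is the usual route. As a minimal sketch, the same can be done programmatically with `abi::__cxa_demangle` from `<cxxabi.h>`, which is available with GCC and Clang (Itanium C++ ABI). The helper name `demangle` is hypothetical and only for illustration.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char *mangled)
{
  int   status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with malloc.
  char *raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || raw == nullptr)
    return std::string(mangled);

  std::string result(raw);
  std::free(raw); // the caller owns the buffer returned by __cxa_demangle
  return result;
}
```

For example, `demangle("_ZN6aspect9SimulatorILi2EE3runEv")` yields `aspect::Simulator<2>::run()`, matching frame [10] in the two traces.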

bangerth commented 6 years ago

In other words, I assume that the problem happens inside the execute() function. Can you narrow things down by putting printf statements in there?

gassmoeller commented 6 years ago

@bangerth: Timo posted the line that gdb shows, but that makes no sense, because there is no floating point operation there. So it must happen at some random point before. Would printf check for FPEs?

Some more information:

So does our test just incorrectly assume that FPEs would work on this system? Then #2225 would be the solution. Should we just go with that and make the release? I do not see a reason why clang6 should suddenly find errors that other compilers did not find before.

tjhei commented 6 years ago

> I do not see a reason why clang6 should suddenly find errors that other compilers did not find before.

I assume clang is more aggressive in optimizing the code in debug mode. Without FP exceptions, it is of course legal to optimize something like

const double bdf2_factor = (use_bdf2_scheme)? ((2*time_step + old_time_step) /
                                                           (time_step + old_time_step)) : 1.0;

and always do the divide. I don't think we have a bug in our code.
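To make the effect concrete, here is a minimal sketch of what "always do the divide" means. It is not ASPECT code and not what clang literally emits: `eager_bdf2_factor` and `division_raised_invalid_flag` are hypothetical names, the `volatile` qualifiers are only there to keep the division at run time, and the case of both step sizes being zero is an assumed trigger. Instead of trapping, it inspects the sticky `FE_INVALID` flag through `<cfenv>`; with trapping enabled, the same division would raise SIGFPE even though the result is discarded.

```cpp
#include <cfenv>

// Hypothetical stand-in for the optimized code: the quotient is computed
// unconditionally and the bool only selects which value is returned.
double eager_bdf2_factor(bool use_bdf2_scheme, double time_step, double old_time_step)
{
  // volatile forces the loads, the division and the store to happen at run time
  volatile double numerator   = 2 * time_step + old_time_step;
  volatile double denominator = time_step + old_time_step;
  volatile double quotient    = numerator / denominator;
  return use_bdf2_scheme ? quotient : 1.0;
}

// Reports whether evaluating the factor raised the IEEE "invalid" flag.
bool division_raised_invalid_flag(bool use_bdf2_scheme, double time_step, double old_time_step)
{
  std::feclearexcept(FE_ALL_EXCEPT);
  const double factor = eager_bdf2_factor(use_bdf2_scheme, time_step, old_time_step);
  (void)factor; // only the floating point flags matter here
  return std::fetestexcept(FE_INVALID) != 0;
}
```

With `use_bdf2_scheme == false` and both step sizes zero, the returned factor is still 1.0, but the 0/0 division has been executed and the flag is set. That is harmless without traps and fatal with them.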

> Should we just go with that and make the release?

Hardcoding a check like this for a specific compiler version is not ideal. I would prefer to extend the check. Give me a plane ride to see if I can figure this out. ;-)

bangerth commented 6 years ago

Let's let @tjhei have his plane ride :-)

@gassmoeller -- no, printf doesn't fix the issue of course. I just meant this as a way to figure out in which line the problem happens -- put some printfs throughout the function and see which ones get executed before the exception happens. printf is an expensive and non-inlined function, so the compiler will generally not move instructions across these calls. That means that if a particular printf shows its output, the offending instruction must indeed be in the lines that follow.
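A minimal sketch of that bracketing technique, assuming a made-up function rather than the real `execute()`: the `CHECKPOINT` macro and `suspect_function` are hypothetical names. Each checkpoint prints file and line to stderr and flushes, so the last line printed before the crash bounds where the offending instruction can be.

```cpp
#include <cstdio>

// Hypothetical helper: print the current location and flush immediately,
// so the message is not lost if the process dies right afterwards.
#define CHECKPOINT()                                              \
  do                                                              \
    {                                                             \
      std::fprintf(stderr, "reached %s:%d\n", __FILE__, __LINE__); \
      std::fflush(stderr);                                        \
    }                                                             \
  while (false)

// Hypothetical function under suspicion, bracketed with checkpoints.
double suspect_function(bool use_bdf2_scheme, double time_step, double old_time_step)
{
  CHECKPOINT();
  const double sum = time_step + old_time_step;
  CHECKPOINT();
  const double factor = use_bdf2_scheme ? (2 * time_step + old_time_step) / sum : 1.0;
  CHECKPOINT();
  return factor;
}
```

If the second checkpoint prints but the third does not, the exception was raised while computing `factor`.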

tjhei commented 6 years ago

So, my guess was correct: clang is optimizing around simple bool checks and eagerly evaluates expressions that can raise floating point exceptions, like the bdf2_factor above. I can work around this by moving it into a separate function, for example. Note that I am hitting similar problems in other functions...
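A sketch of what such a workaround could look like, under assumptions: the function names are hypothetical, and the `noinline` attribute (GCC/Clang specific) is my addition to discourage the compiler from folding the division back into the caller. The thread does not show the code that was actually tried, and this is not a guarantee against speculation.

```cpp
// noinline is a GCC/Clang extension; on other compilers the macro is empty.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NOINLINE
#endif

// Hypothetical helper holding the division that must not run speculatively.
SKETCH_NOINLINE
double compute_bdf2_quotient(double time_step, double old_time_step)
{
  return (2 * time_step + old_time_step) / (time_step + old_time_step);
}

// The division is only reachable through a call on the BDF2 branch.
double bdf2_factor(bool use_bdf2_scheme, double time_step, double old_time_step)
{
  if (!use_bdf2_scheme)
    return 1.0;
  return compute_bdf2_quotient(time_step, old_time_step);
}
```

The cost is the one raised later in the thread: every guarded expression of this kind would need the same treatment, which obscures the source for the benefit of a single compiler version.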

I tried extending our FPE check to contain code similar to this, but I haven't succeeded in making it fail the check.

So what do we do? Try to disable these clang optimizations? Rewrite the functions to be safe? Blacklist all clang 6.0+ for FPEs?

bangerth commented 6 years ago

That's clearly a compiler bug then. I vote to just disable FPEs for clang 6, as already implemented in #2225. This has the advantage that (i) we don't further obfuscate our source code, and (ii) we don't penalize everyone who is using a different compiler. The number of people who would be impacted by #2225 is likely quite small, and that's useful.

tjhei commented 6 years ago

While not "fixed", let's close this with #2225 as the solution.