STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Race condition with recent HPX #2086

Closed (gentryx closed this issue 8 years ago)

gentryx commented 8 years ago

I'm seeing an invalid free error when running the LibGeoDecomp performance tests with recent HPX commits. The backtrace below was generated with commit ID 6f79ea91120f4c4212b07dea92cf4d6de299facc. The error didn't occur with the same LibGeoDecomp code and the HPX trunk from approx. 4 weeks ago. The invalid free goes away if I sprinkle the performance test with printf, so I assume it's a race condition and not a normal invalid free.

I can provide more details if necessary.

Error:

/var/tmp/portage/dev-util/google-perftools-2.0-r2/work/gperftools-2.0/src/tcmalloc.cc:289] Attempt to free invalid pointer 0x7faf5b674970 

Backtrace:

#0  0x00007ffff2a572e5 in raise () from /lib64/libc.so.6
#1  0x00007ffff2a5875b in abort () from /lib64/libc.so.6
#2  0x00007ffff4a3eec4 in tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem) () from /usr/lib64/libtcmalloc_minimal.so.4
#3  0x00007ffff4a3b48c in (anonymous namespace)::InvalidFree(void*) () from /usr/lib64/libtcmalloc_minimal.so.4
#4  0x00007ffff4a4d150 in tc_delete () from /usr/lib64/libtcmalloc_minimal.so.4
#5  0x00007ffff716abf0 in hpx::util::runtime_configuration::get_os_thread_count() const () from /home/inf3/gentryx/test_build_libgeodecomp_intel_mic/install/lib/libhpx.so.0
#6  0x0000000000789846 in std::vector<hpx::util::tuple<unsigned long, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long>, std::allocator<hpx::util> > hpx::parallel::util::detail::get_bulk_iteration_shape_idx<hpx::parallel::v1::parallel_execution_policy const&, hpx::lcos::future<void>, hpx::parallel::util::detail::algorithm_result<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::type hpx::parallel::v1::detail::for_each_n<boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::parallel<hpx::parallel::v1::parallel_execution_policy const&, void LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}, hpx::parallel::util::projection_identity>(hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long, void 
LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}&&, hpx::parallel::util::projection_identity&&)::{lambda(unsigned long, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long)#1}&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, int>(hpx::parallel::util::detail::algorithm_result<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::type, hpx::parallel::util::detail::algorithm_result<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::type hpx::parallel::v1::detail::for_each_n<boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::parallel<hpx::parallel::v1::parallel_execution_policy const&, void 
LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}, hpx::parallel::util::projection_identity>(hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long, void LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, 
LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}&&, hpx::parallel::util::projection_identity&&)::{lambda(unsigned long, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long)#1}<hpx::lcos::future<void>, std::allocator<hpx::lcos::future<void> > >&, void LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}&&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>&, unsigned long&, int) ()
#7  0x0000000000789e31 in boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> hpx::parallel::util::detail::foreach_static_partitioner<hpx::parallel::v1::parallel_execution_policy, void>::call<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, hpx::parallel::util::detail::algorithm_result<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::type hpx::parallel::v1::detail::for_each_n<boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::parallel<hpx::parallel::v1::parallel_execution_policy const&, void LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}, hpx::parallel::util::projection_identity>(hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long, void 
LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}&&, hpx::parallel::util::projection_identity&&)::{lambda(unsigned long, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long)#1}>(hpx::parallel::util::detail::algorithm_result<hpx::parallel::v1::parallel_execution_policy const&, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default> >::type, boost::iterators::counting_iterator<unsigned long, boost::iterators::use_default, boost::iterators::use_default>, unsigned long, void LibGeoDecomp::UnstructuredUpdateFunctor<UnstructuredBusyworkCellWithUpdateLineX>::apiWrapper<LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value>(LibGeoDecomp::Region<1> const&, 
LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1> const&, LibGeoDecomp::UnstructuredGrid<UnstructuredBusyworkCellWithUpdateLineX, 1ul, double, 4, 1>*, unsigned int, LibGeoDecomp::UpdateFunctorHelpers::ConcurrencyEnableHPX const&, LibGeoDecomp::APITraits::SelectThreadedUpdate<UnstructuredBusyworkCellWithUpdateLineX, void>::Value const&, LibGeoDecomp::APITraits::TrueType)::{lambda(unsigned long)#1}&&) ()
#8  0x000000000078c24a in LibGeoDecomp::UpdateGroup<UnstructuredBusyworkCellWithUpdateLineX, LibGeoDecomp::HPXPatchLink>::update(int) ()
#9  0x0000000000721d48 in hpx::lcos::local::detail::task_object<void, hpx::util::detail::deferred<void (LibGeoDecomp::UpdateGroup<UnstructuredBusyworkCellWithUpdateLineX, LibGeoDecomp::HPXPatchLink>::*(boost::shared_ptr<LibGeoDecomp::HPXUpdateGroup<UnstructuredBusyworkCellWithUpdateLineX> >&, unsigned long&))(int)>, hpx::lcos::detail::task_base<void> >::do_run() ()
#10 0x00000000006afb0d in hpx::lcos::detail::task_base<void>::run_impl(boost::intrusive_ptr<hpx::lcos::detail::task_base<void> >) ()
#11 0x00000000006b0f54 in hpx::threads::thread_state_enum hpx::util::detail::callable_vtable<hpx::threads::thread_state_enum (hpx::threads::thread_state_ex_enum)>::invoke<hpx::util::detail::bound<hpx::threads::thread_state_enum (*(boost::intrusive_ptr<hpx::lcos::detail::task_base<void> >&&))(boost::intrusive_ptr<hpx::lcos::detail::task_base<void> >)> >(void**, hpx::threads::thread_state_ex_enum&&) ()
#12 0x00007ffff6e5aee6 in hpx::threads::coroutines::detail::coroutine_impl::operator()() () from /home/inf3/gentryx/test_build_libgeodecomp_intel_mic/install/lib/libhpx.so.0
#13 0x00007ffff6d35bf9 in void hpx::threads::coroutines::detail::lx::trampoline<hpx::threads::coroutines::detail::coroutine_impl>(hpx::threads::coroutines::detail::coroutine_impl*) ()

hkaiser commented 8 years ago

Does this happen with other allocators as well?

hkaiser commented 8 years ago

Also, how can I reproduce that?

gentryx commented 8 years ago

Steps to reproduce:

git clone https://github.com/STEllAR-GROUP/hpx.git
mkdir hpx/build

git clone https://github.com/gentryx/libgeodecomp
mkdir libgeodecomp/build

cd hpx/build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/test_build_libgeodecomp_intel_mic/install -DHPX_WITH_PARCELPORT_MPI=true -DHPX_WITH_PARCELPORT_TCP=true -DCMAKE_CXX_COMPILER=g++-5.3.0 ../
make -j20
make install
cd ../../libgeodecomp/build/
cmake -DCMAKE_CXX_COMPILER=g++-5.3.0 -DWITH_CUDA=false -DWITH_HPX=true -DWITH_FORTRAN=true -DHPX_IGNORE_COMPILER_COMPATIBILITY=true -DCMAKE_Fortran_COMPILER=gfortran-5.3.0 -DWITH_CPP14=true -DCMAKE_PREFIX_PATH=$HOME/test_build_libgeodecomp_intel_mic/install ../
make -j16
./src/testbed/hpxperformancetests/hpxperformancetests --hpx:threads=16 0 0

gentryx commented 8 years ago

I did bisect this and git tells me:

2bf3b80ce493de23bd9fa60c056ca99dbe19b587 is the first bad commit
commit 2bf3b80ce493de23bd9fa60c056ca99dbe19b587
Author: Agustin K-ballo Berge <k@fusionfenix.com>
Date:   Tue Mar 29 20:26:50 2016 -0300

    Avoid unnecessary relocking when waiting on a detail::cv, fix missing #include ripples

:040000 040000 8d79ddc61b54f0c7dcd22025d5fff806e7ff9817 a0cba1633367367adc4f3da2e4c6f59cbc909444 M      hpx
:040000 040000 178fd74655d63e2ae8a5bbd71cd92791b6f1b155 84c3b8f3038d8f4be7ced144906f20e78d865d56 M      src
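A note on bisecting an intermittent failure like this one: a single passing run does not prove a commit is good, and a crash via abort() exits with status 134, which `git bisect run` treats as a fatal error rather than as "bad" (it expects 0 for good, 1 to 127 except 125 for bad, and 125 for skip). A minimal sketch of a wrapper that handles both points follows. The name `classify_commit` and the idea of repeating the test are assumptions for illustration, not what was used for the bisect above.

```shell
#!/bin/bash
# classify_commit RUNS CMD [ARGS...]
# Hypothetical helper for `git bisect run`: runs a possibly flaky command
# RUNS times and returns 0 (good) only if every run succeeds.
classify_commit() {
    local runs="$1"
    shift
    local i=0
    while [ "$i" -lt "$runs" ]; do
        # Any failure, including death by SIGABRT (status 134), is
        # collapsed to 1 so that git bisect records the commit as bad.
        if ! "$@" >/dev/null 2>&1; then
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

A bisect step script would rebuild HPX and LibGeoDecomp first (exiting with 125 if the build itself fails, so the commit is skipped) and then call something like `classify_commit 10 ./src/testbed/hpxperformancetests/hpxperformancetests --hpx:threads=16 0 0`. More repetitions lower the chance of marking a bad commit as good, but cannot rule it out.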

gentryx commented 8 years ago

I'll check another allocator in a bit.

hkaiser commented 8 years ago

@K-ballo have you seen this?

K-ballo commented 8 years ago

@hkaiser I've seen it now. I do not have an explanation for the race condition.

hkaiser commented 8 years ago

@gentryx A shot from the hip: could you try whether removing the lock.unlock() helps here: https://github.com/STEllAR-GROUP/hpx/blob/master/src/lcos/local/detail/condition_variable.cpp#L83, please?

gentryx commented 8 years ago

@hkaiser Changing that line did not fix the race, but it did reduce the probability a bit (tests would sometimes succeed, in about 30% of runs).

I also tested different allocators: jemalloc fails with current trunk, while the system allocator succeeds (maybe because it's so slow?).

hkaiser commented 8 years ago

@gentryx What happens if you use top of master with the commit you identified reverted? I still think the changes applied by that commit are correct. They do, however, change overall timings and thread execution sequencing, which may make the race become visible.

hkaiser commented 8 years ago

@gentryx I tried to reproduce this on my system yesterday. Unfortunately everything works as expected :/ I'll keep trying, though.

gentryx commented 8 years ago

I'm closing this since the race seems to have been resolved on master. Thanks!

hkaiser commented 8 years ago

@gentryx The funny thing is that we have not done anything to fix it...

gentryx commented 8 years ago

@hkaiser I assume you also didn't do anything to cause it, so apparently this race condition just went back to lurking beneath the surface. I could only reproduce it on one of our test machines anyway.

gentryx commented 8 years ago

Reopening as this race resurfaced, this time on more machines. Good thing: I can reproduce it on the Marvin nodes. I'll add a script for bug reproduction momentarily.

gentryx commented 8 years ago

[01:43:41]:aschafer@deneb01.hermione:/home/aschafer:0:$ cat test.sh
#!/bin/bash

set -e

cd $HOME
rm -f .cmake/packages/HPX/*

mkdir test_race_lgd_hpx
cd test_race_lgd_hpx
git clone https://github.com/STEllAR-GROUP/hpx.git
git clone https://github.com/stellar-group/libgeodecomp

mkdir hpx/build
cd hpx/build
cmake -DHPX_WITH_EXAMPLES=false -DHPX_WITH_PARCELPORT_MPI=true -DHPX_WITH_PARCELPORT_TCP=true -DCMAKE_INSTALL_PREFIX=$HOME/test_race_lgd_hpx/local_install -DHPX_WITH_CXX14=false -DHPX_WITH_MALLOC=system -DBOOST_ROOT=/opt/boost/1.57.0-release/ ..
make -j40
make install
cd ../../

mkdir libgeodecomp/build
cd libgeodecomp/build
cmake -DWITH_CUDA=false -DWITH_CPP14=true -DWITH_HPX=true -DBOOST_ROOT=/opt/boost/1.57.0-release/ ..
make -j40

for ((i=0;i<10;++i)); do
    ./src/testbed/hpxperformancetests/hpxperformancetests --hpx:threads=32 0 0
done

[01:43:45]:aschafer@deneb01.hermione:/home/aschafer:0:$ sbatch -p marvin --exclusive test.sh
Submitted batch job 1714003
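Because test.sh runs under `set -e`, its loop stops at the first crashing run. To estimate how often the race fires instead (for instance, to compare a failure rate before and after a change), the loop could keep going and count failures. A minimal sketch, assuming a hypothetical helper named `count_failures` that is not part of HPX or LibGeoDecomp:

```shell
#!/bin/bash
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times, keeps going
# after a failed run, and prints how many runs failed.
count_failures() {
    local runs="$1"
    shift
    local failures=0
    local i=0
    while [ "$i" -lt "$runs" ]; do
        # The `if` guard keeps `set -e` from aborting on a failed run.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

In the script above it would replace the final loop, for example as `count_failures 10 ./src/testbed/hpxperformancetests/hpxperformancetests --hpx:threads=32 0 0`. Note that it discards the program's output, so a failing run would need to be repeated by hand to see the backtrace.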

sithhell commented 8 years ago

@gentryx Could you try the fix_wait_all branch (see #2165) and see if that fixes your issue?

gentryx commented 8 years ago

@sithhell I can confirm that my issue is not present on the fix_wait_all branch. :-)

hkaiser commented 8 years ago

I'm certain this fixes not only your issue but also a couple of similarly dubious problems we've been seeing over the last few months.

sithhell commented 8 years ago

This can be closed now as #2165 has been merged.