bangerth / spec-cpuv8-sampleflow-official

A copy of the official SPEC repository that holds the 747-sampleflow benchmark.

FEValues::initialize() creates lots of lock contention. #1

Closed: bangerth closed this issue 1 month ago

bangerth commented 2 months ago

John reports that there are many, many calls to futex and sched_yield coming from here:

futex(addr, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 > /usr/lib64/libc.so.6(__pthread_once_slow+0x110) [0x8eec0]
 > exe(std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) [clone .constprop.0]+0xa7) [0x799777]
 > exe(dealii::Threads::Task<std::unique_ptr<dealii::FiniteElement<2, 2>::InternalDataBase, std::default_delete<dealii::FiniteElement<2, 2>::InternalDataBase> > >::Task(std::function<std::unique_ptr<dealii::FiniteElement<2, 2>::InternalDataBase, std::default_delete<dealii::FiniteElement<2, 2>::InternalDataBase> > ()> const&) [clone .lto_priv.0]+0x3bd) [0x54c0cd]
 > exe(dealii::FEValues<2, 2>::initialize(dealii::UpdateFlags)+0xa8) [0x54d998]
 > exe(dealii::FEValues<2, 2>::FEValues(dealii::Mapping<2, 2> const&, dealii::FiniteElement<2, 2> const&, dealii::Quadrature<2> const&, dealii::UpdateFlags) [clone .constprop.0]+0x52) [0x77c962]
 > exe(dealii::Functions::FEFieldFunction<2, dealii::Vector<double>, 2>::vector_value(dealii::Point<2, double> const&, dealii::Vector<double>&) const+0x179) [0x700519]
 > exe(std::_Function_handler<double (dealii::Point<3, double> const&), ForwardSimulator::PoissonSolver<2>::interpolate_to_finer_mesh(dealii::Vector<double> const&)::{lambda(dealii::Point<3, double> const&)#1}>::_M_invoke(std::_Any_data const&, dealii::Point<3, double> const&)+0xfa) [0x71f7da]

The code in question looks like this:

  // then get objects into which the FE and the Mapping can store
  // intermediate data used across calls to reinit. we can do this in parallel
  Threads::Task<
    std::unique_ptr<typename FiniteElement<dim, spacedim>::InternalDataBase>>
    fe_get_data = Threads::new_task([&]() {
      return this->fe->get_data(flags,
                                *this->mapping,
                                quadrature,
                                this->finite_element_output);
    });

  Threads::Task<
    std::unique_ptr<typename Mapping<dim, spacedim>::InternalDataBase>>
    mapping_get_data;
  if (flags & update_mapping)
    mapping_get_data = Threads::new_task(
      [&]() { return this->mapping->get_data(flags, quadrature); });

  this->update_flags = flags;

  // then collect answers from the two tasks above
  this->fe_data = std::move(fe_get_data.return_value());
  if (flags & update_mapping)
    this->mapping_data = std::move(mapping_get_data.return_value());
  else
    this->mapping_data =
      std::make_unique<typename Mapping<dim, spacedim>::InternalDataBase>();

The reason this acquires mutex locks so often is that we are not actually using task-based parallelism in this benchmark (other than the thread pool in the sampler), yet we still end up here:

    Task(const std::function<RT()> &function_object)
    {
      if (MultithreadInfo::n_threads() > 1)
        {
#ifdef DEAL_II_WITH_TBB
...
#else
          // If no threading library is supported, just fall back onto C++11
          // facilities. The problem with this is that the standard does
          // not actually say what std::async should do. The first
          // argument to that function can be std::launch::async or
          // std::launch::deferred, or both. The *intent* of the standard's
          // authors was probably that if one sets it to
          //   std::launch::async | std::launch::deferred,
          // that the task is run in a thread pool. But at least as of
          // 2021, GCC doesn't do that: It just runs it on a new thread.
          // If one chooses std::launch::deferred, it runs the task on
          // the same thread but only when one calls join() on the task's
          // std::future object. In the former case, this leads to
          // oversubscription, in the latter case to undersubscription of
          // resources. We choose oversubscription here.
          //
          // The issue illustrates why relying on external libraries
          // with task schedulers is the way to go.
          task_data = std::make_shared<TaskData>(
            std::async(std::launch::async | std::launch::deferred, // *** first marked line
                       function_object));
#endif
        }
      else
        {
          // Only one thread allowed. So let the task run to completion
          // and just emplace a 'ready' future.
          //
          // The design of std::promise/std::future is unclear, but it
          // seems that the intent is to obtain the std::future before
          // we set the std::promise. So create the TaskData object at
          // the top and then run the task and set the returned
          // value. Since everything here happens sequentially, it
          // really doesn't matter in which order all of this is
          // happening.
          std::promise<RT> promise;
          task_data = std::make_shared<TaskData>(promise.get_future());
          try
            {
              internal::evaluate_and_set_promise(function_object, promise); // *** second marked line
            }
          catch (...)
            {
              try
                {
                  // store anything thrown in the promise
                  promise.set_exception(std::current_exception());
                }
              catch (...)
                {
                  // set_exception() may throw too. But ignore this on
                  // the task.
                }
            }
        }
    }
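
As an aside, the launch-policy behavior described in the comment inside that constructor is easy to observe with a small standalone program (an illustration only, not part of the benchmark): per that comment, libstdc++ ends up launching a new thread per call when given the combined policy.

  #include <future>
  #include <iostream>
  #include <thread>

  int main()
  {
    // Request async|deferred, i.e. leave the choice to the
    // implementation. Per the comment above, libstdc++ runs the task on
    // a freshly created thread; if it were run deferred instead, the
    // two ids printed below would be identical.
    auto f = std::async(std::launch::async | std::launch::deferred,
                        []() { return std::this_thread::get_id(); });

    std::cout << "caller thread: " << std::this_thread::get_id() << '\n'
              << "task thread:   " << f.get() << '\n';
  }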

I believe the call ends up at the second marked line, in evaluate_and_set_promise(), which itself looks like this:

    template <typename RT, typename Function>
    void
    evaluate_and_set_promise(Function &function, std::promise<RT> &promise)
    {
      promise.set_value(function());
    }
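
Stripped of the deal.II wrappers, that single-threaded branch boils down to the following pattern (a minimal standalone sketch, not code from the benchmark); the _M_set_result()/__pthread_once_slow frames in the trace above would then come from the set_value() call:

  #include <future>
  #include <iostream>

  int main()
  {
    // Everything happens on one thread, yet set_value() still goes
    // through the shared state's machinery (which, per the trace above,
    // involves __pthread_once_slow) that exists for cross-thread
    // synchronization.
    std::promise<int> promise;
    std::future<int>  future = promise.get_future();

    promise.set_value(42);              // synchronization cost is paid here
    std::cout << future.get() << '\n';  // the result is already available
  }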

That said, it could also be the first of the two marked lines -- investigation necessary.

This should be relatively easy to avoid, at least for the benchmark: I just need to not call new_task() here, since it should not be doing anything useful in this context anyway.
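
For reference, one possible shape of such a change (a sketch only, not the actual patch that was committed) would be to call get_data() directly instead of wrapping the calls in Threads::new_task():

  // Sketch: compute the internal data objects sequentially, so that no
  // std::promise/std::future machinery is involved at all.
  this->update_flags = flags;

  this->fe_data = this->fe->get_data(flags,
                                     *this->mapping,
                                     quadrature,
                                     this->finite_element_output);

  if (flags & update_mapping)
    this->mapping_data = this->mapping->get_data(flags, quadrature);
  else
    this->mapping_data =
      std::make_unique<typename Mapping<dim, spacedim>::InternalDataBase>();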

bangerth commented 1 month ago

I was told that the following stack trace corresponds to the issue:

  #0 in __pthread_once_slow
   #1 in std::call_once<void
       at /data1/GCC/gcc-14.2.0/include/c++/14.2.0/mutex:916
   #2 in std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base,
       at /data1/GCC/gcc-14.2.0/include/c++/14.2.0/future:435
   #3 in std::promise<std::unique_ptr<dealii::FiniteElement<2,
       at /data1/GCC/gcc-14.2.0/include/c++/14.2.0/future:1166
   #4 in dealii::Threads::internal::evaluate_and_set_promise<std::unique_ptr<dealii::FiniteElement<2,
       at dealii/include/deal.II/base/thread_management.h:921
   #5 in dealii::Threads::Task<std::unique_ptr<dealii::FiniteElement<2,
       at dealii/include/deal.II/base/thread_management.h:1095
   #6 in dealii::Threads::new_task<std::unique_ptr<dealii::FiniteElement<2,
       at dealii/include/deal.II/base/thread_management.h:1484
   #7 in dealii::Threads::new_task<dealii::FEValues<2,
       at dealii/include/deal.II/base/thread_management.h:1571
   .
   .
   .
   #69 in main
       at mcmc-laplace.cc:1293

So my hunch was right.

bangerth commented 1 month ago

Fixed by 75b36041acb9bb5589d44efa3653b56e8c072943.