Fibonacci: segfaults and scalability issues

I've been measuring the overhead of various multithreading runtimes. HPX seems to have a couple of issues with a naive fibonacci.

Expected Behavior

Fibonacci examples complete in a reasonable time

Actual Behavior

HPX either segfaults or takes a very long time for no reason

Steps to Reproduce the Problem

Problem 1: Sudden spike in time required to compute fibonacci.

Compile all the quickstart fibonacci examples
Run bin/fibonacci_one --n-value 22
Run bin/fibonacci_one --n-value 23
The time required for fibonacci_direct changes from 0.218 seconds to 8.263 s

DeepinScreenshot_select-area_20191118184749

Problem 2: Segfaults

Run fibonacci_await with any n and it segfaults

DeepinScreenshot_select-area_20191118185011

Problem 3: a rare segfault will make all the subsequent call take very long-time

I manage to segfault the fibonacci binary with a value of 30 and I couldn't reclaim control of my console even with Ctrl+C, following that the performance of everything was very degraded.

Before a segfault on fibonacci 30

DeepinScreenshot_select-area_20191118185907

After a segfault on fibonacci 30

Run bin/fibonacci --n-value 23, it takes 12s

DeepinScreenshot_select-area_20191118185630

Specifications

... Please describe your environment

HPX Version: master - git: fd67ac0f86
Platform (compiler, OS): GCC, Linux

Core library:
  HPX_WITH_ACTION_BASE_COMPATIBILITY=ON
  HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
  HPX_WITH_APEX=OFF
  HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
  HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
  HPX_WITH_COROUTINE_COUNTERS=OFF
  HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
  HPX_WITH_DEPRECATION_WARNINGS=ON
  HPX_WITH_GOOGLE_PERFTOOLS=OFF
  HPX_WITH_HPXMP=OFF
  HPX_WITH_IO_COUNTERS=ON
  HPX_WITH_IO_POOL=ON
  HPX_WITH_ITTNOTIFY=OFF
  HPX_WITH_LOGGING=ON
  HPX_WITH_MORE_THAN_64_THREADS=OFF
  HPX_WITH_NATIVE_TLS=ON
  HPX_WITH_NETWORKING=ON
  HPX_WITH_PAPI=OFF
  HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
  HPX_WITH_PARCELPORT_LIBFABRIC=OFF
  HPX_WITH_PARCELPORT_MPI=OFF
  HPX_WITH_PARCELPORT_MPI_MULTITHREADED=OFF
  HPX_WITH_PARCELPORT_TCP=ON
  HPX_WITH_PARCELPORT_VERBS=OFF
  HPX_WITH_PARCEL_COALESCING=ON
  HPX_WITH_PARCEL_PROFILING=OFF
  HPX_WITH_REGISTER_THREAD_COMPATIBILITY=ON
  HPX_WITH_SANITIZERS=OFF
  HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
  HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
  HPX_WITH_STACKTRACES=ON
  HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
  HPX_WITH_TESTS_DEBUG_LOG=OFF
  HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
  HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
  HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
  HPX_WITH_THREAD_DEBUG_INFO=OFF
  HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
  HPX_WITH_THREAD_GUARD_PAGE=ON
  HPX_WITH_THREAD_IDLE_RATES=OFF
  HPX_WITH_THREAD_LOCAL_STORAGE=OFF
  HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
  HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
  HPX_WITH_THREAD_STACK_MMAP=ON
  HPX_WITH_THREAD_STEALING_COUNTS=OFF
  HPX_WITH_THREAD_TARGET_ADDRESS=OFF
  HPX_WITH_TIMER_POOL=ON
  HPX_WITH_TUPLE_RVALUE_SWAP=ON
  HPX_WITH_VALGRIND=OFF
  HPX_WITH_VERIFY_LOCKS=OFF
  HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF
  HPX_WITH_VERIFY_LOCKS_GLOBALLY=OFF

  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=tcmalloc
  HPX_PREFIX (configured)=/usr/local
  HPX_PREFIX=/home/beta/Programming/Parallelism/hpx/build

Module algorithms:
  HPX_ALGORITHMS_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_ALGORITHMS_WITH_DEPRECATION_WARNINGS=OFF
  HPX_ALGORITHMS_WITH_TESTS=OFF

Module allocator_support:
  HPX_ALLOCATOR_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_ALLOCATOR_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
  HPX_ALLOCATOR_SUPPORT_WITH_TESTS=OFF

Module assertion:
  HPX_ASSERTION_WITH_COMPATIBILITY_HEADERS=ON
  HPX_ASSERTION_WITH_DEPRECATION_WARNINGS=ON
  HPX_ASSERTION_WITH_TESTS=OFF

Module basic_execution:
  HPX_BASIC_EXECUTION_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_BASIC_EXECUTION_WITH_DEPRECATION_WARNINGS=OFF
  HPX_BASIC_EXECUTION_WITH_TESTS=OFF

Module cache:
  HPX_CACHE_WITH_COMPATIBILITY_HEADERS=ON
  HPX_CACHE_WITH_DEPRECATION_WARNINGS=ON
  HPX_CACHE_WITH_TESTS=OFF

Module checkpoint:
  HPX_CHECKPOINT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_CHECKPOINT_WITH_DEPRECATION_WARNINGS=ON
  HPX_CHECKPOINT_WITH_TESTS=OFF

Module collectives:
  HPX_COLLECTIVES_WITH_COMPATIBILITY_HEADERS=ON
  HPX_COLLECTIVES_WITH_DEPRECATION_WARNINGS=ON
  HPX_COLLECTIVES_WITH_TESTS=OFF

Module compute:
  HPX_COMPUTE_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_COMPUTE_WITH_DEPRECATION_WARNINGS=ON
  HPX_COMPUTE_WITH_TESTS=OFF

Module compute_cuda:
  HPX_COMPUTE_CUDA_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_COMPUTE_CUDA_WITH_DEPRECATION_WARNINGS=ON
  HPX_COMPUTE_CUDA_WITH_TESTS=OFF

Module concepts:
  HPX_CONCEPTS_WITH_COMPATIBILITY_HEADERS=ON
  HPX_CONCEPTS_WITH_DEPRECATION_WARNINGS=ON
  HPX_CONCEPTS_WITH_TESTS=OFF

Module concurrency:
  HPX_CONCURRENCY_WITH_COMPATIBILITY_HEADERS=ON
  HPX_CONCURRENCY_WITH_DEPRECATION_WARNINGS=ON
  HPX_CONCURRENCY_WITH_TESTS=OFF

Module config:
  HPX_CONFIG_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_CONFIG_WITH_DEPRECATION_WARNINGS=OFF
  HPX_CONFIG_WITH_TESTS=OFF

Module coroutines:
  HPX_COROUTINES_WITH_COMPATIBILITY_HEADERS=ON
  HPX_COROUTINES_WITH_DEPRECATION_WARNINGS=ON
  HPX_COROUTINES_WITH_TESTS=OFF

Module datastructures:
  HPX_DATASTRUCTURES_WITH_ADAPT_STD_TUPLE=ON
  HPX_DATASTRUCTURES_WITH_COMPATIBILITY_HEADERS=ON
  HPX_DATASTRUCTURES_WITH_DEPRECATION_WARNINGS=ON
  HPX_DATASTRUCTURES_WITH_TESTS=OFF

Module debugging:
  HPX_DEBUGGING_WITH_COMPATIBILITY_HEADERS=ON
  HPX_DEBUGGING_WITH_DEPRECATION_WARNINGS=ON
  HPX_DEBUGGING_WITH_TESTS=OFF

Module errors:
  HPX_ERRORS_WITH_COMPATIBILITY_HEADERS=ON
  HPX_ERRORS_WITH_DEPRECATION_WARNINGS=ON
  HPX_ERRORS_WITH_TESTS=OFF

Module execution:
  HPX_EXECUTION_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_EXECUTION_WITH_DEPRECATION_WARNINGS=OFF
  HPX_EXECUTION_WITH_TESTS=OFF

Module filesystem:
  HPX_FILESYSTEM_WITH_BOOST_FILESYSTEM_COMPATIBILITY=OFF
  HPX_FILESYSTEM_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_FILESYSTEM_WITH_DEPRECATION_WARNINGS=OFF
  HPX_FILESYSTEM_WITH_TESTS=OFF

Module format:
  HPX_FORMAT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_FORMAT_WITH_DEPRECATION_WARNINGS=ON
  HPX_FORMAT_WITH_TESTS=OFF

Module functional:
  HPX_FUNCTIONAL_WITH_COMPATIBILITY_HEADERS=ON
  HPX_FUNCTIONAL_WITH_DEPRECATION_WARNINGS=ON
  HPX_FUNCTIONAL_WITH_TESTS=OFF

Module hardware:
  HPX_HARDWARE_WITH_COMPATIBILITY_HEADERS=ON
  HPX_HARDWARE_WITH_DEPRECATION_WARNINGS=ON
  HPX_HARDWARE_WITH_TESTS=OFF

Module hashing:
  HPX_HASHING_WITH_COMPATIBILITY_HEADERS=ON
  HPX_HASHING_WITH_DEPRECATION_WARNINGS=ON
  HPX_HASHING_WITH_TESTS=OFF

Module iterator_support:
  HPX_ITERATOR_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_ITERATOR_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
  HPX_ITERATOR_SUPPORT_WITH_TESTS=OFF

Module logging:
  HPX_LOGGING_WITH_COMPATIBILITY_HEADERS=ON
  HPX_LOGGING_WITH_DEPRECATION_WARNINGS=ON
  HPX_LOGGING_WITH_TESTS=OFF

Module memory:
  HPX_MEMORY_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_MEMORY_WITH_DEPRECATION_WARNINGS=ON
  HPX_MEMORY_WITH_TESTS=OFF

Module plugin:
  HPX_PLUGIN_WITH_COMPATIBILITY_HEADERS=ON
  HPX_PLUGIN_WITH_DEPRECATION_WARNINGS=ON
  HPX_PLUGIN_WITH_TESTS=OFF

Module preprocessor:
  HPX_PREPROCESSOR_WITH_COMPATIBILITY_HEADERS=ON
  HPX_PREPROCESSOR_WITH_DEPRECATION_WARNINGS=ON
  HPX_PREPROCESSOR_WITH_TESTS=OFF

Module program_options:
  HPX_PROGRAM_OPTIONS_WITH_BOOST_PROGRAM_OPTIONS_COMPATIBILITY=ON
  HPX_PROGRAM_OPTIONS_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_PROGRAM_OPTIONS_WITH_DEPRECATION_WARNINGS=OFF
  HPX_PROGRAM_OPTIONS_WITH_TESTS=OFF

Module resiliency:
  HPX_RESILIENCY_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_RESILIENCY_WITH_DEPRECATION_WARNINGS=OFF
  HPX_RESILIENCY_WITH_TESTS=OFF

Module resource_partitioner:
  HPX_RESOURCE_PARTITIONER_WITH_COMPATIBILITY_HEADERS=ON
  HPX_RESOURCE_PARTITIONER_WITH_DEPRECATION_WARNINGS=ON
  HPX_RESOURCE_PARTITIONER_WITH_TESTS=OFF

Module segmented_algorithms:
  HPX_SEGMENTED_ALGORITHMS_WITH_COMPATIBILITY_HEADERS=OFF
  HPX_SEGMENTED_ALGORITHMS_WITH_DEPRECATION_WARNINGS=OFF
  HPX_SEGMENTED_ALGORITHMS_WITH_TESTS=OFF

Module serialization:
  HPX_SERIALIZATION_WITH_BOOST_TYPES=ON
  HPX_SERIALIZATION_WITH_COMPATIBILITY_HEADERS=ON
  HPX_SERIALIZATION_WITH_DEPRECATION_WARNINGS=ON
  HPX_SERIALIZATION_WITH_TESTS=OFF

Module statistics:
  HPX_STATISTICS_WITH_COMPATIBILITY_HEADERS=ON
  HPX_STATISTICS_WITH_DEPRECATION_WARNINGS=ON
  HPX_STATISTICS_WITH_TESTS=OFF

Module testing:
  HPX_TESTING_WITH_COMPATIBILITY_HEADERS=ON
  HPX_TESTING_WITH_DEPRECATION_WARNINGS=ON
  HPX_TESTING_WITH_TESTS=OFF

Module thread_support:
  HPX_THREAD_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_THREAD_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
  HPX_THREAD_SUPPORT_WITH_TESTS=OFF

Module threadmanager:
  HPX_THREADMANAGER_WITH_COMPATIBILITY_HEADERS=ON
  HPX_THREADMANAGER_WITH_DEPRECATION_WARNINGS=ON
  HPX_THREADMANAGER_WITH_TESTS=OFF

Module timing:
  HPX_TIMING_WITH_COMPATIBILITY_HEADERS=ON
  HPX_TIMING_WITH_DEPRECATION_WARNINGS=ON
  HPX_TIMING_WITH_TESTS=OFF

Module topology:
  HPX_TOPOLOGY_WITH_COMPATIBILITY_HEADERS=ON
  HPX_TOPOLOGY_WITH_DEPRECATION_WARNINGS=ON
  HPX_TOPOLOGY_WITH_TESTS=OFF

Module type_support:
  HPX_TYPE_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
  HPX_TYPE_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
  HPX_TYPE_SUPPORT_WITH_TESTS=OFF

Module util:
  HPX_UTIL_WITH_COMPATIBILITY_HEADERS=ON
  HPX_UTIL_WITH_DEPRECATION_WARNINGS=ON
  HPX_UTIL_WITH_TESTS=OFF

{version}: V1.4.0-trunk (AGAS: V3.0), Git: fd67ac0f86
{boost}: V1.71.0
{build-type}: release
{date}: Nov 18 2019 14:04:57
{platform}: linux
{compiler}: GNU C++ version 9.2.0
{stdlib}: GNU libstdc++ version 20190812

Thanks for reporting! However I don't think that the trivial Fibonacci implementation is a good benchmark for any runtime system... what is your real usecase?

It's a good benchmark to measure the overhead of runtimes.

I'm currently building my own runtime for the Nim programming language with a focus on very-fine grained parallelism. My goal is to expose parallelism to users even for tasks in the 1000 cycles as while we get more cores the minimum grain size is an issue.

Given that fibonacci is a benchmark with only a single instruction, it's actually a pure overhead benchmark to compare runtimes on.

For reference on fibonacci(40) on my machine:

GCC OpenMP doesn't complete (due to the absence of load balancing and the use of a global queue protected by a single lock)
LLVM OpenMP takes 2.1 sec
Intel TBB takes either 600 msec or 1 sec depending if we use a closures or a class approach

For reference among the less known C and C++ runtimes

Staccato takes 180 ms (Paper is paywalled unfortunately: Staccato: Cache-Aware Work-Stealing Task Scheduler for Shared-Memory System: http://dx.doi.org/10.1007/978-3-319-95171-3_8)
Fibril takes 130 ms (Paper: A Practical Solution to the Cactus Stack Problem - http://chaoran.me/assets/pdf/ws-spaa16.pdf)

Both achieve those impressively low overheads by taking great attention to cache and memory locality.

On my own WIP runtime I also achieve 180 ms at the moment, however as my runtime is message/channels based instead of atomics/shared-memory based and I want portability across OSes and CPU arch, the main optimization in fibril is not applicable as it relies on both Linux and the x86-64 calling convention (but may be emulable via setjmp/longjmp).

My runtime is being developed here: https://github.com/mratsim/weave

As HPX / ParalleX is from the documentation "message-driven" I was interested in how it fares on the fibonacci overhead front.

As HPX launches a new 'thread' for each invocation of the simple Fibonacci (this includes a stack allocation and a full future instantiation) implementation it will compare very badly against anything that does not do that. The overhead of one task/thread in HPX is about 1 microsecond per task, you can measure that easily using the future-overheads benchmark (https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/future_overhead.cpp).

BTW, if all you do is to run fib(40) sequentially it will most likely not take longer than a couple of milliseconds - so really what's the point...

BTW, if all you do is to run fib(40) sequentially it will most likely not take longer than a couple of milliseconds - so really what's the point...

The point is to get an idea of the worst case scenario of scheduling overhead for tasks that are too fine-grained for the runtime system.

For some algorithms especially the recursive ones, for example a tree algorithm, a user may not or maybe cannot size the tasks into sufficiently large chunks to amortize the runtime overhead because by the time it is known that we are at a leaf computation that is too small to be efficiently scheduled by traditional runtime, it is too late.

This is exacerbated by the growing number of cores available in current systems, which requires developers to create more tasks. A runtime with low overhead will greatly assist that by allowing developers to confidently create new tasks and parallelism opportunities knowing that the runtime will handle it efficiently instead of choking.

Now, optimizing for such scenarios might not be one of HPX priorities, just like GCC OpenMP also cannot handle fibonacci. But given how many of the quickstart examples are based on naive fibonacci, I think it would be valuable to address that overhead or at least mention in the doc that fibonacci is an unrealistic workload and that HPX recommends a minimum task size of x cycles to amortize scheduling overhead (for example TBB mentions 10000 cycles).

@mratsim Sure, I understand where you're coming from. The main question that interests me in the context of this discussion however would be where other runtimes 'cut corners' by operating under well defined constraints that allow for optimizations. We usually can't apply those optimizations to HPX because of its general purpose nature (at least not in the general case). OpenMP (before V5) for instance can rely on the fact that no task will exit the scope it was created in (which is the case for your Fibonacci example). This alone allows for significant optimizations. There are other implicit assumptions that people rely on while implementing their runtimes that give some (possibly non-trivial) benefit.

All of this said - we'd be more than happy to accept pull requests for special case scheduler implementations that a user could explicitly ask for in certain, well defined situations.

Yes of course, for example Staccato requires at the moment to declare the computation as a task and also indicate at compile-time how many tasks an async computation can spawn (for example fibonacci is 1 if we only fork(n-1) or 2 if we fork(n-2). This is used with great effect to be able to use a StackAllocator. More special case: fibril requires x86-64 calling convention + assembly to implement resumable tasks from a cactus stack and also mmap/madvise which doesn't work on Windows (but can be worked around probably).

And obviously both assume shared-memory.

Now regarding my own runtime, there are 2 keys part:

Batching stealed tasks so that the overhead is amortized on many tasks. This is covered in the following thesis "Embracing Explicit Communication in Work-Stealing Runtime Systems" by Andreas Prell (@aprell). C implementation at https://github.com/aprell/tasking-2.0, the important part is an adaptative switch between steal-one and steal-half of the tasks: https://github.com/aprell/tasking-2.0/blob/303ce3e09a7c118241b9d5482219e3e3b848212e/src/runtime.c#L791-L828.
Avoiding allocations, either by caching mechanism or using alloca for nested futures until it is actually required to allocate them because they are about to escape their scopes (for example they are being stolen by another thread).

I'm not aware of any other implicit tradeoffs in my own runtime though the new scheduler from Intel/Julia Partr does raise some intriguing points at 12:30 of this presentation (https://youtu.be/YdiZa0Y3F3c?t=751) and in their forum (https://discourse.julialang.org/t/question-regarding-julias-planned-approach-to-composable-safe-and-easy-multithreading/17692/4)

Thanks for the links!

You're welcome, I actually have plenty more here but I fear there is no enough time in a year to digest them all: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md

I was re-reading on old comparison of multithreading runtimes from Sandia National Laboratories, a US-led scientific program, at https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf

They had to remove HPX (and GCC OpenMP and Nemesis QThreads) to analyze the other runtimes (Cilk, Intel OpenMP and Intel TBB). It would be nice to address that. DeepinScreenshot_select-area_20191204015635

Let's work on improving these particular use cases in HPX first. Then I'd be more than happy to have those comparisons done again. That comparison was made a long time ago, I think we can do much better nowadays.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

STEllAR-GROUP / hpx