Add flow kernels API and parallel execution

Add the capability to write flow kernels, both in C++ and Python, and to apply them in parallel.
For Python flow kernels, rely on numba to JIT-compile data and functions, and call them without any CPython call so that the GIL can be released and multithreaded execution can be fully exploited.
Flow kernel C++ API
add a C++ API for a flow kernel (flow_kernel) and support in flow_graph for applying it on nodes using apply_kernel
the basic idea is to create a struct containing type-erased data and functions that the flow_graph manipulates to apply the kernel on all the nodes
flow_kernel::func is the kernel itself, with signature std::function<int(void*)>: it takes the node data as its only argument and returns an error status (0 is OK)
depending on flow_kernel::n_threads, the kernel is applied sequentially or in parallel (multithreaded) on a custom thread pool
flow_kernel::application_order specifies the logical order of kernel application. As of today, only kernel_application_order::ANY and kernel_application_order::BREADTH_UPSTREAM are supported (see below for this enum)
Add a kernel_application_order enum to specify the logical order in which a kernel is applied on a flow graph (a C++ sketch is given after the list below):
ANY: the kernel can be applied in any order (best for parallel execution!)
DEPTH_DOWNSTREAM: follows a reversed depth-first-search traversal (downstream along the rivers)
DEPTH_UPSTREAM: follows a depth-first-search traversal (upstream along the rivers)
BREADTH_DOWNSTREAM: follows a reversed breadth-first-search traversal (downstream along the rivers)
BREADTH_UPSTREAM: follows a breadth-first-search traversal (upstream along the rivers)
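A minimal C++ sketch of these two pieces, with the member names and the func signature taken from the description above (default values, member types, and any other detail are illustrative assumptions):

```cpp
#include <cstddef>
#include <functional>

// Logical order in which a kernel is applied on the flow graph nodes.
enum class kernel_application_order
{
    ANY,
    DEPTH_DOWNSTREAM,
    DEPTH_UPSTREAM,
    BREADTH_DOWNSTREAM,
    BREADTH_UPSTREAM
};

// Type-erased flow kernel manipulated by flow_graph::apply_kernel.
// The actual struct also holds type-erased data and a few more callbacks
// (e.g. to create and fill the per-node data); only the members described
// above are sketched here.
struct flow_kernel
{
    std::function<int(void*)> func;  // the kernel itself; takes the node data,
                                     // returns an error status (0 is OK)
    kernel_application_order application_order = kernel_application_order::ANY;
    std::size_t n_threads = 1;       // 1: sequential, > 1: parallel on the thread pool
    std::size_t min_level_size = 1;  // below this level size, run the level sequentially
    std::size_t min_block_size = 1;  // minimum number of nodes per active thread
};
```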
Add capability to perform a BFS traversal of the graph
add flow_graph_impl::m_bfs_indices and flow_graph_impl::m_bfs_levels members
add flow_graph_impl::compute_bfs_indices_bottomup to compute these properties for the BREADTH_UPSTREAM kernel application order
add flow_graph_impl::bfs_indices and flow_graph_impl::bfs_levels getters
both the indices and the "levels" are stored to keep track of dependencies for parallel execution (a level contains only mutually independent nodes); a sketch of the idea is given below
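A rough sketch of the idea behind these levels, for a single-flow graph where receivers[i] is the unique receiver of node i, base-level nodes are their own receiver, and dfs_order lists nodes from downstream to upstream (all of this is illustrative, not the actual compute_bfs_indices_bottomup implementation):

```cpp
#include <cstddef>
#include <vector>

// Illustrative: the BFS "level" of a node is one more than the level of its
// receiver, so any (transitive) receiver sits in a strictly lower level.
// Nodes sharing a level are therefore independent and can be processed in
// parallel, as long as levels are processed in increasing order.
std::vector<std::size_t> compute_bfs_levels(const std::vector<std::size_t>& receivers,
                                            const std::vector<std::size_t>& dfs_order)
{
    std::vector<std::size_t> levels(receivers.size(), 0);

    for (auto inode : dfs_order)  // downstream to upstream: receiver visited first
    {
        if (receivers[inode] != inode)  // base-level nodes stay at level 0
        {
            levels[inode] = levels[receivers[inode]] + 1;
        }
    }
    return levels;
}
```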
Add capability to perform a "random" traversal of the graph (no particular order)
it follows the storage order for efficiency
it is used when the ANY kernel application order is selected
add flow_graph_impl::m_storage_indices and flow_graph_impl::m_random_levels members, computed once in the constructor (this only requires the grid size to be known)
add flow_graph_impl::storage_indices and flow_graph_impl::random_levels getters
Flow kernel Python API
Add a Python NumbaFlowKernelFactory class in the fastscapelib.flow.numba_kernel submodule to allow declaring a fastscapelib flow kernel from a flow_graph and user-defined data and parameters, relying on numba for just-in-time compilation of the kernel and data to be used by the C++ code:
flow_graph is the flow graph from which receiver indices, counts, etc. are taken when dispatching this information into the node data from the data mapped on the underlying grid
kernel_func is the pure-Python flow kernel function applied on each node. It must take 2 arguments: the node data and the time step
spec is the declaration of the data types. Optionally, an entry can also be declared as a (type, init_value) tuple
application_order is the logical order to follow when applying the kernel on the grid nodes
(optional) outputs is the sequence of data considered as flow kernel outputs. They are collected from the node data and updated in the kernel data after each kernel execution on a node
(optional) max_receivers defines a "static" maximum number of receivers (enabling internal optimizations and speed-ups). If this value is exceeded during the computation, an error is raised
(optional) n_threads is the number of threads for parallel execution. 1 means sequential execution, in which case a fast path directly calling a numba-jitted (inlined) function is used for better performance (~20-30%); higher values make the flow kernel be applied in parallel on that number of threads
(optional) print_generated_code enables debug printing of the internal jitted functions (to create a new node data, etc.)
(optional) print_stats enables printing statistics about the flow kernel build/compilation
Add a convenient create_flow_kernel free function that internally calls NumbaFlowKernelFactory and returns the kernel and its associated data packed as a tuple.
Allow an easier declaration of an eroder flow kernel from a NumbaEroderFlowKernel base class:
available from the fastscapelib.eroders public API
allows customizing how the kernel data is updated, or defining methods such as erode that both take as arguments the data to update in the kernel data and return some value(s) after the kernel has been applied on the grid
an example is provided in examples/test_eroder_kernel.ipynb
All of this is made available using a few intermediate classes that wrap the numba njitted functions or jitclasses generated by NumbaFlowKernelFactory, namely:
FlowKernelData and FlowKernelNodeData are the 2 jitclasses produced to hold the kernel data and the kernel node data
NumbaKernelData is a wrapper around FlowKernelData that allows easy access to the data from Python while also exposing the jitclass pointers so the underlying data can be used from C++
CPPKernelDataWrapper is a C++ binding holding the 2 pointers of a jitclass (meminfo and data), as sketched below
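A minimal sketch of the idea (illustrative only; the actual binding may hold or do more, e.g. keep the meminfo alive while C++ uses the data):

```cpp
// Illustrative sketch: a numba jitclass instance is seen from C++ as a pair
// of raw pointers, the numba runtime (NRT) meminfo used for lifetime
// management and the data payload itself. The kernel's type-erased func
// receives a pointer built from such payloads.
struct jitclass_ptrs
{
    void* meminfo;  // numba NRT meminfo pointer
    void* data;     // pointer to the jitclass data payload
};
```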
Parallel execution of flow kernels
The C++ flow_graph API was extended with an apply_kernel<FK, FKD> template method taking 2 arguments, namely the flow kernel and its associated data. A private thread pool was also added for any parallel execution needs.
apply_kernel dispatches the call to either apply_kernel_seq or apply_kernel_par depending on the kernel parameters n_threads and min_level_size:
n_threads corresponds to the number of execution threads used for kernel application. Note that the main thread currently blocks on a spin-lock while waiting for the execution threads to finish their jobs, effectively using n_threads+1 threads.
we could decide to use n_threads-1 execution threads instead, but then a value of 2 would lead to sequential execution, which is confusing
maybe renaming n_threads to n_execution_threads and keeping the current implementation would be better?
min_level_size sets the minimum level size (number of nodes on which the kernel can be applied in parallel) from which parallel execution is effective (below it, the execution is sequential)
Note: the templating of apply_kernel allows passing any kernel and data via duck typing; they only need to be consistent (the kernel function must apply on the FKD type). A rough sketch of the dispatch is given below.
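A rough sketch of that dispatch, shown as a free function for brevity (the real method is on flow_graph and its exact signature and return type are not reproduced here):

```cpp
// Illustrative sketch of the dispatch done by flow_graph::apply_kernel.
// FK and FKD are only constrained by duck typing: the kernel's func must be
// applicable to the node data produced from FKD.
template <class FK, class FKD>
void apply_kernel_sketch(FK& kernel, FKD& data)
{
    if (kernel.n_threads > 1)
    {
        apply_kernel_par(kernel, data);  // parallel path, uses the private thread pool
    }
    else
    {
        apply_kernel_seq(kernel, data);  // purely sequential path
    }
    // note: even in the parallel path, levels smaller than
    // kernel.min_level_size are processed sequentially (see below)
}
```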
The thread pool is a custom lock-free, busy-waiting thread pool designed to maximize throughput and minimize latency at each apply_kernel_par call:
the lock-free queues rely on the atomic_queue header-only library, added as a new dependency of this library
the thread pool can be paused and resumed, to avoid threads spin-locking all the time between kernel calls
apply_kernel_par runs the levels one by one, each level in parallel. It calls the run_blocks API of the thread pool to automatically balance the load between the threads
the min_block_size kernel property can also be set to force each active thread to process at least a given number of nodes. If this size is larger than the number of nodes in the level, there will be only a single active thread. A rough sketch of this level-by-level execution is given below.
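A rough sketch of the level-by-level parallel execution (illustrative only; node_data_for and the exact run_blocks call shape are assumptions made for the example):

```cpp
#include <cstddef>

// Illustrative sketch of apply_kernel_par: levels must be processed in order,
// but nodes within a level are independent and can be dispatched as blocks
// on the thread pool.
template <class FK, class FKD, class Levels, class Pool>
void apply_kernel_par_sketch(FK& kernel, FKD& data, const Levels& levels, Pool& pool)
{
    for (const auto& level : levels)
    {
        auto apply_node = [&](std::size_t inode)
        {
            // fill the node data (receivers, grid values, ...) for this node,
            // then call the type-erased kernel function on it
            kernel.func(node_data_for(data, inode));
        };

        if (level.size() < kernel.min_level_size)
        {
            // small level: not worth paying the parallel dispatch overhead
            for (auto inode : level)
            {
                apply_node(inode);
            }
        }
        else
        {
            // split the level into blocks of at least min_block_size nodes and
            // let run_blocks balance the load between the execution threads
            pool.run_blocks(level.begin(), level.end(), kernel.min_block_size, apply_node);
        }
    }
}
```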