Add flow kernels API and parallel execution

Add the capability to write flow kernels, both in C++ and Python, and to apply them in parallel.
For Python flow kernels, rely on numba to JIT-compile data and functions, and call them without any CPython call so that the GIL can be released and multithreaded execution can be fully exploited.
Flow kernel C++ API
add a C++ API for a flow kernel (flow_kernel) and support in flow_graph for applying it on nodes using apply_kernel
the basic idea is to create a struct containing type-erased data and functions that the flow_graph manipulates to apply the kernel on all the nodes
flow_kernel::func is the kernel itself, with signature std::function<int(void*)>: it takes the node data as its only argument and returns an error status (0 is OK)
depending on flow_kernel::n_threads, the kernel is applied sequentially or in parallel (multithreaded) on a custom thread pool
flow_kernel::application_order specifies the logical order of kernel application. As of today, only kernel_application_order::ANY and kernel_application_order::BREADTH_UPSTREAM are supported (see below for this enum)
Add a kernel_application_order enum to specify the logical order in which a kernel is applied on a flow graph (a C++ sketch is given after the list below):
ANY: the kernel can be applied in any order (best for parallel execution!)
DEPTH_DOWNSTREAM: follows a reversed depth-first-search traversal (downstream along the rivers)
DEPTH_UPSTREAM: follows a depth-first-search traversal (upstream along the rivers)
BREADTH_DOWNSTREAM: follows a reversed breadth-first-search traversal (downstream along the rivers)
BREADTH_UPSTREAM: follows a breadth-first-search traversal (upstream along the rivers)
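A minimal C++ sketch of these two pieces, with the member names and the func signature taken from the description above (default values, member types, and any other detail are illustrative assumptions):

```cpp
#include <cstddef>
#include <functional>

// Logical order in which a kernel is applied on the flow graph nodes.
enum class kernel_application_order
{
    ANY,
    DEPTH_DOWNSTREAM,
    DEPTH_UPSTREAM,
    BREADTH_DOWNSTREAM,
    BREADTH_UPSTREAM
};

// Type-erased flow kernel manipulated by flow_graph::apply_kernel.
// The actual struct also holds type-erased data and a few more callbacks
// (e.g. to create and fill the per-node data); only the members described
// above are sketched here.
struct flow_kernel
{
    std::function<int(void*)> func;  // the kernel itself; takes the node data,
                                     // returns an error status (0 is OK)
    kernel_application_order application_order = kernel_application_order::ANY;
    std::size_t n_threads = 1;       // 1: sequential, > 1: parallel on the thread pool
    std::size_t min_level_size = 1;  // below this level size, run the level sequentially
    std::size_t min_block_size = 1;  // minimum number of nodes per active thread
};
```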
Add capability to perform a BFS traversal of the graph
add flow_graph_impl::m_bfs_indices and flow_graph_impl::m_bfs_levels members
add flow_graph_impl::compute_bfs_indices_bottomup to compute these properties for the BREADTH_UPSTREAM kernel application order
add flow_graph_impl::bfs_indices and flow_graph_impl::bfs_levels getters
both the indices and the "levels" are stored to keep track of dependencies for parallel execution (a level contains only mutually independent nodes); a sketch of the idea is given below
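A rough sketch of the idea behind these levels, for a single-flow graph where receivers[i] is the unique receiver of node i, base-level nodes are their own receiver, and dfs_order lists nodes from downstream to upstream (all of this is illustrative, not the actual compute_bfs_indices_bottomup implementation):

```cpp
#include <cstddef>
#include <vector>

// Illustrative: the BFS "level" of a node is one more than the level of its
// receiver, so any (transitive) receiver sits in a strictly lower level.
// Nodes sharing a level are therefore independent and can be processed in
// parallel, as long as levels are processed in increasing order.
std::vector<std::size_t> compute_bfs_levels(const std::vector<std::size_t>& receivers,
                                            const std::vector<std::size_t>& dfs_order)
{
    std::vector<std::size_t> levels(receivers.size(), 0);

    for (auto inode : dfs_order)  // downstream to upstream: receiver visited first
    {
        if (receivers[inode] != inode)  // base-level nodes stay at level 0
        {
            levels[inode] = levels[receivers[inode]] + 1;
        }
    }
    return levels;
}
```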
Add capability to perform a "random" traversal of the graph (no particular order)
it follows the storage order for efficiency
it is used when the ANY kernel application order is selected
add flow_graph_impl::m_storage_indices and flow_graph_impl::m_random_levels members, computed once in the constructor (this only requires the grid size to be known)
add flow_graph_impl::storage_indices and flow_graph_impl::random_levels getters
Flow kernel Python API
Add a Python NumbaFlowKernelFactory class in the fastscapelib.flow.numba_kernel submodule to allow declaring a fastscapelib flow kernel from a flow_graph and user-defined data and parameters, relying on numba for just-in-time compilation of the kernel and data to be used by the C++ code:
flow_graph is the flow graph from which receiver indices, counts, etc. are taken when dispatching this information into the node data from the data mapped on the underlying grid
kernel_func is the pure-Python flow kernel function applied on each node. It must take 2 arguments: the node data and the time step
spec is the declaration of the data types. Optionally, an entry can also be declared as a (type, init_value) tuple
application_order is the logical order to follow when applying the kernel on the grid nodes
(optional) outputs is the sequence of data considered as flow kernel outputs. They are collected from the node data and updated in the kernel data after each kernel execution on a node
(optional) max_receivers defines a "static" maximum number of receivers (enabling internal optimizations and speed-ups). If this value is exceeded during the computation, an error is raised
(optional) n_threads is the number of threads for parallel execution. 1 means sequential execution, in which case a fast path directly calling a numba-jitted (inlined) function is used for better performance (~20-30%); higher values make the flow kernel be applied in parallel on that number of threads
(optional) print_generated_code enables debug printing of the internal jitted functions (to create a new node data, etc.)
(optional) print_stats enables printing statistics about the flow kernel build/compilation
Add a convenient create_flow_kernel free function that internally calls NumbaFlowKernelFactory and returns the kernel and its associated data packed as a tuple.
Allow an easier declaration of an eroder flow kernel from a NumbaEroderFlowKernel base class:
available from the fastscapelib.eroders public API
allows customizing how the kernel data is updated, or defining methods such as erode that both take as arguments the data to update in the kernel data and return some value(s) after the kernel has been applied on the grid
an example is provided in examples/test_eroder_kernel.ipynb
All of this is made available using a few intermediate classes that wrap the numba njitted functions or jitclasses generated by NumbaFlowKernelFactory, namely:
FlowKernelData and FlowKernelNodeData are the 2 jitclasses produced to hold the kernel data and the kernel node data
NumbaKernelData is a wrapper around FlowKernelData that allows easy access to the data from Python while also exposing the jitclass pointers so the underlying data can be used from C++
CPPKernelDataWrapper is a C++ binding holding the 2 pointers of a jitclass (meminfo and data), as sketched below
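A minimal sketch of the idea (illustrative only; the actual binding may hold or do more, e.g. keep the meminfo alive while C++ uses the data):

```cpp
// Illustrative sketch: a numba jitclass instance is seen from C++ as a pair
// of raw pointers, the numba runtime (NRT) meminfo used for lifetime
// management and the data payload itself. The kernel's type-erased func
// receives a pointer built from such payloads.
struct jitclass_ptrs
{
    void* meminfo;  // numba NRT meminfo pointer
    void* data;     // pointer to the jitclass data payload
};
```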
Parallel execution of flow kernels
The C++ flow_graph API was extended with an apply_kernel<FK, FKD> template method taking 2 arguments, namely the flow kernel and its associated data. A private thread pool was also added for any parallel execution needs.
apply_kernel dispatches the call to either apply_kernel_seq or apply_kernel_par depending on the kernel parameters n_threads and min_level_size:
n_threads corresponds to the number of execution threads used for kernel application. Note that the main thread currently blocks on a spin-lock while waiting for the execution threads to finish their jobs, effectively using n_threads+1 threads.
we could decide to use n_threads-1 execution threads instead, but then a value of 2 would lead to sequential execution, which is confusing
maybe renaming n_threads to n_execution_threads and keeping the current implementation would be better?
min_level_size sets the minimum level size (number of nodes on which the kernel can be applied in parallel) from which parallel execution is effective (below it, the execution is sequential)
Note: the templating of apply_kernel allows passing any kernel and data via duck typing; they only need to be consistent (the kernel function must apply on the FKD type). A rough sketch of the dispatch is given below.
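A rough sketch of that dispatch, shown as a free function for brevity (the real method is on flow_graph and its exact signature and return type are not reproduced here):

```cpp
// Illustrative sketch of the dispatch done by flow_graph::apply_kernel.
// FK and FKD are only constrained by duck typing: the kernel's func must be
// applicable to the node data produced from FKD.
template <class FK, class FKD>
void apply_kernel_sketch(FK& kernel, FKD& data)
{
    if (kernel.n_threads > 1)
    {
        apply_kernel_par(kernel, data);  // parallel path, uses the private thread pool
    }
    else
    {
        apply_kernel_seq(kernel, data);  // purely sequential path
    }
    // note: even in the parallel path, levels smaller than
    // kernel.min_level_size are processed sequentially (see below)
}
```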
The thread pool is a custom lock-free, busy-waiting thread pool designed to maximize throughput and minimize latency at each apply_kernel_par call:
the lock-free queues rely on the atomic_queue header-only library, added as a new dependency of this library
the thread pool can be paused and resumed, to avoid threads spin-locking all the time between kernel calls
apply_kernel_par runs the levels one by one, each level in parallel. It calls the run_blocks API of the thread pool to automatically balance the load between the threads
the min_block_size kernel property can also be set to force each active thread to process at least a given number of nodes. If this size is larger than the number of nodes in the level, there will be only a single active thread. A rough sketch of this level-by-level execution is given below.
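A rough sketch of the level-by-level parallel execution (illustrative only; node_data_for and the exact run_blocks call shape are assumptions made for the example):

```cpp
#include <cstddef>

// Illustrative sketch of apply_kernel_par: levels must be processed in order,
// but nodes within a level are independent and can be dispatched as blocks
// on the thread pool.
template <class FK, class FKD, class Levels, class Pool>
void apply_kernel_par_sketch(FK& kernel, FKD& data, const Levels& levels, Pool& pool)
{
    for (const auto& level : levels)
    {
        auto apply_node = [&](std::size_t inode)
        {
            // fill the node data (receivers, grid values, ...) for this node,
            // then call the type-erased kernel function on it
            kernel.func(node_data_for(data, inode));
        };

        if (level.size() < kernel.min_level_size)
        {
            // small level: not worth paying the parallel dispatch overhead
            for (auto inode : level)
            {
                apply_node(inode);
            }
        }
        else
        {
            // split the level into blocks of at least min_block_size nodes and
            // let run_blocks balance the load between the execution threads
            pool.run_blocks(level.begin(), level.end(), kernel.min_block_size, apply_node);
        }
    }
}
```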