Closed MothNik closed 4 months ago
I changed the focus a bit after finding out that out-of-the-box spline solutions are roughly as fast as the recursion formulas applied for the Hermite functions, even when almost no knots are used.
The NumPy implementation is hard to parallelize given the natural limitations of Python's parallelization (at least so far). The Numba implementation, on the other hand, could be parallelized, but inspecting the parallel diagnostics revealed that the inner loop was never actually parallelized; presumably the compiler was being cautious about it.
For maximum flexibility, a multi-threaded Cython version was implemented, and it scales nicely. Already for low orders, the execution time is divided almost exactly by the number of threads used (4 in this case), i.e., the parallelization overhead is way smaller than the actual computations:
Currently, the Hermite function computation is a bit slow for moderate numbers of data points and orders:
Admittedly, all the logarithms required for numerical stability are a significant limiting factor.
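To make the stability issue concrete, here is a sketch of the underlying recurrence in plain floating-point arithmetic (the logarithmic safeguards against over-/underflow at extreme orders and arguments, which are the costly part, are deliberately omitted here):

```python
import numpy as np


def hermite_functions(x, n_max):
    """Evaluate the orthonormal Hermite functions psi_0..psi_n_max at x
    via the three-term recurrence (plain floats; no log-based stabilization).
    """
    x = np.asarray(x, dtype=np.float64)
    psi = np.empty((n_max + 1,) + x.shape)
    # psi_0(x) = pi^(-1/4) * exp(-x^2 / 2)
    psi[0] = np.pi ** (-0.25) * np.exp(-0.5 * x * x)
    if n_max >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]  # psi_1 = sqrt(2) * x * psi_0
    for n in range(1, n_max):
        # psi_{n+1} = sqrt(2/(n+1)) * x * psi_n - sqrt(n/(n+1)) * psi_{n-1}
        psi[n + 1] = (np.sqrt(2.0 / (n + 1)) * x * psi[n]
                      - np.sqrt(n / (n + 1.0)) * psi[n - 1])
    return psi
```

The Gaussian prefactor in `psi[0]` underflows for large `|x|`, which is exactly why log-space bookkeeping becomes necessary in practice.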
However, the mathematics behind the Hermite functions will not change anytime soon, so it should be possible to pre-compute the practically relevant ones (e.g., orders of up to 50,000) in terms of, e.g., B-splines. With a well-defined set of pre-computed coefficients and knots, the complexity of the evaluations might be reduced.
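As a quick feasibility check for this idea, one can fit a cubic B-spline to a single Hermite function on a dense grid and see how accurate the cheap spline evaluation is. This uses SciPy's `make_interp_spline` with its default knot placement; the helper `psi_n` below is illustrative, not part of the package:

```python
import numpy as np
from scipy.interpolate import make_interp_spline


def psi_n(x, n):
    """Orthonormal Hermite function psi_n via the three-term recurrence
    (ground truth for the spline fit; illustrative helper only)."""
    p_prev = np.pi ** (-0.25) * np.exp(-0.5 * x * x)
    if n == 0:
        return p_prev
    p_curr = np.sqrt(2.0) * x * p_prev
    for k in range(1, n):
        p_prev, p_curr = p_curr, (np.sqrt(2.0 / (k + 1)) * x * p_curr
                                  - np.sqrt(k / (k + 1.0)) * p_prev)
    return p_curr


# fit a cubic B-spline to psi_5 on a dense grid ...
x_dense = np.linspace(-8.0, 8.0, 2001)
spline = make_interp_spline(x_dense, psi_n(x_dense, 5), k=3)

# ... and measure the worst-case error at off-grid points
x_test = np.linspace(-8.0, 8.0, 5003)
err = np.max(np.abs(spline(x_test) - psi_n(x_test, 5)))
```

Once fitted, the knots and coefficients could be stored and the spline re-evaluated without ever running the recurrence again; the open question is how many knots are needed for very high orders.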
A first approach will be to try the fast knot placement strategy proposed in this publication.