Open PMeira opened 6 years ago
A Cython implementation will be investigated, especially since it's better integrated with NumPy.
After studying the different usages of CFFI, it becomes apparent that a ctypes implementation wouldn't help much. One of CFFI's modes actually uses ctypes directly, so we could just use that mode to test the speed if really necessary.
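For reference, a POSIX-only sketch of what a ctypes call looks like — resolving a symbol at runtime is essentially what CFFI's ABI mode does too (and one of CFFI's modes is backed by ctypes). None of the names below come from DSS Python; `strlen` from the already-loaded C library just stands in for an engine function:

```python
# POSIX-only sketch: ctypes resolving a C symbol at runtime,
# analogous to CFFI's ABI mode (no compiled extension involved).
import ctypes

libc = ctypes.CDLL(None)  # dlopen(NULL): search symbols already loaded into the process
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]
print(libc.strlen(b"hello"))  # 5
```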
An initial Cython implementation was added in the cython branch.
Initial observations from the somewhat naive implementation: Cython generates a huge C file (200k lines) compared to CFFI, and it takes a while to compile it (the CFFI file compiles instantly).
A quick set of microbenchmarks (reading most of the implemented properties) shows it is usually faster than CFFI. When both the benchmarks and this implementation are more complete, we can decide whether it is worth moving to Cython. Ideally we should use a set of real-world scripts to benchmark the implementation as a whole.
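A property-read microbenchmark can be sketched with `timeit`; the class below is a stand-in for a DSS API object (the real one dispatches into the C library through the binding layer being measured):

```python
# Microbenchmark sketch: time repeated property reads.
# FakeBus is a stand-in for a DSS API object; in the real benchmark the
# property would call through CFFI/Cython into the engine.
import timeit

class FakeBus:
    @property
    def voltages(self):
        return [1.02, 0.98, 1.01]

bus = FakeBus()
n = 100_000
total = timeit.timeit("bus.voltages", globals={"bus": bus}, number=n)
print(f"{total / n * 1e9:.0f} ns per property read")
```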
I also ran the benchmark with PyPy on Windows. Using lists instead of NumPy arrays (didn't want to spend time building NumPy on Windows) resulted in a 4-20x speedup compared to CPython.
Added Pybind11 as another target of this comparison. Some libraries have moved from Cython and CFFI to it and usually people are happy. Since KLUSolve already needs a C++ compiler and we're trying to provide binary packages for most platforms, the C++ dependency wouldn't be an issue.
It seems the call overhead of Pybind11 is higher than both Cython and CFFI.
Almost 2 years later, the most likely future is to move to Cython after updating and benchmarking the current versions. The main advantage of Cython is that it creates the NumPy arrays at the C side, removing some overhead.
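To illustrate the overhead in question (sketch, assumes NumPy is installed): with CFFI, C fills a buffer and the Python side then wraps it as an ndarray, whereas Cython could construct the ndarray directly on the C side:

```python
# Sketch: wrapping a C-filled buffer as an ndarray from the Python side,
# which is the step Cython could move into C entirely.
import ctypes
import numpy as np

buf = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)  # pretend the engine filled this
arr = np.frombuffer(buf, dtype=np.float64)       # zero-copy view over the buffer
print(arr.tolist())  # [1.0, 2.0, 3.0, 4.0]
```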
Since this is not urgent, we can wait for the initial results of the HPy project to decide. CFFI is very convenient and I'd like to keep it if possible. If HPy develops enough by the end of the year, fully migrating performance-dependent code to PyPy would be the recommendation.
If Cython 3.0 is released before the final 0.11 release, I'll retest it and probably use Cython as the main supporting code.
Looking at the size of the CFFI binaries, though, I think it is important to keep them too. One reason is backwards compatibility, another is that users would need a compiler to achieve some things that already can be done with CFFI. As long as the same underlying DSS C-API binary is used, things should work smoothly, and hopefully faster for many use-cases.
Moving back to 0.11.
DragonFFI is close to how CFFI works, but uses LLVM instead to process the headers/code: https://github.com/aguinet/dragonffi
It uses Pybind11. Probably worth checking on CPython.
If we generate the code ourselves or use one of the existing tools, due to the high number of variants required here -- (ARM32, ARM64, x86, x86-64) x (Windows, Linux, macOS) -- Py_LIMITED_API (#24) should be a requirement.
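A minimal sketch of what a stable-ABI build could look like with setuptools (module and file names here are hypothetical, not the project's actual build script); with Py_LIMITED_API defined and the abi3 wheel tag, a single binary per platform/architecture covers all supported CPython versions:

```python
# setup.py sketch -- hypothetical names, not the project's actual build script
from setuptools import setup, Extension

setup(
    ext_modules=[
        Extension(
            "_dss_native",                    # hypothetical module name
            sources=["src/module.c"],
            define_macros=[("Py_LIMITED_API", "0x03070000")],  # stable ABI, CPython >= 3.7
            py_limited_api=True,              # produces an abi3-tagged wheel
        )
    ],
)
```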
Since Cython is a bit slow to progress, the other options will be re-evaluated, as well as generating a full Python module in C/C++ against the CPython API or HPy. Most of the API-binding code for DSS Python is trivial, and the main Cython advantage is the creation of NumPy arrays in C. So, provided we can do that without too many issues, no faster API could be developed for CPython.
The current CFFI implementation can be kept here for PyPy and others, and that can be used for benchmarks and comparisons. We should also set up a comparison across different CPython versions.
Related: https://github.com/dss-extensions/dss_matlab/issues/13
Cython reached the 3.0 beta stage, yet Py_LIMITED_API is not supported. It's a good point to retest the performance though.
pybind11 seems to support Py_LIMITED_API. I also found https://github.com/google/pywrapcc with some interesting comments.
Nowadays our codebase is mature enough to consider keeping the C extension code. Our Python C-API usage is minimal. Creating NumPy arrays directly would be good; something like .NET's ArrayPool would certainly be beneficial too. It could be fun to test HPy as well.
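A hedged sketch of the ArrayPool-style idea (names are hypothetical; .NET's System.Buffers.ArrayPool is just the inspiration): buffers are keyed by size and handed back for reuse instead of being reallocated on every API call.

```python
# Sketch of an ArrayPool-like buffer reuse scheme (hypothetical names).
from collections import defaultdict

class BufferPool:
    def __init__(self):
        self._free = defaultdict(list)  # size -> list of spare buffers

    def rent(self, size):
        spares = self._free[size]
        return spares.pop() if spares else bytearray(size)

    def give_back(self, buf):
        self._free[len(buf)].append(buf)

pool = BufferPool()
a = pool.rent(1024)
pool.give_back(a)
b = pool.rent(1024)
print(a is b)  # True: the buffer was reused, not reallocated
```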
Considering we're moving to C++ in the engine, nanobind seems to fit well and has integration with some important packages already: https://github.com/wjakob/nanobind
This would move some of the Python code to C++, but it would actually simplify some features, like the error checks.
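For context, the Python-side error check pattern mentioned above can be sketched like this, with a stub standing in for the wrapped library (the function names are illustrative): every engine call is followed by a check that raises a Python exception when the engine flagged an error.

```python
# Sketch of the Python-side error check pattern; FakeLib is a stub
# standing in for the wrapped DSS library (names are illustrative).
class DSSException(Exception):
    def __init__(self, number, message):
        super().__init__(number, message)
        self.number = number

class FakeLib:
    def Error_Get_Number(self):
        return 7  # pretend the engine flagged error 7

    def Error_Get_Description(self):
        return "Simulated engine error"

def check_for_error(lib):
    # Called after each engine call; converts error codes to exceptions.
    err = lib.Error_Get_Number()
    if err:
        raise DSSException(err, lib.Error_Get_Description())

try:
    check_for_error(FakeLib())
except DSSException as e:
    print(e.number)  # 7
```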
I also tested a simple function to convert the errors directly to exceptions using Python's C-API:
void AltDSSPy_HandleError(int32_t errorNumber, const char* errorMessage)
{
    PyGILState_STATE gstate = PyGILState_Ensure();
    /* Borrowed reference to the already-imported module that defines DSSException */
    PyObject *util_mod = PyImport_AddModule("dss._cffi_api_util");
    PyObject *dss_exc = PyObject_GetAttrString(util_mod, "DSSException");
    PyObject *exc_args = PyTuple_New(2);
    /* PyTuple_SetItem steals the references to the freshly created objects */
    PyTuple_SetItem(exc_args, 0, PyLong_FromLong(errorNumber));
    PyTuple_SetItem(exc_args, 1, PyUnicode_FromString(errorMessage));
    PyErr_SetObject(dss_exc, exc_args);
    Py_DECREF(exc_args); /* PyErr_SetObject does not steal this reference */
    Py_DECREF(dss_exc);
    PyGILState_Release(gstate);
}
This works fine, but we need to adapt each Python-facing function to return NULL, otherwise we get a SystemError exception too. At that point, the helper loses its utility.
This will be done after we fully migrate to C++ to avoid introducing even more changes in the current implementation.
(title of the issue updated)
Try to write an implementation based on ctypes and/or Cython to compare performance of the bindings on CPython and PyPy.
Investigate whether Numba's CFFI support can be useful to speed up anything.
Document pain points for each approach.