Open WillAyd opened 2 months ago
I'm definitely a +1 to moving off of Cython for all the reasons you mentioned! I am not sure I would call it a development priority at the moment...we can probably remove the automatic pxd generation in favour of what everybody else does, which is manually add Cython declarations whenever the C library changes.
One place to start is to migrate some of the low-level code out of Cython to C, like the AppendBytes()
bit you picked up on, or some of the memory management/callback code. Then it can be called from Cython or by a future nanobind implementation (and in the meantime is eaiser to write and debug). Rather than THROW_NOT_OK it would do whatever Python does for this in its C API (set erroccured or something).
All just thoughts! I recently went through and modularized the Cython components...if it were to switch over to nanobind, I think we could maybe try switching over one .pyx at a time? That would probably require removing all of the cimport
s (probably by accessing components like some_instance._schema
as <ArrowSchema*><uintptr_t>some_instance._schema_addr()
or something.
In your mind are you thinking of creating these in plain C just to increase their use outside of nanoarrow? The challenge I think with plain C extensions is getting reference counting correct, especially with branches that may return early. I think an equivalent of the above code in C would look like:
static int AppendBytes(const CArrayBuilder *builder, PyObject *obj) {
ArrowErrorCode code;
ArrowBufferView item
if (!PyIter_Check(obj)) {
PyErr_SetString(PyExc_TypeError, "Object is not iterable");
return -1;
}
PyObject *iterator = PyObject_GetIter(obj);
if (iterator == NULL)
return NULL;
PyObject *item;
while ((item = PyIter_Next(iterator))) {
if (item == Py_None) {
code = ArrowArrayAppendNull(builder->_ptr, 1);
} else {
Py_buffer buffer;
if (PyObject_GetBuffer(py_item, &buffer, PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT)) {
Py_DECREF(item);
Py_DECREF(iterator);
return NULL;
}
if (buffer.ndim != 1) {
PyErr_SetString(PyExc_ValueError, "Can't append buffer with dimensions != 1 to binary array");
// TODO: current implementation might be missing a release?
// PyBuffer_Release(&buffer);
Py_DECREF(item);
Py_DECREF(iterator);
return NULL;
}
if (buffer.itemsize != 1) {
PyErr_SetString(PyExc_ValueError, "Can't append buffer with itemsize != 1 to binary array");
PyBuffer_Release(&buffer);
Py_DECREF(item);
Py_DECREF(iterator);
return NULL;
}
item.data.data = buffer.buf;
item.size_bytes = buffer.len;
code = ArrowArrayAppendBytes(ptr_, item);
PyBuffer_Release(&buffer);
}
if (code != NANOARROW_OK) {
// TODO: might be a format specifier that doesn't require PyObject_Str
PyObject *str = PyObject_Str(item);
// TODO: might need to create a custom exception
PyErr_Format(PyExc_ValueError, "append bytes item %s failed %d", str, code);
Py_DECREF(str);
Py_DECREF(item);
Py_DECREF(iterator);
return -1;
}
Py_DECREF(item);
}
Py_DECREF(iterator);
return 0;
}
of course code use goto cleverly to better simulate an RAII cleanup of resources, but again wanted to keep this as equivalent to current code as possible
The C Code is definitely more verbose and might not be the way to go! I am used to R's C API which is significantly worse than this, so it doesn't happen to bother me. It could also be C++ without nanobind (e.g., pyarrow's Python-specific support layer), or just straight to nanobind. The purpose is would be just to get code that doesn't have to be Cython out of Cython (to better integrate with a future Cython-free solution/make the transition doable in smaller PRs).
That's a great question. The main challenge is figuring out how to manage the Cython import / cimport statements being in an intermediate state. I think they have some pretty overloaded functionality within the context of Cython, so not sure how feasible it would even be to go piecewise in a conversion, or if it would really have to be done all at once (hopefully not)
Wanted to open this as a topic for discussion! I've used nanobind a lot lately, and find it to be a very good library. Cython of course, has been widely used in scientific Python projects for some time, but I think suffers from a few usability issues:
Nanobind rather natively solves issues 1-3 above. Point 4 is partially solved by nanobind; a missing declaration wouldn't be the culprit for poor performance, but you still would need to understand the impacts of interaction with Python objects. For point 5, nanobind was also recently added to the Meson wrapdb, so if nanoarrow decided to use meson + meson-python it could drastically simplify the process of building Python extensions.
Of course, the downside to nanobind is that you are trading Python syntax for C++, which may be off-putting to some developers. Taking
CArrayBuilder.append_bytes
as an example, the current code looks like (N.B. the current code probably should declarecode
to be of typeArrowErrorCode
but does not):in nanobind, an equivalent implementation probably looks like this (N.B. this is untested):