apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
168 stars 35 forks source link

DISC: Prefer nanobind to Cython #597

Open WillAyd opened 3 weeks ago

WillAyd commented 3 weeks ago

Wanted to open this as a topic for discussion! I've used nanobind a lot lately, and find it to be a very good library. Cython of course, has been widely used in scientific Python projects for some time, but I think suffers from a few usability issues:

  1. The Cython debugger is broken, and has been for a long time
  2. IDE support is limited
  3. The syntax for mixing in raw C code is rather awkward (ex: getting the pointer to the start of an array in Cython)
  4. Performance benefits are not always clear, and subtle differences in declarations can have drastic impacts (Cython annotations can help you inspect this, if you use them)
  5. Specific to nanoarrow, the build system has a rather complicated setup whereby it generates Cython header files from nanoarrow sources

Nanobind rather natively solves issues 1-3 above. Point 4 is partially solved by nanobind; a missing declaration wouldn't be the culprit for poor performance, but you still would need to understand the impacts of interaction with Python objects. For point 5, nanobind was also recently added to the Meson wrapdb, so if nanoarrow decided to use meson + meson-python it could drastically simplify the process of building Python extensions.

Of course, the downside to nanobind is that you are trading Python syntax for C++, which may be off-putting to some developers. Taking CArrayBuilder.append_bytes as an example, the current code looks like (N.B. the current code probably should declare code to be of type ArrowErrorCode but does not):

    def append_bytes(self, obj: Iterable[Union[str, None]]) -> CArrayBuilder:
        cdef Py_buffer buffer
        cdef ArrowBufferView item

        for py_item in obj:
            if py_item is None:
                code = ArrowArrayAppendNull(self._ptr, 1)
            else:
                PyObject_GetBuffer(py_item, &buffer, PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT)

                if buffer.ndim != 1:
                    raise ValueError("Can't append buffer with dimensions != 1 to binary array")

                if buffer.itemsize != 1:
                    PyBuffer_Release(&buffer)
                    raise ValueError("Can't append buffer with itemsize != 1 to binary array")

                item.data.data = buffer.buf
                item.size_bytes = buffer.len
                code = ArrowArrayAppendBytes(self._ptr, item)
                PyBuffer_Release(&buffer)

            if code != NANOARROW_OK:
                Error.raise_error(f"append bytes item {py_item}", code)

in nanobind, an equivalent implementation probably looks like this (N.B. this is untested):

auto AppendBytes(const CArrayBuilder &builder, nb::iterable obj) 
    ArrowErrorCode code;
    ArrowBufferView item

    for (const auto &py_item : obj) {
      if (py_item.is_none()) {
        code = ArrowArrayAppendNull(ptr_, 1);
      } else {
        Py_buffer buffer;
        PyObject_GetBuffer(py_item, &buffer, PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT);

        if (buffer.ndim != 1)
          throw nb::value_error("Can't append buffer with dimensions != 1 to binary array");

        if (buffer.itemsize != 1) {
          PyBuffer_Release(&buffer);
          throw nb::value_error("Can't append buffer with itemsize != 1 to binary array");
        }

        item.data.data = buffer.buf;
        item.size_bytes = buffer.len;
        code = ArrowArrayAppendBytes(ptr_, item);
        PyBuffer_Release(&buffer);
      }

      if (code != NANOARROW_OK)
        // TODO: would need to define NanoarrowException class as subclass of RuntimeError
        throw NanoarrowException("append bytes item " + py_item.str() + " failed " + std::to_string(code));
    }
}
paleolimbot commented 2 weeks ago

I'm definitely a +1 to moving off of Cython for all the reasons you mentioned! I am not sure I would call it a development priority at the moment...we can probably remove the automatic pxd generation in favour of what everybody else does, which is manually add Cython declarations whenever the C library changes.

One place to start is to migrate some of the low-level code out of Cython to C, like the AppendBytes() bit you picked up on, or some of the memory management/callback code. Then it can be called from Cython or by a future nanobind implementation (and in the meantime is eaiser to write and debug). Rather than THROW_NOT_OK it would do whatever Python does for this in its C API (set erroccured or something).

All just thoughts! I recently went through and modularized the Cython components...if it were to switch over to nanobind, I think we could maybe try switching over one .pyx at a time? That would probably require removing all of the cimports (probably by accessing components like some_instance._schema as <ArrowSchema*><uintptr_t>some_instance._schema_addr() or something.

WillAyd commented 2 weeks ago

In your mind are you thinking of creating these in plain C just to increase their use outside of nanoarrow? The challenge I think with plain C extensions is getting reference counting correct, especially with branches that may return early. I think an equivalent of the above code in C would look like:

static int AppendBytes(const CArrayBuilder *builder, PyObject *obj) {
    ArrowErrorCode code;
    ArrowBufferView item

    if (!PyIter_Check(obj)) {
      PyErr_SetString(PyExc_TypeError, "Object is not iterable");
      return -1;
    }

    PyObject *iterator = PyObject_GetIter(obj);
    if (iterator == NULL)
      return NULL;

    PyObject *item;
    while ((item = PyIter_Next(iterator))) {
      if (item == Py_None) {
        code = ArrowArrayAppendNull(builder->_ptr, 1);
      } else {
        Py_buffer buffer;
        if (PyObject_GetBuffer(py_item, &buffer, PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT)) {
          Py_DECREF(item);
          Py_DECREF(iterator);
          return NULL;
        }

        if (buffer.ndim != 1) {
          PyErr_SetString(PyExc_ValueError, "Can't append buffer with dimensions != 1 to binary array");
          // TODO: current implementation might be missing a release?
          // PyBuffer_Release(&buffer);  
          Py_DECREF(item);
          Py_DECREF(iterator);
          return NULL;
        }

        if (buffer.itemsize != 1) {
          PyErr_SetString(PyExc_ValueError, "Can't append buffer with itemsize != 1 to binary array");
          PyBuffer_Release(&buffer);
          Py_DECREF(item);
          Py_DECREF(iterator);
          return NULL;          
        }

        item.data.data = buffer.buf;
        item.size_bytes = buffer.len;
        code = ArrowArrayAppendBytes(ptr_, item);
        PyBuffer_Release(&buffer);
      }

      if (code != NANOARROW_OK) {
        // TODO: might be a format specifier that doesn't require PyObject_Str
        PyObject *str = PyObject_Str(item);
        // TODO: might need to create a custom exception        
        PyErr_Format(PyExc_ValueError, "append bytes item %s failed %d", str, code);
        Py_DECREF(str);
        Py_DECREF(item);
        Py_DECREF(iterator);
        return -1;
      }
      Py_DECREF(item);
    }

    Py_DECREF(iterator);
    return 0;
}

of course code use goto cleverly to better simulate an RAII cleanup of resources, but again wanted to keep this as equivalent to current code as possible

paleolimbot commented 2 weeks ago

The C Code is definitely more verbose and might not be the way to go! I am used to R's C API which is significantly worse than this, so it doesn't happen to bother me. It could also be C++ without nanobind (e.g., pyarrow's Python-specific support layer), or just straight to nanobind. The purpose is would be just to get code that doesn't have to be Cython out of Cython (to better integrate with a future Cython-free solution/make the transition doable in smaller PRs).

WillAyd commented 2 weeks ago

That's a great question. The main challenge is figuring out how to manage the Cython import / cimport statements being in an intermediate state. I think they have some pretty overloaded functionality within the context of Cython, so not sure how feasible it would even be to go piecewise in a conversion, or if it would really have to be done all at once (hopefully not)