apache / fury

A blazingly fast multi-language serialization framework powered by JIT and zero-copy.
https://fury.apache.org/
Apache License 2.0
3.11k stars 248 forks source link

feat(python): guide cython to optimize code generation #1905

Closed penguin-wwy closed 1 week ago

penguin-wwy commented 4 weeks ago

What does this PR do?

Optimize the C++ code generated by Cython by modifying the function implementations of the Buffer class.

Using get_int8 as an example, the following C++ code is generated before modification.

static PyObject *__pyx_pf_6pyfury_5_util_6Buffer_26get_int8(struct __pyx_obj_6pyfury_5_util_Buffer *__pyx_v_self, uint32_t __pyx_v_offset) {
  ...
  __pyx_t_1 = __pyx_f_6pyfury_5_util_6Buffer_get_int8(__pyx_v_self, __pyx_v_offset, 1);
  if (unlikely(PyErr_Occurred())) __PYX_ERR(0, 119, __pyx_L1_error)
  __pyx_t_2 = __Pyx_PyInt_From_int8_t(__pyx_t_1);
  if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 119, __pyx_L1_error)
  ...

Since get_int32 returns a C++ type, the cython needs to generate calling PyLong_FromLong, and guard code such as calling PyErr_Occurred. However, it is known that int32 can always be used to generate a PyLongObject with PyLong_FromLong. Therefore, when we manually call this, the cython only needs to check if the return value is null, as shown in the following code.

  __pyx_t_1 = __pyx_f_6pyfury_5_util_6Buffer_get_int8(__pyx_v_self, __pyx_v_offset, 1);
  if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 130, __pyx_L1_error)

Related issues

Does this PR introduce any user-facing change?

Benchmark

Microbenchmark case:

def benchmark_get_bool(buffer):
    buffer.get_bool(0)
    buffer.get_bool(1)
    buffer.get_bool(100)
    buffer.get_bool(1000)

def benchmark_get_int32(buffer):
    buffer.get_int32(0)
    buffer.get_int32(1)
    buffer.get_int32(100)
    buffer.get_int32(1000)

def benchmark_get_float(buffer):
    buffer.get_float(0)
    buffer.get_float(1)
    buffer.get_float(100)
    buffer.get_float(1000)

def benchmark_read(buffer):
    buffer.reader_index = 0
    buffer.read_int8()
    buffer.read_int16()
    buffer.read_int24()
    buffer.read_int32()
    buffer.read_int64()
    buffer.read_float()
    buffer.read_double()

Result:

# before
python benchmark.py --affinity 0
.....................
benchmark_get_bool: Mean +- std dev: 220 ns +- 4 ns
.....................
benchmark_get_int32: Mean +- std dev: 276 ns +- 10 ns
.....................
benchmark_get_float: Mean +- std dev: 254 ns +- 2 ns
.....................
benchmark_read: Mean +- std dev: 409 ns +- 11 ns

# after
python benchmark.py --affinity 0                                                                                                                                                                                                                                                                                    
.....................
benchmark_get_bool: Mean +- std dev: 215 ns +- 4 ns
.....................
benchmark_get_int32: Mean +- std dev: 264 ns +- 9 ns
.....................
benchmark_get_float: Mean +- std dev: 243 ns +- 6 ns
.....................
benchmark_read: Mean +- std dev: 380 ns +- 4 ns
chaokunyang commented 3 weeks ago

Hi @penguin-wwy , thanks for your effort. I see we replaced all specific types to generic pyobject. This may introduce overhead if we invoke from cython side. Are there some annotation to reduce checks for us automatically so we can avoiding change return types?

penguin-wwy commented 3 weeks ago

Hi @penguin-wwy , thanks for your effort. I see we replaced all specific types to generic pyobject. This may introduce overhead if we invoke from cython side. Are there some annotation to reduce checks for us automatically so we can avoiding change return types?

Yes, I found that this modification indeed disrupts cython's annotation optimization. However, I think we should write an appropriate Python microbenchmark, similar to what is done in integration_tests. This way, it will be easier to identify differences during optimization. I will draft this PR and proceed with the modifications once the microbenchmark is completed.