digikar99 / py4cl2-cffi

CFFI based alternative to py4cl2

TODO optimization #12

Open jcguu95 opened 9 months ago

jcguu95 commented 9 months ago

In the README, the last incomplete TODO is optimizing pythonizers and lispifiers using static-dispatch.

Would you mind explaining what the goal is? Even if we use static dispatch, the type of the object to be pythonized/lispified has to be determined at runtime. That means we still need a conditional to decide which branch to take. Would that optimize anything?
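
For concreteness, the runtime branching I have in mind looks something like this sketch (hypothetical pythonizer names, not the library's actual code):

(defun pythonize (object)
  ;; The type of OBJECT is only known at runtime, so a branch has to
  ;; be taken here no matter how the caller was compiled.
  (typecase object
    (integer (pythonize-integer object))
    (string  (pythonize-string object))
    (t       (pythonize-default object))))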

digikar99 commented 9 months ago

I should rewrite that TODO as just "optimization" (EDIT: Done in https://github.com/digikar99/py4cl2-cffi/commit/163e52b9d63a9d4aa2d8f4d40ce7e31a4841eaf7). Apparently, PyCall.jl is just 1.5 times slower* than native CPython; even though py4cl2-cffi might be 10 times faster than py4cl, it is still 10-20 times slower than CPython. So there's much scope for optimization. I haven't yet investigated the reason for the poor performance in any detail.

*I'm measuring the time to run the following loop and its Julia/PyCall.jl equivalent:

(time
 (loop for i below 1000000
   do (pycall "str" i)))

As for the static dispatch itself -

Even if we use static dispatch, the type of the object to be pythonized/lispified has to be determined at runtime. That means we still need a conditional to decide which branch to take. Would that optimize anything?

Because Lisp allows type derivation, either through SBCL-specific code or through cl-form-types-based CLTL2 mechanisms, it might be possible to statically dispatch and inline the pythonizers. Statically dispatching lispifiers might require type derivation on the Python side or, more realistically, knowing the result type of the Python call through the signature of the Python callable. All of this still requires compiler macros and perhaps CLTL2 as well.
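
As a rough illustration (hypothetical helper names, not the library's actual code), a compiler macro could ask cl-form-types for the derived type of the argument form and replace the generic call with a direct one whenever the type is known at compile time:

(define-compiler-macro pythonize (&whole form object &environment env)
  ;; Derive the compile-time type of OBJECT; fall back to the original
  ;; form when nothing useful is known. Multiple-value forms are
  ;; ignored in this sketch.
  (let ((type (cl-form-types:form-type object env)))
    (cond ((subtypep type 'integer) `(pythonize-integer ,object))
          ((subtypep type 'string)  `(pythonize-string ,object))
          (t form))))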

jcguu95 commented 9 months ago

How did you get the 10-20x figure?

digikar99 commented 9 months ago

The following python code

import time

def call_n_times(n):
    start = time.time()
    for i in range(n):
        x = str(i)
    return time.time() - start

print(call_n_times(1000000))

executes about 20 times faster than the corresponding py4cl2-cffi code:

(time
 (loop for i below 1000000
   do (pycall "str" i)))

jcguu95 commented 8 months ago

Just a guess: from this it seems that Python and Lisp communicate through a named pipe, right? That is IPC, which may be slower than what we could achieve. Since Python and Common Lisp run in the same process, there should be a more efficient way. Any thoughts?

digikar99 commented 8 months ago

Perhaps those should be renamed to *py-stdout-stream-pipe* and *py-stderr-stream-pipe*, along with other corresponding renamings.

These pipes are only used to forward Python's stdout and stderr asynchronously, so they have no impact on the performance disparity above.

I have been trying to work on the performance aspect, but so far I have only been able to achieve a 1.5-2x speedup rather than the required 10x. Let me commit my attempt so far to a separate branch. I haven't debugged why those changes make the tests fail, so I cannot merge them into the master branch yet.

digikar99 commented 8 months ago

This is up: https://github.com/digikar99/py4cl2-cffi/tree/optim

The three commits alone account for almost a 2x improvement in performance, as measured by the (pycall "str" i) loop above. I have other miscellaneous optimizations that account for another 50% improvement, bringing the overall improvement to 2.5-3x. I hope to investigate and commit them later in the week.

digikar99 commented 8 months ago

If you or any reader is interested in this, try stepping through and profiling the function calls involved in pycall-ing.
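
For instance, a statistical profile along these lines (using SBCL's sb-sprof; assumes py4cl2-cffi is loaded) should show where the time goes:

(require :sb-sprof)
(sb-sprof:with-profiling (:report :flat :max-samples 10000)
  (loop for i below 100000
        do (py4cl2-cffi:pycall "str" i)))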

Alternatively, you can contribute to cl-python and make it so that it can call CPython libraries.

digikar99 commented 8 months ago

So, before trying to optimize as much as possible, I think it would also be appropriate to check how fast we can actually go. Here's an attempt:

(defun fast-pystr (o)
  (declare (optimize speed))
  (unless (typep o 'c-long)
    (warn "Given integer ~S is too big to be interpreted as a C long" o))
  (with-pygc
    (with-python-gil
      ;; Convert the lisp integer to a PyLong, call str() on it, and
      ;; translate the resulting PyUnicode back to a lisp string.
      (let* ((pylong (foreign-funcall "PyLong_FromLong" :long o :pointer))
             (result (foreign-funcall-varargs
                      "PyObject_CallFunctionObjArgs"
                      (:pointer #.(pyvalue* "str"))
                      :pointer pylong
                      :pointer #.(null-pointer)
                      :pointer)))
        (pytrack pylong)
        (pytrack result)
        (nth-value 0
                   (foreign-string-to-lisp
                    (pyforeign-funcall "PyUnicode_AsUTF8"
                                       :pointer result
                                       :pointer)))))))

(fast-pystr 42) ;=> "42"

100,000 calls of this take at least 0.5-0.6 seconds (about 5-6 µs per call) on my i7-8750H locked to 0.8GHz.

(time
 (with-pygc
   (loop :for i below 100000
         do (fast-pystr i))))
Evaluation took:
  0.595 seconds of real time
  0.593025 seconds of total run time (0.577204 user, 0.015821 system)
  99.66% CPU
  1,309,570,346 processor cycles
  22,249,296 bytes consed

(EDIT) Here's a second attempt, which skips the explicit PyLong_FromLong and lets PyObject_CallFunction build the argument from a C long via the "l" format string:

(defun fast-pystr (o)
  (declare (optimize speed))
  (let* ((result (foreign-funcall-varargs
                  "PyObject_CallFunction"
                  (:pointer #.(pyvalue* "str")
                            :string "l")
                  :long o
                  :pointer)))
    ;; Return the raw pointer; the pytrack and lisp-string conversion
    ;; below are commented out for timing purposes.
    result
    ;; (pytrack result)
    ;; (nth-value 0
    ;;            (foreign-string-to-lisp
    ;;             (pyforeign-funcall "PyUnicode_AsUTF8"
    ;;                                :pointer result
    ;;                                :pointer)))
    ))

Now this takes 0.18 seconds. Uncommenting the pytrack and lisp-string conversion makes it take about 0.37 seconds.

(time
 (with-python-gil
   (with-pygc
     (loop :for i below 100000
           do (fast-pystr i)))))
Evaluation took:
  0.188 seconds of real time
  0.185572 seconds of total run time (0.169664 user, 0.015908 system)
  98.94% CPU
  409,723,108 processor cycles
  3,211,264 bytes consed

The equivalent python code still takes 0.06-0.07 seconds:

import time

def pystr(o):
    return str(o)

def call_n_times(n):
    start = time.time()
    for i in range(n):
        x = pystr(i)
    return time.time() - start

print(call_n_times(100000))

Julia with PyCall.jl takes around 0.13-0.14 seconds:

using PyCall

function call_pystr(n)
    for i = 1:n
        x = pystr(PyObject(i))
    end
end

@time call_pystr(100000)
@time call_pystr(100000)

digikar99 commented 8 months ago

Here's apparently the fastest way to do it, calling PyObject_Str directly instead of going through the generic call machinery:

(defun fast-pystr (o)
  (declare (optimize speed))
  (let* ((pylong (foreign-funcall "PyLong_FromLong" :long o :pointer))
         (result (foreign-funcall "PyObject_Str"
                                  :pointer pylong
                                  :pointer)))
    ;; As before, return the raw pointer; the pytrack and lisp-string
    ;; conversion below are commented out for timing purposes.
    result
    ;; (pytrack result)
    ;; (nth-value 0
    ;;            (foreign-string-to-lisp
    ;;             (pyforeign-funcall "PyUnicode_AsUTF8"
    ;;                                :pointer result
    ;;                                :pointer)))
    ))
(time
 (with-python-gil
   (with-pygc
     (loop :for i below 100000
           do (fast-pystr i)))))
Evaluation took:
  0.075 seconds of real time
  0.075608 seconds of total run time (0.071624 user, 0.003984 system)
  101.33% CPU
  166,921,062 processor cycles
  1,605,632 bytes consed

And it looks like we were doing an unfair comparison with Julia's PyCall!

using PyCall

function call_pycall_str(n)
    for i = 1:n
        x = py"str"(PyObject(i))
    end
end

@time call_pycall_str(10000)
@time call_pycall_str(10000)

This takes around 0.5 seconds per 10,000 calls!

Nonetheless, this suggests another key optimization: when a C builtin is available, compile the appropriate pycall form into a form that calls the C function directly.
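
A hedged sketch of what that compilation might look like, reusing FAST-PYSTR from above (hypothetical, not the library's actual pycall expansion):

(define-compiler-macro pycall (&whole form fun &rest args)
  ;; When the callable is the literal string "str" with one argument,
  ;; expand directly into the C-level call; otherwise leave the form
  ;; untouched.
  (if (and (stringp fun) (string= fun "str") (= 1 (length args)))
      `(fast-pystr ,(first args))
      form))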

digikar99 commented 7 months ago

Here is the summary!