digikar99 / py4cl2

Call python from Common Lisp
https://digikar99.github.io/py4cl2/

Ideas: Parallel Executions #22

Open jcguu95 opened 1 year ago

jcguu95 commented 1 year ago

About parallel execution, there are many details and options at hand. I have played with some of them. And currently, I think it is best to use the following scheme. I am curious what you think about this design. If we are on the same page, I can start writing and make a pull request.

1. Python takes messages one at a time, via a pipe.

This design is currently adopted by py4cl and py4cl2. It is not message oriented, but it is really fast:

(time (loop repeat 10000 do (py4cl2:pyeval 0))) 
Evaluation took: 0.171 seconds of real time 

and it seems to be robust enough after (commit: 482d4b).
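A minimal sketch of what "one message at a time, via a pipe" looks like on the python side. The newline-delimited framing, the `r`/`e` reply prefixes, and the name `serve` are assumptions made for illustration; this is not py4cl2's actual wire protocol.

```python
import io

def serve(inp, out):
    """Read one expression per line, evaluate it, write one reply per line."""
    for line in inp:
        expr = line.strip()
        if not expr:
            continue
        try:
            reply = "r" + repr(eval(expr, {}))   # 'r' marks a return value (assumed convention)
        except Exception as exc:
            reply = "e" + repr(exc)              # 'e' marks an error (assumed convention)
        out.write(reply + "\n")
        out.flush()  # the peer is blocked until it sees this reply

# In a real setup inp/out would be the process's stdin/stdout pipes;
# in-memory streams stand in for them here.
inp = io.StringIO("0\n1 + 2\n1 / 0\n")
out = io.StringIO()
serve(inp, out)
replies = out.getvalue().splitlines()
```

Because each request is answered before the next one is read, a slow request stalls everything queued behind it, which is the limitation the next section addresses.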

I have tried making a python HTTP server with FastAPI; it handles roughly 200 dummy requests per second (much slower than the above). I also thought about using ZeroMQ, which is widely claimed to be very fast and message oriented. Three issues are:

  1. The current way is very fast and seems to be robust enough.
  2. It won't be faster than the current plain method as it introduces some abstractions.
  3. cl-zmq and pzmq do not work out of the box on my machine (mac air M2, brew, roswell), and I did not find an easy way to solve it.

Therefore, I suggest we keep the current design in this department.

2. Implement the option for python to execute in threads.

Currently in both py4cl and py4cl2, python executes the code synchronously. It is fine when the computation time is short. However, it is not optimal for more computationally complex requests e.g. (py4cl2:pyexec "import time; time.sleep(10)").

I suggest the following model for parallel execution:

  1. Expose Lisp the option to make an async execution.
  2. Upon receiving an async execution request, python creates a random FIFO, returns the FIFO path to Lisp, and launches a thread that executes the code and writes the result to that FIFO.
  3. In the meanwhile, that lisp call would listen to the FIFO, and return after reading from it.
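The three steps above can be sketched as follows. This is only an illustration under assumptions: the helper names `start_async` and `wait_result` are hypothetical, the result is passed as a plain `repr` string, and `os.mkfifo` is POSIX only.

```python
import os
import tempfile
import threading

def start_async(code, env):
    """Create a FIFO, start a thread that evaluates `code`, return the FIFO path."""
    path = os.path.join(tempfile.mkdtemp(), "result.fifo")
    os.mkfifo(path)  # POSIX only

    def worker():
        try:
            result = repr(eval(code, env))
        except Exception as exc:
            result = "error: " + repr(exc)
        # open() blocks here until the reader opens the other end
        with open(path, "w") as fifo:
            fifo.write(result)

    threading.Thread(target=worker, daemon=True).start()
    return path  # handed back to the caller immediately

def wait_result(path):
    """What the Lisp side would do: block on the FIFO until the result arrives."""
    with open(path) as fifo:
        data = fifo.read()  # returns once the writer closes its end
    os.remove(path)
    os.rmdir(os.path.dirname(path))
    return data

path = start_async("sum(range(10))", {})
answer = wait_result(path)
```

The main python loop is free again as soon as `start_async` returns; only the caller that reads the FIFO waits.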

Things get complicated here so please note the following: The design above is to make python not block, but the Lisp call still blocks! Namely, the Lisp call still has to wait for the computation of the python thread before it returns. That means if python takes 10 seconds for that execution, the Lisp call will not return for at least 10 seconds either.

You may ask why I choose this design given that our goal is parallel execution. Well, I choose this because of orthogonality. Please note that while this still blocks on the Lisp side, it doesn't block on the python side. And it would be easy to make it non-blocking on the Lisp side - simply fire this async call from a lisp thread!
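The orthogonality argument can be illustrated with a small analogy, written in Python rather than Lisp: a call that blocks its caller becomes non-blocking once the caller wraps it in a thread of its own. `blocking_call` is a hypothetical stand-in for the blocking async request described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(seconds):
    """Stand-in for a call that only returns once the remote work is done."""
    time.sleep(seconds)
    return "done"

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(blocking_call, 0.2)  # returns immediately
    pending = not future.done()                   # the caller is free to do other work
    result = future.result()                      # block only when the value is needed
```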

digikar99 commented 1 year ago

3. Multiple python processes managed by lisp

So, this is another idea. We keep multiple python processes managed by lisp, similar to how lparallel maintains a kernel (thread-pool?), and lisp calls into whichever python process is free at that point of time. This feels doable with minimal changes to how the calls are handled.
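A rough sketch of such a pool, written in Python for brevity although the manager would live on the lisp side. The class `ProcessPool`, the one-line worker loop, and the line-based framing are all assumptions for illustration.

```python
import queue
import subprocess
import sys

# Each worker evaluates one expression per input line and prints its repr.
WORKER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(repr(eval(line)), flush=True)\n"
)

class ProcessPool:
    def __init__(self, size):
        self.idle = queue.Queue()
        self.procs = []
        for _ in range(size):
            p = subprocess.Popen([sys.executable, "-c", WORKER],
                                 stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                                 text=True)
            self.procs.append(p)
            self.idle.put(p)

    def pyeval(self, expr):
        p = self.idle.get()          # blocks until some worker is free
        try:
            p.stdin.write(expr + "\n")
            p.stdin.flush()
            return p.stdout.readline().strip()
        finally:
            self.idle.put(p)         # hand the worker back to the pool

    def close(self):
        for p in self.procs:
            p.stdin.close()          # EOF ends the worker loop
            p.wait()
            p.stdout.close()

pool = ProcessPool(2)
results = [pool.pyeval("%d * 2" % i) for i in range(4)]
pool.close()
```

Calling `pyeval` from several threads at once would spread the requests over the idle workers; the caller never chooses a process explicitly.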

I wonder if @bendudson had something along these lines in mind when the variables *python* and *current-python-process-id* were introduced.

The main issue with this approach would be if data needs to be maintained on the python side - but I believe even python's multiprocessing approach runs into similar issues, given my very naive experience of it the last time I had attempted it.

jcguu95 commented 1 year ago

Thanks for the feedback. We can achieve this quickly. However, there are two drawbacks that I’d like to address.

  1. I’m not sure about your use case. However, as far as I know, many people use python to do data analysis. There, computations are often very long and come in multiple stages. That means it’d be handy if an intermediate result could be stored somewhere in the process’s memory. With different python processes we can indeed achieve multiprocessing. But then the user still has to take care of the communication among python processes. The more inter-process communication there is, the less efficient the total computation would be.

  2. Sometimes it takes time to load packages and required data into a python process. The overhead time could accumulate fast.
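The first drawback is easy to reproduce: state created in one python process is simply absent in another. A small demonstration, where `run_fresh` is a hypothetical helper that starts a new interpreter per call:

```python
import subprocess
import sys

def run_fresh(code):
    """Run `code` in a brand-new python process; return (stdout, returncode)."""
    done = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return done.stdout.strip(), done.returncode

# Stage 1 stores an intermediate result in process A's memory.
out_a, rc_a = run_fresh("data = [1, 2, 3]; print(sum(data))")
# Stage 2 lands in process B, which never saw `data` and fails with NameError.
out_b, rc_b = run_fresh("print(sum(data))")
```

Any scheme that spreads calls over several processes has to either pin related calls to one process or ship the intermediate data between them.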

—-

Conclusion: I would say your proposal is a great idea. And we should definitely make it easier to use (I can help do that too). But to make it general, we gotta allow multiprocessing in python. I know what to do, and if you align with that line, I can start implementing it this weekend. :)

BTW Thanks for your prompt responses and thanks for merging my PRs so quickly :D

bendudson commented 1 year ago

Thanks @jcguu95 and @digikar99 this is an interesting discussion. I was thinking a bit about multiple python instances with the async call (https://github.com/bendudson/py4cl#asynchronous-python-functions-python-call-async) but never thought it through carefully.

If multiple python instances are managed then the lisp caller needs to keep track of the state of each one, using a handle to represent each python instance. That complicates importing python functions or modules: when running (np:zeros 10) which python instance should be used? A more ambitious approach would be to try and implement a worker pool, but then efficiently managing where data is stored becomes complicated.

jcguu95 commented 1 year ago

Thank you @bendudson for raising another good point: The user should not care which underlying python process/thread to use, since most of the time they only want computation power. With the multiple python processes approach, the user has to do it manually. This can be handled more easily with a single python process with multiple python threads.

digikar99 commented 1 year ago

If the multiple python processes approach is to be considered, indeed, something analogous to a worker pool would be the first attempt. Anything less than that would be inconvenient. The lisp caller will need to keep track of it, but the user won't be required to.

In large part, I have used py4cl/2 to operate on lisp-data, with the python process only acting as temporary storage. However, if users do use the python process as a means of primary storage, then certainly the multiple processes approach would be a headache, and multiprocessing-within-python (or multithreading?) would be the way to go.

jcguu95 commented 1 year ago

Thanks for both of your inputs and ideas :) If our ideas align, please confirm and I will start writing them this weekend. I appreciate your help and effort!

digikar99 commented 1 year ago

Being more familiar with lisp, I feel the multiple processes approach would be easier - and perhaps even better in performance - and perhaps lparallel might even have ready-made tools we can put to use.

But the multiprocessing-within-python would be better in terms of it covering the use cases where users use the python process as the primary storage and transfer minimal data to lisp.

So, I do feel conflicted about it; feel free to toss a coin or three to decide which method to give the first try.

jcguu95 commented 1 year ago

Indeed, the two methods fulfill two different things. I know the tools on both sides, and can implement both. To avoid confusion, we can also write good documentation and usage examples afterwards. I believe py4cl2 deserves more attention than it currently has.

========================================================

EDIT I realized that there's an ambiguity in the idea

  3. Multiple python processes managed by lisp

Namely, how much should the users care about the fact that there are multiple python processes?

(A) If the users should be fully aware, then it won't be hard to implement. We simply have to maintain a list of running python processes, and let the user eval with a specific python process.

(B) However, if the users should not care about how many underlying python processes there are, I'd say it is very hard to achieve. For example, how should it behave when the users import a module? Should we import it in all python processes? For another example, when there's a python foreign object, should we share copies of it to other python processes? Or should the system be smart enough to talk to a specific python process for a specific foreign object?

Being more familiar with lisp, I feel the multiple processes approach would be easier

Handling multiple python processes with premise (B) is hard, and will involve many different assumptions. I feel because we are more familiar with lisp, we shouldn't go for premise (B).

digikar99 commented 1 year ago

If the multiple python processes approach is to be considered, I'd ideally like to go for premise (B). However -

how should it behave when the users import a module? Should we import it in all python processes?

For modules or functions defined using defpymodule or defpyfun, let it be imported in all the python processes.

Or should the system be smart enough to talk to a specific python process for a specific foreign object?

Talking to a specific python process would be an incredibly hard problem as far as I see.

when there's a python foreign object, should we share copies of it to other python processes?

Essentially, we could have a dynamically bound variable *current-python-process* (or an existing variable) which when bound to a process-object, code gets executed only in that process. However, raw-pyexec emits a warning* that the code is being executed only in a single process. raw-pyeval should not have any trouble if raw-pyexec is handled correctly I think.

Otherwise, if *current-python-process* is bound to t, code gets executed in all the python processes.

And there will be another parameter *python-process-pool-size* that the user can set.

*Whether to emit a warning can again be controlled by another dynamically bound variable.
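A toy model of the dispatch rule described above, in Python. `FakeProcess` is a stand-in for a real python process (just a private namespace), and the `current`/`warn` parameters play the role of the dynamically bound variables; all of these names are hypothetical.

```python
import warnings

class FakeProcess:
    """Stand-in for one python process: just a private namespace."""
    def __init__(self):
        self.ns = {}

    def pyexec(self, code):
        exec(code, self.ns)

ALL = True  # plays the role of binding the variable to t

def raw_pyexec(pool, code, current=ALL, warn=True):
    """Run `code` in every process, or only in `current` (with a warning)."""
    if current is ALL:
        targets = pool
    else:
        targets = [current]
        if warn:
            warnings.warn("code executed in a single process only")
    for proc in targets:
        proc.pyexec(code)

pool = [FakeProcess() for _ in range(3)]
raw_pyexec(pool, "import math")                 # lands in every process
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    raw_pyexec(pool, "x = 1", current=pool[0])  # lands in one process only
```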

PS: There was some recent reddit discussion about PyFFI and equivalent in Racket and Scheme. Their seeming simplicity compared to burgled-batteries made me attempt a CFFI approach once more, and we have a very rudimentary py4cl2/cffi system working here. A lot needs to be done, but if performance is a concern for py4cl2, then this could be the way to go forward!

jcguu95 commented 1 year ago

Essentially, we could have a dynamically bound variable *current-python-process* (or an existing variable) which when bound to a process-object, code gets executed only in that process.

What if foreign objects X and Y are stored in python-process px and py respectively, and if later a certain computation needs both X and Y?
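To make the question concrete, here is a toy model in which every handle remembers the process that owns its object. `FakeProcess`, `add`, and `transfer` are hypothetical names; the point is only that combining X and Y forces either a failure or an explicit copy between processes.

```python
import pickle

class FakeProcess:
    """Stand-in for one python process holding foreign objects."""
    def __init__(self, name):
        self.name = name
        self.objects = {}          # key -> object living in this process

    def store(self, key, obj):
        self.objects[key] = obj
        return (self.name, key)    # a handle remembers its owner

def add(proc, h1, h2):
    """Compute h1 + h2 inside `proc`; both handles must be owned by it."""
    for owner, _ in (h1, h2):
        if owner != proc.name:
            raise LookupError("object lives in %r, not in %r" % (owner, proc.name))
    return proc.objects[h1[1]] + proc.objects[h2[1]]

def transfer(src, dst, handle):
    """Copy an object across by serializing it, as real processes would have to."""
    payload = pickle.dumps(src.objects[handle[1]])
    return dst.store(handle[1], pickle.loads(payload))

px, py = FakeProcess("px"), FakeProcess("py")
X = px.store("X", [1, 2])
Y = py.store("Y", [3])
try:
    add(px, X, Y)
    failed = False
except LookupError:
    failed = True                  # Y is not reachable from px
total = add(px, X, transfer(py, px, Y))
```

The copy step is where the cost and the ambiguity hide: it only works for objects that can be serialized, and after it the two processes hold independent copies.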

made me attempt a CFFI approach once more

Great! Thanks for sharing the link, but they did not seem to provide much insight in the comment, except a link to a paper. I'm interested in reading it. I've also seen that it's much faster to use CFFI, and am very excited about your profiling result.

However, I will be very occupied this Thursday and Friday, so I would have to postpone reading it to the weekend. (EDIT I read it, and watched two of their online talks.) If you are willing to, would you share some of what you learned from that. In particular, for example,

  1. In which way it is simpler than burgled-batteries? What didn't work for you when you tried burgled-batteries?
  2. Roughly how much work needs to be done in py4cl2/cffi?
  3. How do they handle parallel executions, and objects sharing like I mentioned above in this post? (EDIT I read through their paper and watched two online talks. It doesn't seem that they take care of this issue. However, I've also asked them on reddit (the link you posted). Hopefully they could provide some insights about this.)

.. etc.

Thank you very much 😃

digikar99 commented 1 year ago

foreign objects X and Y are stored in python-process px and py respectively

I had initially thought that the only way such objects can exist is if raw-pyexec was called, and for this we were going to issue a warning. However, turns out that unknown lisp objects generated by raw-pyeval will also be amongst these for which disambiguation would be necessary. So, this multiple python processes approach does seem like a dead-end except in the most trivial cases.

jcguu95 commented 1 year ago

I’m afraid so, yes.. I’m not worried about implementing threads in python. So should we go with 1+2?

Before we move on, I’d also like to hear more about your idea on CFFI and b-batteries. We certainly don’t want to re-invent what batteries had already.

digikar99 commented 1 year ago

If you are willing to, would you share some of what you learned from that.

I myself haven't read the paper yet. I shared it in the hope that you (or a future passerby on this thread) find it interesting. There might also be equally interesting resources a few hops away, e.g. the PyFFI racket library itself, or the gambit-scheme equivalent. But, also

  3. How do they handle parallel executions, and objects sharing like I mentioned above in this post?

I'll wait for u/belmarca to reply, although I don't think there is any trivial way to handle python-(true)multithreading or multiprocessing using the CFFI approach. Doing so seems like the equivalent of coming up with a cpython version without the GIL, which I assume is a problem people have tried to solve for ages.

  1. In which way it is simpler than burgled-batteries?

At this point in time, I am unable to load burgled-batteries. I do not understand cffi-grovel. But I did find a nice article about embedding python in C programs which I found relatively straightforward to translate to the lisp world - and the result is py4cl2/cffi.

  2. Roughly how much work needs to be done in py4cl2/cffi?

For someone familiar with C, python, and lisp, and willing to dig into CPython's C API, and especially with nicer articles such as the last one, I'd estimate one will need 2-4 dedicated weeks aka 75-150 hours to get py4cl2/cffi to the level of current py4cl2.

jcguu95 commented 1 year ago

Threading

I don't think there is any trivial way to handle python-(true)multithreading or multiprocessing using the CFFI approach.

belmarca did reply to my first comment on reddit in that thread, but not the second one. We can still wait for them to confirm, but for now I'd pause and really think through what options we have in terms of multiprocessing.

Their craft allows Gambit Scheme and Python to talk to each other back and forth (e.g. you can do something like (let ((x 3)) (py-call "1+\x")) ;; => 4), whereas in the py4cl model it seems that CL is dominant while python is inferior (please correct me if I'm wrong). If we want to adopt this, then we would need to understand what they do in terms of threading (~half to 1 page in their paper).

burgled-batteries

At this point in time, I am unable to load burgled-batteries.

Me neither.. but I think it makes more sense to understand what it does first, if we really want to adopt CFFI, to avoid reinventing the wheel. Perhaps, in the end, we will also need to use cffi-grovel..

CFFI and py4cl2

I'd estimate one will need 2-4 dedicated weeks aka 75-150 hours to get py4cl2/cffi to the level of current py4cl2.

I'm a bit confused here. If all we need for the CFFI method is a faster way for CL and python to communicate, can't we just make that bridge with CFFI, and leave the rest untouched? I mean, that way it won't take long to make that bridge, and we can still leverage what you and @bendudson have done. Am I missing anything?

digikar99 commented 1 year ago

I have nothing to add to the sections on Threading and burgled-batteries, I agree with you.

About cffi and py4cl2: My current understanding is that there are four ways of calling python in the CFFI approach.

  1. Start an embedded python REPL
  2. Run a python file
  3. eval/exec arbitrary lines of code
  4. Call a python function or manipulate python objects.

The first 2 do not do anything useful for our task. The third is useful. However, if there is an error, one needs to capture that error and show the traceback. If something is printed, one needs to look into file descriptors to see where the output is exactly going. And perhaps there are other issues as well.
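The error and output concerns can be illustrated from the python side. This is a sketch under assumptions: `run_captured` is a hypothetical helper, and `contextlib.redirect_stdout` only catches writes that go through python's `sys.stdout`, not output written directly to the file descriptor by C code.

```python
import contextlib
import io
import traceback

def run_captured(code, env):
    """exec `code`; return (captured stdout text, traceback text or None)."""
    buf = io.StringIO()
    tb = None
    with contextlib.redirect_stdout(buf):
        try:
            exec(code, env)
        except Exception:
            tb = traceback.format_exc()  # full traceback as a string
    return buf.getvalue(), tb

out_ok, tb_ok = run_captured("print('hello')", {})
out_bad, tb_bad = run_captured("print('before'); 1 / 0", {})
```

With an embedded interpreter, the same wrapper could be defined once on the python side so that the lisp side only has to fetch two strings back.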

3 alone can boost the speed by 5-10 times. But it doesn't look too hard to extend things to do 4, and that gets another boost of 5-10 times, completely avoiding the use of eval as well as stream communication. At this point, all that is exactly happening are function calls and object allocation/translation. And we could provide an option to turn off translation using the equivalent of (with-)remote-objects.

jcguu95 commented 1 year ago

Thanks for the feedback. I think I understand a little bit more about the CFFI approach.

Essentially, it lets us embed a python in our lisp. From that, we can do either "shallow integration" or "deep integration".

Shallow Integration

By "shallow integration", I mean that python is merely an embedded process in lisp. What's great here is that communication (via strings, still) does not have to cross between two system processes, and therefore could be faster (e.g. as you did in the following).

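;; Assumes CFFI is loaded and libpython has already been opened and initialized.
(defun raw-py (code-string)
  "Run CODE-STRING in the embedded interpreter. Returns 0 on success, -1 on error."
  (cffi:with-foreign-string (str code-string)
    (cffi:foreign-funcall "PyRun_SimpleString" :pointer str :int)))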

This way, the only thing we will change essentially is the communication method. We can still imagine that we have two different processes, communicating via strings under a certain protocol. And this should mean that we do not have to spend as much time as you estimated above:

75-150 hours to get py4cl2/cffi to the level of current py4cl2.

Deep Integration

This blurs the boundary between python and lisp. They no longer have to communicate via strings as if they are two processes, but there are low level bridges that convert lisp data types to python data types. I believe this is what burgled-batteries did, but I wonder if that's necessary - it seems to make the design much more complicated and less modular. To me, what I am really hoping for is a simple protocol and a communication-style agnostic way for two languages to communicate. Going for deep integration means that we need to write more CFFI code, which makes the protocol heavier and mixes the protocol module and communication-style module together.

Burgled-batteries

I mentioned that we have to understand what BB is doing before adopting the FFI method. However, as I mentioned above, if we are down for shallow integration, we do not have to. And if we are going for deep integration, not only do I think it's much more complicated, I also think it's a little bit pointless, since that's what BB (or BB3) has been trying to achieve. In that case, it'd make more sense to me to fix and extend BB, instead of reinventing everything on our own.

In the hopefully unlikely condition where we decide to go for BB, there are at least some cases to fix:

  1. CFFI-grovel seems to be a mess. I've read from several places how unreliable it is. Even BB's main asd file mentioned that - it also provides a way to turn off the dependency on grovel. Maybe you'd want to try.

  2. Since grovel doesn't work for me out of the box, and I don't know how to fix it, I turned it off. But then, the guess.lisp file gets loaded, which only supports x86 but not darwin (I'm on macos).

digikar99 commented 1 year ago

Given the discussion here, it seems burgled batteries aimed for an even "deeper" integration.

My current understanding and a little experience playing around is that the jump between the "shallow" integration and "deep" integration is small. Getting the shallow integration working seems like the harder part, as well as the jump between "deep" and "deeper". For instance, less than 10 hours in, I am able to translate (by value) integers, floats, tuples/lists, lists/vectors very easily. The code in pythonizers.lisp and lispifiers.lisp is fairly straightforward, compared to the code in python-process.lisp and callpython.lisp.

jcguu95 commented 1 year ago

Thanks for pointing me to the link to the discussion.


Getting the shallow integration working seems like the harder part, as well as the jump between "deep" and "deeper".

The discussion you showed actually says there are a lot of gotchas if one wants to go for "deeper" integration. Translations of some data types may be easy, but others are difficult (mentioned therein).


I'd like to clarify where you envision we are heading.

CL & PY on an equal footing?

By "on equal footing" I really just mean to treat PY as an independent component that can also freely send requests to CL (e.g. how they did it for Gambit-Python FFI, linked above). Do you want CL and PY to stand on an equal footing? Or do you want PY to remain "inferior" as it is currently designed?

(I prefer equal footing, as it makes the final product more flexible.)

Maintain multiple ways of communication?

Here we already have 3 different ways for CL and PY to communicate:

  1. unix pipe (current design)
  2. socket, http server.. etc
  3. intraprocess communication (with CFFI and embedded python)

Do you want to maintain all of them, or are you certain that we will definitely pick some of them?

(I prefer maintaining all of them, partly because we don't know which one is the best solution yet, and partly because this will force us to separate the component of communication and messaging protocols.)
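A small sketch of what separating the messaging protocol from the communication channel could look like. The JSON message shape, `InProcessTransport`, and the function names are assumptions for illustration only; a pipe or socket transport would expose the same `request` method and reuse `encode`/`decode` unchanged.

```python
import json

# --- protocol: how a request or reply is represented (assumed JSON shape) ---
def encode(kind, payload):
    return json.dumps({"kind": kind, "payload": payload})

def decode(message):
    obj = json.loads(message)
    return obj["kind"], obj["payload"]

# --- transport: how an encoded message travels ---
class InProcessTransport:
    """Simplest possible transport: call the handler directly."""
    def __init__(self, handler):
        self.handler = handler

    def request(self, message):
        return self.handler(message)

# --- python side: knows the protocol, not the transport ---
def handle(message):
    kind, payload = decode(message)
    if kind == "eval":
        return encode("return", eval(payload, {}))
    return encode("error", "unknown kind: " + kind)

# --- caller side: works with any object that has a request() method ---
def pyeval(transport, expr):
    kind, payload = decode(transport.request(encode("eval", expr)))
    if kind != "return":
        raise RuntimeError(payload)
    return payload

value = pyeval(InProcessTransport(handle), "[1, 2] * 2")
```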

digikar99 commented 1 year ago

I'd like to clarify where you envision we are heading.

I won't say I have a very clear vision of where this will go. The primary thing I'd like to get working is being able to use "functions" from python libraries with minimal overhead, so that we can use numpy, tensorflow, matplotlib, scipy, and perhaps other similar libraries from python. What I have no current plans for is the ability to integrate the CL and python classes - this eliminates our ability to use something like PyTorch. I haven't much touched anything beyond this scientific computing ecosystem in the python world. Perhaps a few web-scraping libraries like beautiful-soup should also work.

I do not know what will work, what won't, what will be easy to get done, what won't be. I'm simply hoping to discover them as I try them. Certainly, there will be some things we can learn from the successes and failures of other projects. If something makes me think "this should obviously be done this way? why is it not being done this way?" and it is trivial to implement, I implement it. If it is too lengthy/tedious to be implemented that way right away, depending on how badly I/someone needs it, I might still implement it, or if there's someone around to ask, I might ask. If a feature doesn't look trivial/quick to implement and is just "nice to have, but not currently needed", I might leave it as a TODO until someone demonstrates an "I need it" use case.

equal footing vs inferior-python

I would be keeping the current design. cl4py is the complementary project (and the project that inspired py4cl!) that provides lisp libraries to python. I fail to see anything obvious and especially anything portable across different CL implementations (at least SBCL, CCL) that could provide an equal integration between lisp and python.

Do you want to maintain all of them, or are you certain that we will definitely pick some of them?

If the CFFI approach reaches feature parity with the existing py4cl2, I will put py4cl2 in maintenance mode, fixing bugs if and when someone reports them, and focus my time on py4cl2/cffi. In case py4cl2/cffi never reaches feature parity, then both will continue to be developed side-by-side.

burgled-batteries

I attempted to load it by disabling groveling. It still ran into an error, suggesting I might have to try with an older SBCL release. I finally managed to load it using the SBCL 1.4.5.debian provided with Ubuntu 18.04. It seems that much of the heavy lifting in burgled batteries is happening in ffi-interface.lisp. It provides a great many functions and macros. My current CFFI approach has sidestepped this by directly using the bare CFFI functions. I do not know if I will need the intermediate layer - I'd prefer avoiding it if possible, or try relying on cl-autowrap if I do find it necessary.