chaquo / chaquopy

Chaquopy: the Python SDK for Android
https://chaquo.com/chaquopy/
MIT License

Parallel execution of python code on multiple CPU cores #1251

Open HeikoWanzl opened 1 week ago

HeikoWanzl commented 1 week ago

I have a question regarding parallel execution of python code on multiple CPU cores:

We are working on an Android project that has the bulk of its code in Python using Chaquopy. To improve performance, we would like the pipeline components we have written in Python to run truly in parallel, meaning on different CPU cores. The multiprocessing package doesn't seem to work with Chaquopy, so we are wondering what options we have to achieve this.

Our initial approach was to create multiple threads in Java and execute Python code in each, but they all seem to access the same Python instance: the second call can even change variables used by the first. Is this code actually executed in parallel and just sharing the same variable space, or is it, as we assume, still executed on the same core?

Is it just not possible to do this using Chaquopy? Is there anything we are missing, or anything else we could try?

Thank you for your time!

mhsmith commented 1 week ago

Chaquopy only supports one instance of Python at a time per app. However, each instance does support multiple threads on multiple cores to exactly the same extent as Python on any other platform.
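
As an illustration of what "one instance" means for threads, here is a minimal sketch in plain Python (nothing in it is Chaquopy-specific, and the names `shared`, `local`, `run` and `demo` are made up for this example). Module-level objects are visible to every thread, which matches the observation that a second call can change variables used by the first, while `threading.local()` gives each thread its own private attributes.

```python
import threading

shared = {"owner": None}       # one object, visible to every thread
local = threading.local()      # attributes stored here are private to each thread


def run(name, seen):
    shared["owner"] = name     # overwrites whatever another thread stored
    local.owner = name         # only the current thread can see this value
    seen[name] = local.owner   # record what this thread sees in its own storage


def demo():
    seen = {}
    threads = [threading.Thread(target=run, args=(n, seen)) for n in ("first", "second")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # shared["owner"] holds whichever thread wrote last; seen holds per-thread values
    return seen, shared["owner"]


if __name__ == "__main__":
    print(demo())
```

If state must not leak between threads, keeping it in function arguments, per-thread objects or `threading.local()` avoids the problem regardless of which core the threads run on.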

In particular, this means that the global interpreter lock (GIL) will only allow Python code to execute on one core at a time, and will switch between threads regularly. Whenever a Python thread calls native code (e.g. in NumPy) or Java code, or waits for a network or storage device, that also gives another thread an opportunity to run. So whether the GIL causes an actual performance problem will depend on the details of your app.

HeikoWanzl commented 1 week ago

I'm having a hard time figuring out how to work around this. Let me bother you one last time. :)

We have a data processing pipeline implemented in Python (A-B-C-D->). A is a camera that creates the data to be processed, B/C/D can work independently, but take inputs from the previous component and give output to the next component. The earlier components are faster than the later ones, so we would benefit from letting the earlier components process new data while the later ones process the outputs of the earlier ones. Everything is run using one module.callAttr() call and runs forever. Currently the components all seem to take turns on the same core, as you described, so we are looking for ways to run them concurrently.

Would the only way to do this be to implement some or all of them in Java/C++? If some components remain in Python, what would be the best way to hand over the data between the components (for example if A, B and D are in Python, but C is in Java)?

Should we hand over the Java Object for component C at the time of the module.callAttr() call of the whole Python code or create the Java Object from Python using the API? Or should we hand over the data using something like sockets?

mhsmith commented 1 week ago

First you should verify that you really have a concurrency problem. For example, you could time running each of the components separately with some pre-generated data, and then time running them all at the same time. If running them all at the same time is not much faster than the sum of running them all separately, then you're not taking advantage of the multiple cores. But remember that no matter how concurrent your design is, the pipeline can't run any faster than its slowest component.
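
A rough sketch of that measurement, assuming each component can be called as a plain function on pre-generated data. The `busy` function is a hypothetical stand-in for a real stage (pure-Python, CPU-bound work), and the helper names are invented for this example.

```python
import threading
import time


def time_separately(components, data):
    """Run each component on its own and return its wall-clock time in seconds."""
    timings = {}
    for name, func in components.items():
        start = time.perf_counter()
        func(data)
        timings[name] = time.perf_counter() - start
    return timings


def time_together(components, data):
    """Run all components at once, one thread each, and return the total wall-clock time."""
    threads = [threading.Thread(target=func, args=(data,)) for func in components.values()]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def busy(data):
    """Hypothetical stand-in for a pipeline stage: pure-Python, CPU-bound work."""
    total = 0
    for x in data:
        total += x * x
    return total


components = {"B": busy, "C": busy, "D": busy}
data = list(range(200_000))

separate = time_separately(components, data)
together = time_together(components, data)
print(f"sum of separate runs: {sum(separate.values()):.3f}s, all together: {together:.3f}s")
```

With pure-Python stages like `busy`, the two numbers would be expected to come out close to each other because of the GIL. Stages that spend most of their time in native code may behave differently, which is what the measurement is meant to reveal.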

> If some components remain in Python, what would be the best way to hand over the data between the components?

How are you handing it over now, in Python?

HeikoWanzl commented 1 week ago

> For example, you could time running each of the components separately with some pre-generated data, and then time running them all at the same time.

That's a test we can do, thank you. Currently we time each component as part of the whole, but we're assuming that some of the components are affected by the scheduling more than others, so that isn't really conclusive.

> How are you handing it over now, in Python?

Yeah, strictly in Python.

mhsmith commented 1 week ago

> > How are you handing it over now, in Python?
>
> Yeah, strictly in Python.

That wasn't a yes or no question. HOW are you handing it over? You said you were running each component in a separate thread, so it can't be a simple function call.

HeikoWanzl commented 1 week ago

We're using threads and queues. Under Windows/Linux we're using multiprocessing and queues.
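
For readers following along, a threads-and-queues pipeline of this kind might look roughly like the sketch below. This is not our actual code: the stage functions are placeholders, and the `STOP` sentinel and `maxsize` value are assumptions made for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(func, inbox, outbox):
    """Take items from inbox, apply func, and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # forward the shutdown signal to the next stage
            break
        outbox.put(func(item))


def run_pipeline(frames, funcs, maxsize=4):
    """Run each function in funcs as its own thread, connected by bounded queues."""
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(funcs) + 1)]
    for i, func in enumerate(funcs):
        threading.Thread(
            target=stage, args=(func, queues[i], queues[i + 1]), daemon=True
        ).start()

    def feed():
        # Plays the role of the source (component A); blocks when the first queue is full.
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(STOP)

    threading.Thread(target=feed, daemon=True).start()

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results


if __name__ == "__main__":
    # Placeholder stages standing in for B, C and D.
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]))
```

The bounded queues keep a fast early stage from running arbitrarily far ahead of a slow later one.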

mhsmith commented 4 days ago

You can use the Chaquopy Java and Python APIs to access a Python queue from Java/Kotlin, or vice versa.
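
One possible shape for the Python side, as a sketch only: a module that owns the queues and exposes small functions for the other side to call. The function names and the `process` step are hypothetical, and how Java arguments are converted on the way in is not shown here.

```python
import queue
import threading

frames = queue.Queue(maxsize=8)  # assumed to be filled from the Java/Kotlin side
results = queue.Queue()          # drained by whoever wants the processed output


def put_frame(frame):
    """Hand one item to the Python worker; blocks while the queue is full."""
    frames.put(frame)


def get_result(timeout=None):
    """Return the next processed item, waiting up to timeout seconds."""
    return results.get(timeout=timeout)


def process(frame):
    """Hypothetical placeholder for a real pipeline component."""
    return len(frame)


def _worker():
    while True:
        results.put(process(frames.get()))


def start():
    """Start the background worker thread."""
    threading.Thread(target=_worker, daemon=True).start()
```

From Java or Kotlin, functions like `put_frame` and `get_result` would be reached with `callAttr` on the module object, the same mechanism as the `module.callAttr()` call mentioned earlier in this thread. Going the other way, Python code can also hold a reference to a Java queue object passed in as an argument.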