andrewyng / aisuite

Simple, unified interface to multiple Generative AI providers
MIT License

Request for Python Asyncio Support #61

Open BobbyL2k opened 6 hours ago

BobbyL2k commented 6 hours ago

I would like to request support for Python’s asyncio in this library. This feature would be particularly beneficial for Python services, which often rely on asynchronous programming for efficient and scalable operations.

Some providers, such as OpenAI, already offer native async support (e.g., from openai import AsyncOpenAI), making it straightforward to wrap these APIs. Others, like AWS, have community-supported async wrappers, such as aioboto3. For providers without async support, an interim solution using a synchronous wrapper could be implemented while awaiting a proper asyncio implementation.
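For illustration, a minimal sketch of that interim shim, assuming aisuite's current synchronous `Client` and using `asyncio.to_thread` to keep the event loop unblocked (the `acreate` helper name is made up, not a proposed API):

```python
import asyncio

import aisuite as ai


async def acreate(client: ai.Client, model: str, messages: list[dict]):
    # Interim shim: run the blocking provider call in a worker thread
    # so the event loop stays free while the request is in flight.
    return await asyncio.to_thread(
        client.chat.completions.create, model=model, messages=messages
    )


async def main():
    client = ai.Client()
    response = await acreate(
        client, "openai:gpt-4o", [{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)


asyncio.run(main())
```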

Asyncio support would greatly enhance the usability of this library. Thank you for considering this enhancement.

sarthakforwet commented 4 hours ago

Can you please assign this issue to me?

soulcarus commented 4 hours ago

I refactored the code to use a thread pool instead of asyncio.

Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.

With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.

For more details, you can check the full implementation here: Pull Request #64.
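For reference, the fan-out pattern looks roughly like this (a minimal sketch, not the actual code from the PR; the model list and prompt are made up):

```python
from concurrent.futures import ThreadPoolExecutor

import aisuite as ai

client = ai.Client()
models = ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-20240620"]
messages = [{"role": "user", "content": "Hello!"}]


def infer(model):
    # Each call blocks on network I/O, so running them in separate
    # threads lets the requests overlap instead of running sequentially.
    return client.chat.completions.create(model=model, messages=messages)


with ThreadPoolExecutor(max_workers=len(models)) as pool:
    responses = list(pool.map(infer, models))
```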

soulcarus commented 4 hours ago

Oh, I also wrote a full document (1 page) in which I explain why I THINK it is better to use threads rather than asyncio in this case

https://docs.google.com/document/d/17kESXXEUkA0gwc6qksFnZ2i5IjCP3Nk7-CH6sjnsgIE/edit?usp=sharing

oraclesystem commented 4 hours ago

Looks good and right, congrats.

I would approve this PR, but only with more changes and more detail in it.

soulcarus commented 4 hours ago

Thanks! I also think it's the better approach.

About the PR, yes, I also said so in the description: if the contributors think it's OK, I'll make it more self-explanatory and finish the feature.

Thanks for the feedback ;)

chiyiliao commented 3 hours ago

When there are 1,000~10,000 simultaneous requests, which will perform better: the thread architecture or the asyncio architecture?

soulcarus commented 3 hours ago

Handling 10,000 simultaneous requests can indeed approach the scale of a DDoS for some infrastructures, depending on their capacity and setup. However, if your system can handle this volume without triggering any limits, asyncio would likely be the better choice in terms of efficiency and scalability for managing such high concurrency.

My proposal for a thread-based solution was designed with smaller-scale scenarios in mind as an initial improvement. For example, if you are working with 30 models, this approach can process responses in approximately 3 seconds on average instead of waiting for each model to return sequentially, which would take around 90 seconds.

While an asynchronous client implementation has already been developed by someone else, providing a great solution for large-scale use cases, I opted for a threading approach to achieve significant performance gains with minimal effort and complexity. For smaller workloads, or as a stepping stone toward further optimization, threads strike a practical balance between simplicity and performance.
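As a sketch of what that large-scale asyncio path could look like, the usual pattern is to cap concurrency with a semaphore. This assumes some async `acreate(client, model, messages)` coroutine like the shim sketched earlier in this thread, which aisuite does not provide today:

```python
import asyncio


async def bounded_create(sem, prompt):
    # The semaphore caps in-flight requests so thousands of tasks
    # don't all hit the provider (and its rate limits) at once.
    async with sem:
        return await acreate(
            client, "openai:gpt-4o", [{"role": "user", "content": prompt}]
        )


async def run_all(prompts):
    sem = asyncio.Semaphore(100)  # at most 100 concurrent requests
    return await asyncio.gather(*(bounded_create(sem, p) for p in prompts))
```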

If you look into the code, you can see that all the test cases still pass, because I added almost nothing to the code; I just threaded the processing.

soulcarus commented 3 hours ago

Take a look at this: I made 49 requests, and they all returned within 8 seconds. Here's the kicker: I'm in Brazil, where we don't have any OpenAI API servers nearby. Despite this, the solution scales efficiently within this range, handling 49 expensive requests simultaneously without any noticeable bottlenecks.

(Screenshots of the code and its output.)

Video: https://drive.google.com/file/d/17wbfVsZnvVPSKumtsj63qS7srTSYLL82/view?usp=sharing

BobbyL2k commented 1 hour ago

> I refactored the code to use a thread pool instead of asyncio.
>
> Initially, I attempted an asyncio-based solution. However, implementing a feature that solely uses asyncio would have required modifying several lines of code, which would have been time-consuming and inefficient for this specific task.
>
> With just over 30 additional lines of code, I implemented a method that handles the heavy lifting by assigning each model inference to a separate thread. This change results in a performance improvement, reducing execution time by approximately 40% to 60%.
>
> For more details, you can check the full implementation here: Pull Request #64.

> Oh, I also wrote a full document (1 page) in which I explain why I THINK it is better to use threads rather than asyncio in this case
>
> https://docs.google.com/document/d/17kESXXEUkA0gwc6qksFnZ2i5IjCP3Nk7-CH6sjnsgIE/edit?usp=sharing

Thank you for taking the time to address this issue and for providing changes in Pull Request #64, which proposes dispatching multiple requests in parallel using ThreadPoolExecutor. While this approach offers a way to parallelize tasks, it doesn’t align with the needs of library users requiring an asynchronous interface.

The purpose of an asynchronous interface is to enable seamless integration with other asynchronously executing code, especially for I/O-bound operations. For example, in scenarios where multiple consumer requests hit a backend and each requires a call to a chat completion API, users often do not have a batch of requests to parallelize. Instead, they rely on the non-blocking nature of async operations to manage such tasks efficiently. This is a fundamental use case that the current solution in Pull Request #64 does NOT address.
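For concreteness, this is the kind of code an async interface would enable. `async_client` is hypothetical here, since aisuite has no async client today:

```python
# Hypothetical usage of an async aisuite client inside a backend service.
async def handle_user_request(user_prompt: str) -> str:
    # One call per incoming consumer request; there is no batch to
    # parallelize. The await yields the event loop, so thousands of
    # concurrent handlers can share one thread while waiting on I/O.
    response = await async_client.chat.completions.create(
        model="openai:gpt-4o",
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content
```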

I would also like to respond to points made in your accompanying document:

“However, for tasks that block the thread (like network calls to AI APIs), the performance gain with asyncio is limited because the asynchronous execution model is not as effective in these cases.”

This statement is incorrect. Network calls are inherently I/O-bound and benefit significantly from asyncio's non-blocking model. In contrast, the current synchronous implementation blocks the calling thread during I/O, causing an async program or service built on it to stall and undermining its responsiveness.
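To illustrate the failure mode, the blocking counterpart of the handler sketched above (with `client` standing in for today's synchronous aisuite client):

```python
async def handler(prompt: str):
    # BAD: a synchronous call inside a coroutine blocks the entire
    # event loop for the whole network round trip; every other task
    # on the loop stops making progress until it returns.
    return client.chat.completions.create(
        model="openai:gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
```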

“Asyncio: Working with asyncio requires a deeper understanding of the event loop, asynchronous task creation, and exception handling. If not configured correctly, using asyncio can introduce additional complexity and hard-to-debug errors, especially in applications requiring true parallelism.”

While true, this complexity is why the library itself should handle the implementation of async interfaces.

“Threads can execute multiple tasks simultaneously.”

This is generally accurate for many languages, but Python's Global Interpreter Lock (GIL) imposes significant limitations: it prevents multiple threads from executing Python bytecode concurrently, which reduces the effectiveness of threads for CPU-bound tasks.
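To make the distinction concrete: the GIL is released while a thread waits on network I/O, which is why the threading approach helps for API calls, but it serializes pure-Python computation (a minimal sketch with a made-up `cpu_heavy` function):

```python
from concurrent.futures import ThreadPoolExecutor


def cpu_heavy(n):
    # Pure Python bytecode: the GIL lets only one thread execute this
    # at a time, so four threads take roughly as long as one thread
    # doing the work four times over.
    return sum(i * i for i in range(n))


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_heavy, [10_000_000] * 4))
```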

In conclusion, while the use of ThreadPoolExecutor may improve performance in certain contexts, it is not an appropriate solution for this issue. An asynchronous implementation is required to serve the needs of library users writing asyncio-based Python.

soulcarus commented 1 hour ago

Now I understand your point, and I absolutely agree. It is indeed possible, and I’m willing to adapt the implementation accordingly.

When I said:

“However, for tasks that block the thread (like network calls to AI APIs), the performance gain with asyncio is limited because the asynchronous execution model is not as effective in these cases.”

My intention was to advocate for hosting and using locally saved models, which aligns with my area of expertise in the market: leveraging local computational power. However, I realize now that my wording may have caused some misunderstanding. I sincerely apologize for this and will take the opportunity to rewrite and clarify my thoughts in the morning (it's currently 4 a.m. here).

A truly asynchronous design allows handling multiple consumer requests without blocking the main flow, while using threads might introduce bottlenecks or overhead in high-concurrency situations. But, as I said, I was envisioning an approach for the lower-range user to extract the most performance (and it is faster now).

Regarding Pull Request #62, I believe it already addresses the intended purpose. That said, I plan to refine my PR further and make it more suitable for the broader needs of the library.