OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0

Look into asyncio client approach; add scaffold if this is not possible from upstream libraries #4

Closed: uogbuji closed this 1 year ago

uogbuji commented 1 year ago

@ChocolateEinstein and I were looking into Python async invocation via Langchain to an Ooba back end today. The immediate problem: an attempt to use LC's llm.agenerate coroutine seemed to just hang when I tried to use it fully async.

I know there are tons of limitations and points of brittleness around LC's async capabilities, but honestly, I can't contemplate any other way of writing LLM-based code. LC supports async for OpenAI via Python's openai lib, but that relies on OpenAI's own capabilities (batch prompt processing, streaming, etc.). I had a hunch the emulation in Ooba's OpenAI extension wouldn't quite be up to all that, and a conversation with @matatonic seems to confirm this, though mata seems willing to put some work into addressing it (with reasonable limitations).

A few key tidbits from that exchange:

Separate note:

Workarounds until Ooba is async-ready

Meanwhile, I think what we'll look into is a multiprocess implementation that serializes access to the LLM hosted in Ooba. The front-end API will use some sort of message queuing or ticketing system to interface with the back end without blocking (a sleep/wake/poll-for-results pattern). We can implement this even if the Ooba API is strictly synchronous (i.e. blocking), and it would let us build things out in a way that makes it easy to drop in async Ooba API access once that's ready.
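To make the idea concrete, here is a minimal sketch of that serialization pattern, assuming an asyncio.Queue plus a single worker task; blocking_llm_call, llm_worker and submit are illustrative placeholders, not OgbujiPT or Ooba API.

import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_llm_call(prompt):
    '''Stand-in for the synchronous LLM request to the Ooba back end'''
    import time
    time.sleep(2)  # Simulate a long-running request
    return f'ECHO: {prompt}'

async def llm_worker(queue, executor):
    '''Single consumer task: serializes all access to the blocking back end'''
    loop = asyncio.get_running_loop()
    while True:
        prompt, result_fut = await queue.get()
        # Run the blocking call in a separate process so the event loop stays responsive
        result = await loop.run_in_executor(executor, blocking_llm_call, prompt)
        result_fut.set_result(result)
        queue.task_done()

async def submit(queue, prompt):
    '''Front-end call: drop a "ticket" on the queue, then await its result'''
    result_fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, result_fut))
    return await result_fut

async def main():
    queue = asyncio.Queue()
    with ProcessPoolExecutor(max_workers=1) as executor:
        worker = asyncio.create_task(llm_worker(queue, executor))
        # Both requests are accepted immediately, but served one at a time
        answers = await asyncio.gather(
            submit(queue, 'First prompt'),
            submit(queue, 'Second prompt'))
        print(answers)
        worker.cancel()

if __name__ == '__main__':
    asyncio.run(main())

The single-worker queue is what enforces the serialization; swapping in truly async Ooba access later would just mean replacing the run_in_executor call inside llm_worker.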

matatonic commented 1 year ago
uogbuji commented 1 year ago

Thanks, @matatonic, for the clarifications. My main aim was to capture all the facets of the conversation, but anything that narrows down the work to be done is great.

This morning I put together the workaround I mentioned above. Here is an example, all as one module. My next step will be to abstract the multiprocess/executor bits into the OgbujiPT library, for easy invocation by users. Async will become the preferred way to invoke OgbujiPT, since, frankly, it's the right way to do it. After that we can discuss the relevant bits to chip away at on the Ooba server side.

@ChocolateEinstein, let me know how this code works for you.

import os
import asyncio
import concurrent.futures

from langchain import OpenAI

from ogbujipt.config import DEFAULT_API_PORT
from ogbujipt.model_style.alpaca import prep_instru_inputs, ALPACA_PROMPT_TMPL

API_HOST = 'http://192.168.1.10'  # Points to local machine running Ooba + OpenAI API
os.environ['OPENAI_API_BASE'] = f'{API_HOST}:{DEFAULT_API_PORT}/v1'

# Set up the API connector
llm = OpenAI(temperature=0.1)

# Set up the prompt
# Note: Python's input keyword probably won't play well here 😉
msg = 'Good morning. How are you?'

instru_inputs = prep_instru_inputs(
    'Translate the following message to French',
    inputs=msg
    )

prompt = ALPACA_PROMPT_TMPL.format(instru_inputs=instru_inputs)

# Set up the async LLM invocation
def call_llm(prompt):
    '''
    Actual LLM request, to be executed in a separate process
    '''
    return llm(prompt)

# Could probably use something like tqdm.asyncio, if we wanted to be fancy
async def indicate_progress(pause):
    '''
    Progress indicator for the console. Just prints dots.
    '''
    while True:
        print('.', end='', flush=True)
        await asyncio.sleep(pause)

async def main():
    '''
    Schedule one task to do the long-running/blocking LLM request, and another
    to run a progress indicator in the background
    '''
    loop = asyncio.get_running_loop()
    executor = concurrent.futures.ProcessPoolExecutor()
    main_task = loop.run_in_executor(executor, call_llm, prompt)
    tasks = [
        main_task,
        asyncio.create_task(indicate_progress(0.5))
        ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:  # Cancel the progress indicator once the LLM call completes
        task.cancel()
    print('\nResult: ', next(iter(done)).result())

# Important: if using multiprocessing via ProcessPoolExecutor
# in the main module, its entry point must be guarded, since the child Python processes will re-import it
# https://docs.python.org/3/library/multiprocessing.html#multiprocessing-safe-main-import
if __name__ == '__main__':
    # Launch the asyncio main loop, but only from the main module initial invocation
    asyncio.run(main())
uogbuji commented 1 year ago

A simpler version of the above is now possible with the newly implemented ogbujipt.async_helper.schedule_llm_call(). See the added demos/alpaca_multitask_fix_xml.py
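For orientation, here is a rough sketch of how the earlier example might collapse with that helper; the schedule_llm_call(llm, prompt) signature shown is an assumption, so check ogbujipt.async_helper and the demo for the actual interface.

import asyncio

from langchain import OpenAI

# Assumed interface: schedule_llm_call(llm, prompt) runs the blocking LLM call in an
# executor and returns its result; see ogbujipt.async_helper for the real signature
from ogbujipt.async_helper import schedule_llm_call

llm = OpenAI(temperature=0.1)

async def indicate_progress(pause):
    '''Print dots to the console while the LLM call is in flight'''
    while True:
        print('.', end='', flush=True)
        await asyncio.sleep(pause)

async def main(prompt):
    llm_task = asyncio.create_task(schedule_llm_call(llm, prompt))
    progress_task = asyncio.create_task(indicate_progress(0.5))
    done, _ = await asyncio.wait(
        [llm_task, progress_task], return_when=asyncio.FIRST_COMPLETED)
    progress_task.cancel()
    print('\nResult:', next(iter(done)).result())

if __name__ == '__main__':
    asyncio.run(main('Good morning. How are you?'))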