Closed uogbuji closed 1 year ago
Hey @uogbuji i think we might be able to help here - https://github.com/BerriAI/litellm
I'm the maintainer of litellm - a drop-in replacement for the openai-python sdk that handles api calls for anthropic, azure, huggingface, togetherai, replicate, etc.
Hi @krrishdholakia thanks for your interest in what we're doing here! I like the bias to simplicity in litellm. I'd have to dig in a lot more, but just at first glance I was struck by this snippet:
response = completion(model="gpt-3.5-turbo", messages=messages, stream=True)
for chunk in response:
    print(chunk['choices'][0]['delta'])
I'd expect that to be an async for (or some equivalent construct); otherwise it's not really streaming, I think, or at least not in a way that supports concurrency.
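To illustrate the point about concurrency: with an async iterator, control returns to the event loop at each chunk, so multiple streams can be consumed at once in a single thread. This is a minimal self-contained sketch of that idea (stream_tokens is a stand-in generator, not litellm's or any real LLM API):

```python
import asyncio

async def stream_tokens():
    '''Stand-in for a streamed LLM response; each chunk simulates network latency'''
    for tok in ['The', ' quick', ' brown', ' fox']:
        await asyncio.sleep(0.01)
        yield tok

async def consume(name):
    chunks = []
    # async for yields to the event loop between chunks,
    # so other tasks make progress while this one awaits
    async for tok in stream_tokens():
        chunks.append(tok)
    return f'{name}: {"".join(chunks)}'

async def main():
    # Two streams consumed concurrently; a plain (sync) for loop
    # would force them to run back to back
    return await asyncio.gather(consume('a'), consume('b'))

results = asyncio.run(main())
print(results)
```

A synchronous `for` over a blocking response would hold the thread for the entire stream, which is the isolation/concurrency concern raised above.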
One of the biggest reasons we're rewrapping so much of this is to get true concurrency and (reasonable) isolation right. I'm definitely open to collaborations, so as I say, I'll try to get a chance to dig more into litellm to make sure it would suit our architectural imperatives. Unfortunately I'm heading into a block of travel. I'll be scarce in the 1st & 3rd weeks of October & scrambling to keep up in between, so if you don't hear back in a little while, that's probably why.
hey @uogbuji we support async streaming as well - https://docs.litellm.ai/docs/completion/stream#async-streaming
from litellm import completion
import asyncio
import time
import traceback

def logger_fn(model_call_object: dict):
    print(f"LOGGER FUNCTION: {model_call_object}")

user_message = "Hello, how are you?"
messages = [{"content": user_message, "role": "user"}]

async def completion_call():
    try:
        response = completion(
            model="gpt-3.5-turbo", messages=messages, stream=True, logger_fn=logger_fn
        )
        print(f"response: {response}")
        complete_response = ""
        start_time = time.time()
        # Use an async for loop, since the streamed response is an async iterator
        async for chunk in response:
            chunk_time = time.time()
            print(f"time since initial request: {chunk_time - start_time:.5f}")
            print(chunk["choices"][0]["delta"])
            # The final delta may have no content key, so guard the access
            complete_response += chunk["choices"][0]["delta"].get("content") or ""
        if complete_response == "":
            raise Exception("Empty response received")
    except Exception:
        print(f"error occurred: {traceback.format_exc()}")

asyncio.run(completion_call())
Let me know if this solves your problem.
Also open to suggestions on where in the docs you were looking for this.
With the latest:
from ctransformers import AutoModelForCausalLM
from ogbujipt.llm_wrapper import ctrans_wrapper
MY_MODELS = '/Users/uche/.local/share/models' # Salt to taste
model = AutoModelForCausalLM.from_pretrained(
    f'{MY_MODELS}/TheBloke_LlongOrca-13B-16K-GGUF',
    model_file='llongorca-13b-16k.Q5_K_M.gguf',
    model_type='llama',
    gpu_layers=50)
oapi = ctrans_wrapper(model=model)
print(oapi('The quick brown fox'))
Built ctransformers for my Mac as follows:
CT_METAL=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers
i'm confused. this looks like you're calling local models. i thought the issue was for openai api calls?
Hi @krrishdholakia, I'm sure I could be offering more clarity on all this. As I mentioned, I have upcoming travel. This particular burst of work addresses an issue for a client, and I want to get some previously planned moves in place before I leave on Wednesday. Some of this work was partly documented in internal repositories you won't have seen.
That said, this ticket is about encapsulating LLM capability in general; OpenAI APIs are but one means of working with LLMs. We've always intended to support a selection of in-memory LLM loaders as well. This commit brings some work from a separate repository into OgbujiPT.
Thanks for answering my question about async support in litellm. I do plan to have a look, but again, my priority is getting some pre-discussed bits in place for my colleagues before my trip. As for your question about the litellm docs: I went purely by the README, and haven't yet had a chance to peruse the full docs.
I'll definitely add some more context to the corresponding PR, needed for the changelog anyway.
Right now, as a holdover from our initial, long-obsolete LangChain orientation, we manage the OpenAI API connection globally. To be fair, the openai library encourages such bad habits as well.
In addition to just having cleaner code, we also need to support multiple LLMs; for example, someone might use different LLMs for different parts of an agent/tool interaction.
We can enable all this by encapsulating LLM connections in objects. This will also come in handy as we add support for connections not made via the OpenAI API (e.g. LLMs loaded within the local process).
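The encapsulation idea above can be sketched roughly as follows. This is a hypothetical illustration only; the class and method names are illustrative stand-ins, not OgbujiPT's actual API:

```python
class llm_wrapper:
    '''Base class holding per-instance connection details (no global state)'''
    def __init__(self, model, **connect_params):
        self.model = model
        self.connect_params = connect_params

class openai_api_wrapper(llm_wrapper):
    '''Would wrap a remote OpenAI-style API endpoint'''
    def __call__(self, prompt, **kwargs):
        # Placeholder: a real version would make the HTTP request here
        return f'[{self.model} @ {self.connect_params.get("base_url")}] {prompt}'

class local_model_wrapper(llm_wrapper):
    '''Would wrap an in-process loaded model (e.g. via ctransformers)'''
    def __call__(self, prompt, **kwargs):
        # Placeholder: a real version would invoke the loaded model object
        return f'[local {self.model}] {prompt}'

# Two independent connections coexist, so an agent can route
# different steps to different LLMs
fast_llm = openai_api_wrapper('gpt-3.5-turbo', base_url='http://localhost:8000')
smart_llm = local_model_wrapper('llongorca-13b')
print(fast_llm('Summarize this'))
print(smart_llm('Plan the next step'))
```

The key design point is that connection parameters live on the instance rather than in module-level globals, so multiple wrappers with different endpoints or locally loaded models can be used side by side.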