dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Add Outlines model class for outlines-enabled remote API #597

Open davidsyoung opened 8 months ago

davidsyoung commented 8 months ago

What behavior of the library made you think about the improvement?

I have just started to use Outlines, and my use case is that I am hosting a local model on a server using Serve with vLLM.

Once I had the model being served correctly, I looked for a way to connect the outlines python package to this server.

While the tooling for running local models directly is excellent, there is currently no adapter for connecting to an "Outlines-enabled" API; instead, I have to convert my desired prompt/behaviour to JSON/regex myself before sending it to the server.

This reduces the advantage of the outlines package.

How would you like it to behave?

While the current way to call models is something like this...

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.text(model, max_tokens=100)

result = generator("What's 2+2?")

print(result)
# That's right, it's 4! But remember, a delicious and nutrient dense 4,
# according to YEARS BUILT ON SOLID SCIENCE. This column presents additional
# findings from the fifteen-year study that produced the 2+2=4 conclusion.

I see the new implementation as a drop-in replacement for the transformers model (or any other):

model = outlines.models.remote("http://<ip address>:<port>/")
generator = outlines.generate.text(model, max_tokens=100)

result = generator("What's 2+2?")

print(result)
# That's right, it's 4! But remember, a delicious and nutrient dense 4,
# according to YEARS BUILT ON SOLID SCIENCE. This column presents additional
# findings from the fifteen-year study that produced the 2+2=4 conclusion.

Implementation-wise, you will know the best approach based on how the library functions, but in my mind there are two ways to implement it:

  1. Create an adapter that converts to regex or JSON Schema and uses the current /generate endpoint (a rough sketch follows this list).

  2. Create a new remote endpoint that serialises local state and then passes it to the server.
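As a rough illustration of option 1, I imagine a thin client-side adapter that turns a Pydantic model into a JSON Schema and posts it to the server's /generate endpoint. This is only a sketch: the payload keys and response shape are assumptions about the serving implementation, not an existing Outlines API.

import requests
from pydantic import BaseModel


class Answer(BaseModel):
    value: int
    explanation: str


def generate_remote(base_url: str, prompt: str, schema: type[BaseModel]) -> dict:
    # Convert the local Pydantic definition to a JSON Schema before sending,
    # so that the server performs the constrained decoding.
    payload = {"prompt": prompt, "schema": schema.model_json_schema()}
    response = requests.post(f"{base_url}/generate", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()  # exact shape depends on the serving implementation


# result = generate_remote("http://<ip address>:<port>", "What's 2+2?", Answer)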

This functionality would really make Outlines much easier to use for anyone running it in an API/server capacity rather than locally.

Thank you in advance!

lapp0 commented 8 months ago

Thanks for the issue.

I think it makes sense to integrate vLLM into outlines.models so users can use outlines.generate functions with it. Furthermore, outlines.serve should be made generic, allowing users to host any model as an endpoint.

Sending logits back and forth via an HTTP server creates an additional point of failure and introduces overhead and complexity.

Would the ability to run the following satisfy your needs?

model = outlines.models.vllm("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.text(model, max_tokens=100)

result = generator("What's 2+2?")

davidsyoung commented 8 months ago

Thanks for this @lapp0. Really appreciate your time.

The main issue I see with the above is that, typically (or at least in my use case), vLLM would be hosted on a separate server as a backend for multiple different workflows.

The client would then stay a "thin" client, only interfacing with the server.

I do also understand your perspective. In an ideal world, I would see it working very much like the OpenAI client, with of course the additional generation capabilities that Outlines allows.
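Purely for illustration, something like this is what I have in mind; the outlines.models.remote name is hypothetical, while outlines.generate.json is the existing structured-generation entry point:

from pydantic import BaseModel

import outlines


class Answer(BaseModel):
    value: int


# The client only holds the prompt and the output constraint; tokenization
# and constrained decoding would all happen server-side.
model = outlines.models.remote("http://<ip address>:<port>/")  # hypothetical
generator = outlines.generate.json(model, Answer)
result = generator("What's 2+2?")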

CiANSfi commented 8 months ago

I would also benefit from the use case @davidsyoung describes. +1

Alternatively, exposing more inference engine features on the integration side would help. For instance, to fit a specific exl2 model into my GPU's VRAM when running it in native exllamav2/exui, I can simply change the cache mode to FP8 and everything runs fine. When starting the same model in Outlines via outlines.models.exllamav2, I cannot control that setting.
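For reference, this is roughly what that switch looks like in native exllamav2 (a sketch from memory; exact calls may differ between versions):

from exllamav2 import ExLlamaV2, ExLlamaV2Cache_8bit, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2/model"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # FP8 KV cache instead of the default FP16
model.load_autosplit(cache)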

Of course, if I am already able to do this (possibly via some kind of kwarg?) and simply don't know how to, please correct me! @lapp0

lapp0 commented 7 months ago

@davidsyoung Your use case makes sense; it's reasonable to expose Outlines as an API. You would need to start vLLM through Outlines on your server, and Outlines would need to be updated to support a REST API. For now, the best you can do is write an HTTP server wrapping outlines.generate calls.
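Something along these lines would work as a stopgap; the endpoint name, payload fields, and model are placeholders rather than an Outlines API:

from fastapi import FastAPI
from pydantic import BaseModel

import outlines

app = FastAPI()
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")


class GenerateRequest(BaseModel):
    prompt: str
    regex: str  # e.g. "[0-9]+" to constrain the output to digits


@app.post("/generate")
def generate(request: GenerateRequest):
    # Build a constrained generator for this request and run it on the server.
    generator = outlines.generate.regex(model, request.regex)
    return {"text": generator(request.prompt)}

# Run with e.g.: uvicorn server:app --host 0.0.0.0 --port 8000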

@CiANSfi you are correct, only ExLlamaV2Cache is supported by outlines, not ExLlamaV2Cache_8bit. Could you open a separate issue for that please? Seems like a relatively simple fix.