Closed rasbt closed 1 week ago
Is this what you were looking for @williamFalcon ?
nice!
can you test it with this Studio https://lightning.ai/lightning-ai/studios/deploy-a-private-llama-3-8b-api
It should work now @williamFalcon .
I modeled the adjusted `litgpt serve`, which optionally uses the `stream=True` option in `generate`, after this example in LitServe. Since I am not super familiar with LitServe, could you double-check that this is indeed the correct usage, @aniketmaurya and @lantiga?
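For context, the streaming path boils down to `predict` yielding tokens one at a time instead of returning the full completion. Here is a minimal sketch of that control flow, where `fake_generate` is a dummy stand-in for LitGPT's real `generate` (the names and signatures are illustrative, not the actual API):

```python
from typing import Iterator


def fake_generate(prompt: str, stream: bool = False):
    """Stand-in for litgpt's generate: yields tokens when stream=True."""
    tokens = ["Hello", " ", "world", "!"]
    if stream:
        # Streaming mode: hand back an iterator of tokens.
        return iter(tokens)
    # Non-streaming mode: return the fully assembled string.
    return "".join(tokens)


def predict(prompt: str, stream: bool) -> Iterator[str]:
    """Server-side predict: always a generator, so the server can stream either way."""
    if stream:
        yield from fake_generate(prompt, stream=True)
    else:
        yield fake_generate(prompt, stream=False)
```

In the streaming case the client sees one chunk per token; in the non-streaming case it sees a single chunk with the whole completion.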
Many thanks for the feedback @aniketmaurya ! Does the current code look better?
@carmocca That's a good question; I am not sure why I was so fixated on `generate`. Let me see (in a separate PR) if I can make this serving example work with `chat`, so we don't need to modify `generate`.
Thanks for the feedback, @aniketmaurya !
Also, please pull the latest LitServe; last week we merged a PR that affects streaming.
I could try that internally for testing, but we'd probably have to wait until the next release. Could you ping me when the latest LitServe comes out so that I can bump the version in LitGPT?
Thanks for all your help with LitServe so far, @aniketmaurya. It's probably something stupid on my part, but with the recent LitServe version from GitHub main it's somehow still not working right. I.e., with

```sh
litgpt serve --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream true
```

and

```python
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
)
```

the `response.content` comes out as

```
b'{"output": " file"}{"output": ":"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "7"}{"output": "."}{"output": "idx"}{"output": "."}{"output": "gz"}{"output": ":"}{"output": " error"}{"output": " opening"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}'
```

whereas `--stream false` works as expected via LitServe (please ignore the bad Pythia response; I just used a small model here for demoing):

```
b'{"output":"Fix typos in the following sentence: Exampel input.\\n\\nThis book is now available as a Kindle app.\\n\\nPlease go to the print page to view the rest of this book.\\n\\n## **Contents**\\n\\n**1** The Problem\\n\\n**2** How We Lear"}'
```
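Incidentally, a concatenated bytestring of back-to-back JSON objects like the one above can still be decoded offline: `json.JSONDecoder.raw_decode` parses one object at a time and reports where it ended. This is a generic recipe, not anything LitServe-specific:

```python
import json


def decode_concatenated_json(raw: bytes) -> str:
    """Split back-to-back JSON objects like {"output": "a"}{"output": "b"}
    and join their "output" fields into one string."""
    decoder = json.JSONDecoder()
    text = raw.decode("utf-8")
    pieces = []
    idx = 0
    while idx < len(text):
        # raw_decode returns (parsed_object, index_just_past_the_object)
        obj, idx = decoder.raw_decode(text, idx)
        pieces.append(obj["output"])
    return "".join(pieces)
```

E.g. `decode_concatenated_json(b'{"output": "Hel"}{"output": "lo"}')` yields `"Hello"`.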
I think it's something in my LitServe implementation, because `generate` with `stream=True` works as intended (e.g., via `litgpt generate base --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream true`).
If you have some time to take another look, that'd be really appreciated.
Hi @rasbt, we will have to iterate through the response in streaming mode, like:

```python
import requests

url = "http://127.0.0.1:8000/predict"
resp = requests.post(
    url,
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
    headers=None,
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))
```

When we don't iterate and instead use `response.content`, requests concatenates everything into a single bytestring.
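Building on that, each streamed line is itself a small JSON object, so the client can reassemble the full completion as it arrives. A sketch, assuming one JSON object per line as in the outputs shown in this thread:

```python
import json


def collect_stream(lines) -> str:
    """Accumulate the "output" field of each streamed JSON line.

    `lines` stands in for resp.iter_lines(): an iterable of bytes,
    possibly containing empty keep-alive lines that we skip.
    """
    pieces = []
    for line in lines:
        if line:
            pieces.append(json.loads(line)["output"])
    return "".join(pieces)
```

E.g. feeding it `[b'{"output": "Fix"}', b'', b'{"output": " typos"}']` returns `"Fix typos"`.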
That makes sense! Thanks for your patience here, @aniketmaurya; I know almost nothing about web stuff and serving. The code you shared makes sense, but it's still only generating a single token, so something must still be incorrect somewhere.
However, I also wonder if there's maybe a LitServe issue. I tried a simpler example:
```python
import litserve as ls


class SimpleStreamAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x: str(x)

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for i in range(10):
            yield self.model(i)

    def encode_response(self, output_stream):
        for output in output_stream:
            yield {"output": output}


if __name__ == "__main__":
    server = ls.LitServer(SimpleStreamAPI(), stream=True)
    server.run(port=8000)
```
and then, via

```python
import requests

url = "http://127.0.0.1:8000/predict"
resp = requests.post(url, json={"input": "1, 2, 3"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))
```
it just returns `["{\"output\": \"0\"}"]`, which doesn't seem right. I would expect it to return 10 items due to the for-loop in `predict`, or am I misunderstanding this?
EDIT: Nevermind, I pulled the latest LitServe version and it's fixed!
Ok great, this works flawlessly with the new LitServe version now:

```python
import requests

url = "http://127.0.0.1:8000/predict"
resp = requests.post(url, json={"prompt": "Hello world"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))
```

```
{"output": "\n"}{"output": "\n"}{"output": "It"}{"output": " seems"}{"output": " that"}{"output": " I"}{"output": " can"}{"output": "'t"}{"output": " load"}{"output": " my"}{"output": " php"}{"output": " page"}{"output": " from"}{"output": " my"}{"output": " html"}{"output": " file"}{"output": "."}{"output": "\n"}{"output": "\n"}{"output": "A"}{"output": ":"}{"output": "\n"}{"output": "\n"}{"output": "1"}{"output": ")"}{"output": " Check"}{"output": " this"}{"output": " link"}{"output": ","}{"output": " it"}{"output": " might"}{"output": " solve"}{"output": " your"}{"output": " problem"}{"output": "."}{"output": "\n"}{"output": "2"}{"output": ")"}{"output": "Check"}{"output": " this"}{"output": " article"}{"output": "."}{"output": "\n"}{"output": "3"}{"output": ")"}{"output": " check"}{"output": " your"}{"output": " php"}{"output": " file"}{"output": " path"}
```
I now also got it to work with the chat version via #1426. I guess we can close this one, then, and leave the base `generate` function untouched.
Or alternatively, the question is whether we should replace the `generate` function in `chat/base.py` with this one here, which can handle both streaming and non-streaming, @carmocca? In the end, it would be less code to maintain (one `generate` function instead of two).
I didn't merge them before because the logic to check for the stop tokens is not as performant as simply generating a fixed number of tokens. They could probably still be reconciled with some if-elses, but code simplicity was preferable to deduplication before.
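For what it's worth, the two paths could be folded together roughly like this, at the cost of paying for the stop-token check in the non-streaming case too. This is purely illustrative; `sample` and the stop-token handling here are placeholders, not LitGPT's actual internals:

```python
def unified_generate(sample, max_new_tokens, stop_tokens=(), stream=False):
    """One generate loop for both modes.

    `sample` is a zero-argument callable producing the next token;
    real code would thread the model and decoding state through instead.
    """
    def token_iter():
        for _ in range(max_new_tokens):
            token = sample()
            if token in stop_tokens:  # per-token check: slower than a fixed-length loop
                return
            yield token

    if stream:
        return token_iter()       # caller consumes tokens lazily
    return list(token_iter())     # materialize everything for the one-shot path
```

The trade-off mentioned above is visible in the loop body: the fixed-length variant can skip the membership test entirely, which is why keeping two separate functions was simpler before.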
Makes sense and no worries. I'd say let's keep it as is. I got the LitServe streaming example to work after all, so mission accomplished 😊
Fixes #1420