Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Adds streaming option to generate #1424

Closed: rasbt closed this 1 week ago

rasbt commented 2 weeks ago

Fixes #1420

rasbt commented 2 weeks ago

Is this what you were looking for @williamFalcon ?

williamFalcon commented 2 weeks ago

Nice!

Can you test it with this Studio? https://lightning.ai/lightning-ai/studios/deploy-a-private-llama-3-8b-api

rasbt commented 2 weeks ago

It should work now, @williamFalcon.

I modeled the adjusted litgpt serve, which optionally uses the stream=True option in generate, after this example in LitServe. Since I am not super familiar with LitServe, could you double-check that this is indeed the correct usage, @aniketmaurya and @lantiga?
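Roughly, the pattern looks like this (just a minimal sketch, not the actual litgpt serve code; the class name and the stand-in "model" below are made up for illustration):

import litserve as ls


class StreamingLLMAPI(ls.LitAPI):
    def setup(self, device):
        # Stand-in for loading a real checkpoint; a real API would load the LLM here
        self.model = lambda prompt: (word + " " for word in prompt.split())

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Yield tokens one at a time instead of returning the full completion
        for token in self.model(prompt):
            yield token

    def encode_response(self, output_stream):
        for token in output_stream:
            yield {"output": token}


if __name__ == "__main__":
    server = ls.LitServer(StreamingLLMAPI(), stream=True)
    server.run(port=8000)

The idea being that with stream=True, each yielded chunk is sent to the client as it is produced rather than after the full response is generated.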

rasbt commented 2 weeks ago

Many thanks for the feedback, @aniketmaurya! Does the current code look better?

rasbt commented 1 week ago

@carmocca That's a good question; I am not sure why I was so fixated on generate. Let me see (in a separate PR) if I can make this serving example work with chat; then we don't need to modify generate.

Thanks for the feedback, @aniketmaurya !

Also, please take the latest pull of LitServe; last week we merged a PR that affects streaming.

I could try that internally for testing, but we'd probably have to wait until the next release. Could you ping me when the latest LitServe release comes out so that I can bump the version in LitGPT?

rasbt commented 1 week ago

Thanks for all your help with LitServe so far, @aniketmaurya. It's probably something stupid on my part, but with the recent LitServe version from GitHub main it's somehow still not working right.

I.e., starting the server via

litgpt serve --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream false

(or with --stream true for streaming) and querying it with

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
)

the response.content comes out as follows.

With --stream true:

b'{"output": " file"}{"output": ":"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "7"}{"output": "."}{"output": "idx"}{"output": "."}{"output": "gz"}{"output": ":"}{"output": " error"}{"output": " opening"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}'

whereas with --stream false it works as expected via LitServe (please ignore the bad Pythia response; I just used a small model here for demoing):

b'{"output":"Fix typos in the following sentence: Exampel input.\\n\\nThis book is now available as a Kindle app.\\n\\nPlease go to the print page to view the rest of this book.\\n\\n## **Contents**\\n\\n**1** The Problem\\n\\n**2** How We Lear"}'

I think it's something in my LitServe implementation, because generate with stream=True works as intended (e.g., via litgpt generate base --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream true).

If you have some time to take another look, that'd be really appreciated.

aniketmaurya commented 1 week ago

Hi @rasbt, we will have to iterate through the response in streaming mode, like this:

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"prompt": "Fix typos in the following sentence: Exampel input"}, headers=None,
                     stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))

When we don't iterate and instead use response.content, requests will concatenate everything into one bytestring.
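And to turn the streamed chunks back into plain text, one can parse and concatenate them (a small sketch, assuming each streamed line is a single JSON object of the form {"output": ...}, as in the examples in this thread):

import json

import requests

url = "http://127.0.0.1:8000/predict"
resp = requests.post(url, json={"prompt": "Hello world"}, stream=True)

text = ""
for line in resp.iter_lines():
    if line:
        # each non-empty line is expected to be one JSON chunk, e.g. {"output": " token"}
        chunk = json.loads(line.decode("utf-8"))
        text += chunk["output"]

print(text)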

rasbt commented 1 week ago

That makes sense! Thanks for your patience here, @aniketmaurya; I know almost nothing about web stuff and serving. The code you shared does make sense, but yeah, it's still only generating a single token. Something must still be incorrect somewhere.

However, I also wonder if there's maybe a LitServe issue. I tried a simpler example:

import litserve as ls
import json

class SimpleStreamAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x: str(x)

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for i in range(10):
            yield self.model(i)

    def encode_response(self, output_stream):
        for output in output_stream:
            yield {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(SimpleStreamAPI(), stream=True)
    server.run(port=8000)

and then via

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"input": "1, 2, 3"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))

it just returns ["{\"output\": \"0\"}"], which doesn't seem right. I would expect it to return 10 items due to the for-loop in predict, or am I misunderstanding this?

EDIT: Nevermind, I pulled the latest LitServe version and it's fixed!

rasbt commented 1 week ago

Ok great, this works flawlessly with the new LitServe version now:

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"prompt": "Hello world"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))

This prints:

{"output": "\n"}{"output": "\n"}{"output": "It"}{"output": " seems"}{"output": " that"}{"output": " I"}{"output": " can"}{"output": "'t"}{"output": " load"}{"output": " my"}{"output": " php"}{"output": " page"}{"output": " from"}{"output": " my"}{"output": " html"}{"output": " file"}{"output": "."}{"output": "\n"}{"output": "\n"}{"output": "A"}{"output": ":"}{"output": "\n"}{"output": "\n"}{"output": "1"}{"output": ")"}{"output": " Check"}{"output": " this"}{"output": " link"}{"output": ","}{"output": " it"}{"output": " might"}{"output": " solve"}{"output": " your"}{"output": " problem"}{"output": "."}{"output": "\n"}{"output": "2"}{"output": ")"}{"output": "Check"}{"output": " this"}{"output": " article"}{"output": "."}{"output": "\n"}{"output": "3"}{"output": ")"}{"output": " check"}{"output": " your"}{"output": " php"}{"output": " file"}{"output": " path"}

rasbt commented 1 week ago

I now also got it to work with the chat version via #1426. I guess we can close this one then and leave the base generate function untouched.

Alternatively, the question is whether we should replace the generate function in chat/base.py with this one here, which can handle both streaming and non-streaming, @carmocca? In the end, it would be less code to maintain (only one generate function instead of two).

carmocca commented 1 week ago

I didn't merge them before because the logic that checks for the stop tokens is not as performant as simply generating a fixed number of tokens. They could probably still be reconciled with some if-elses, but code simplicity over deduplication was preferable before.
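To illustrate the trade-off with a toy sketch (purely schematic, not litgpt's actual generate code): without stop tokens the loop just counts tokens, with stop tokens every generated token needs an extra check, and merging the two paths amounts to an if-else like this:

import itertools


def toy_generate(token_source, max_new_tokens, stop_tokens=None):
    # Schematic merged loop: with stop_tokens=None this is a plain fixed-length loop,
    # otherwise every generated token is additionally checked against the stop tokens.
    for _, token in zip(range(max_new_tokens), token_source):
        if stop_tokens is not None and token in stop_tokens:
            break
        yield token


# Dummy "model" that just emits increasing integers as tokens
print(list(toy_generate(itertools.count(), max_new_tokens=8)))                    # [0, 1, ..., 7]
print(list(toy_generate(itertools.count(), max_new_tokens=8, stop_tokens={5})))   # stops early: [0, ..., 4]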

rasbt commented 1 week ago

Makes sense and no worries. I'd say let's keep it as is. I got the LitServe streaming example to work after all, so mission accomplished 😊