dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

How to know when LLM is done with reply? #12

Open ShawnHymel opened 5 months ago

ShawnHymel commented 5 months ago

In e.g. web_chat.py, you have the following callback:

    def on_llm_reply(self, text):
        """
        Update the web chat history when the latest LLM response arrives.
        """
        self.send_chat_history()

From what I can tell, this is called each time the LLM generates a token in response to a prompt. How can you tell when the LLM has finished generating tokens for a given prompt? Or should I just set a simple timeout (e.g., if no tokens have been generated for 0.5 seconds, send a 'done' signal)?
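
For reference, the timeout fallback I have in mind is something like this minimal sketch (the `ReplyTimeout` class and the `on_done` callback are just placeholders I made up, not part of NanoLLM):

    import threading

    class ReplyTimeout:
        """
        Hypothetical fallback: treat the reply as finished if no new
        tokens arrive within `timeout` seconds.
        """
        def __init__(self, timeout=0.5, on_done=None):
            self.timeout = timeout                       # seconds of silence before declaring 'done'
            self.on_done = on_done or (lambda: None)     # called once the reply looks complete
            self._timer = None

        def on_token(self, text):
            # restart the countdown every time a new token arrives
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_done)
            self._timer.start()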

dusty-nv commented 5 months ago

Hi @ShawnHymel! The EOS stop tokens like </s> are included in the raw bot output. So you can do quick checks like this:

    from nano_llm import StopTokens

    # check whether the reply ends with one of the stop tokens
    if text.endswith(tuple(StopTokens)):
        print('EOS')

    # or check whether a stop token appears anywhere in the text
    if any(stop_token in text for stop_token in StopTokens):
        print('EOS')