NVIDIA / GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
https://nvidia.github.io/GenerativeAIExamples/latest/index.html
Apache License 2.0

langchain_nvidia_trt not working #108

Open rbgo404 opened 2 months ago

rbgo404 commented 2 months ago

I have gone through the notebooks but couldn't stream the tokens from TensorRT-LLM. Here's the issue: (screenshot attached)

Code used:

from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time
import random

triton_url = "localhost:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0,
}
client = TritonTensorRTLLM(**pload)

LLAMA_PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "{system_prompt}"
 "<</SYS>>"
 "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)

start_time = time.time()
tokens_generated = 0

for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")
rbgo404 commented 2 months ago

Please share the configuration on the TensorRT-LLM end. What parameter modifications are required in the model's config.pbtxt?
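For reference, streaming from the Triton TensorRT-LLM backend generally requires the model to be served in decoupled mode. Below is a minimal, hypothetical excerpt of the relevant config.pbtxt entries, assuming the standard tensorrt_llm backend layout; the exact fields, values, and paths depend on your TensorRT-LLM backend version and engine build, so treat this as a sketch rather than the project's actual configuration:

```
# Hypothetical excerpt of triton_model_repo/tensorrt_llm/config.pbtxt (illustrative only).
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 8

# Decoupled mode is what allows the server to stream tokens back incrementally.
model_transaction_policy {
  decoupled: true
}

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/model/tensorrt_llm/1" }  # placeholder path to the built engine
}
```

If the "ensemble" model is being queried, the tensorrt_llm submodel it wraps is the one that needs the decoupled transaction policy for streaming to work.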

shubhadeepd commented 2 months ago

Hey @rbgo404, you can deploy the TensorRT-based LLM model by following the steps here: https://nvidia.github.io/GenerativeAIExamples/latest/local-gpu.html#using-local-gpus-for-a-q-a-chatbot

This notebook interacts with the model deployed behind the llm-inference-server container, which should start up if you follow the steps above.

Let me know if you have any questions once you go through these steps!
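As a debugging aid, here is a small, hypothetical sanity check (not part of the example notebooks) that uses the standard tritonclient package to confirm the Triton endpoint exposed by the llm-inference-server container is live and that the ensemble model is ready before running the LangChain connector. The URL and model name below mirror the notebook's defaults and are assumptions; adjust them to match your deployment:

```python
# Hypothetical readiness check against the Triton gRPC endpoint.
# Assumes `pip install tritonclient[grpc]` and the defaults used in the notebook.
import tritonclient.grpc as grpcclient

TRITON_URL = "localhost:8001"   # gRPC port published by the container (assumption)
MODEL_NAME = "ensemble"         # model name used in the notebook

client = grpcclient.InferenceServerClient(url=TRITON_URL)

# Fail early with a clear message instead of a streaming error later on.
if not client.is_server_live():
    raise RuntimeError(f"Triton at {TRITON_URL} is not live")
if not client.is_model_ready(MODEL_NAME):
    raise RuntimeError(f"Model '{MODEL_NAME}' is not ready on {TRITON_URL}")

print("Triton server is live and the model is ready for streaming requests.")
```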

ChiBerkeley commented 2 months ago

Hi, I followed the instructions but still have a problem starting llm-inference-server. I'm currently using a Tesla M60 and llama-2-13b-chat. (screenshot attached)