NVIDIA / NeMo-Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.

Performance issues with Guardrails #154

Open lauradang opened 1 year ago

lauradang commented 1 year ago

I am noticing that using guardrails is 3-4x slower than just querying with a LangChain chain.

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.chains import RetrievalQA
from chromadb.config import Settings
from nemoguardrails import LLMRails, RailsConfig

model_name = "hkunlp/instructor-large"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceInstructEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
)

settings = Settings(chroma_api_impl='rest', chroma_server_host='host.com', chroma_server_http_port='80')
vectordb = Chroma(embedding_function=hf, collection_name='collection', client_settings=settings)

llm = get_llm('llm', max_tokens=256) # get_llm is a method to instantiate our custom langchain LLM

chain = RetrievalQA.from_chain_type(
    llm=llm,
    return_source_documents=True,
    chain_type='refine',
    retriever=vectordb.as_retriever(),
    verbose=True,
)

resp = chain({'query': 'Hello'}) # This line took 40.1s to complete

rails_config = RailsConfig.from_path(config_path)  # config_path points to the guardrails config directory (config.yml + Colang files)
rails_app = LLMRails(rails_config)
rails_app.register_action(chain, name='chain')  # expose the RetrievalQA chain to the guardrails flows as 'chain'

user_input = [{
    'role': 'user',
    'content': 'Hello'
}]

resp = rails_app.generate(messages=user_input) # This line took 1m45.5s to complete

Specifically, resp = chain({'query': 'Hello'}) and resp = rails_app.generate(messages=user_input) have very different runtimes: the call without guardrails took 40s to complete, while the call with guardrails took 1m45s.

I ran this experiment in a Jupyter notebook, but I am also seeing the same behavior with FastAPI.
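For reference, a minimal sketch of how the two calls can be timed (it reuses the chain and rails_app objects from the snippet above):

import time

t0 = time.perf_counter()
chain({'query': 'Hello'})
print(f'plain chain: {time.perf_counter() - t0:.1f}s')

t0 = time.perf_counter()
rails_app.generate(messages=[{'role': 'user', 'content': 'Hello'}])
print(f'with guardrails: {time.perf_counter() - t0:.1f}s')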

sidgan commented 1 year ago

If you run it with --verbose, you can see the exact time each command took; that will help narrow down the bottleneck and optimize further.
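For example, you can run the CLI chat with nemoguardrails chat --config=<config_path> --verbose, or enable it programmatically. Here is a minimal sketch reusing the objects from the snippet above (it assumes the same config_path):

import logging

# Surface the INFO-level nemoguardrails logs, which include the per-LLM-call timings.
logging.basicConfig(level=logging.INFO)

rails_config = RailsConfig.from_path(config_path)
rails_app = LLMRails(rails_config, verbose=True)  # verbose=True enables the detailed per-call logging
resp = rails_app.generate(messages=[{'role': 'user', 'content': 'Hello'}])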

lauradang commented 1 year ago

Thank you @sidgan. Below is the resulting log output.

It seems the LLM calls are taking up most of the time (roughly 46s, 10s, then 46s). Is there a way to use guardrails without calling the model so many times?

INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'UtteranceUserActionFinished', 'final_transcript': 'Hello'}
INFO:nemoguardrails.flows.runtime:Event :: UtteranceUserActionFinished {'final_transcript': 'Hello'}
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'StartInternalSystemAction', 'uid': 'b95d8f28-41fd-4884-b15a-5731a54d4b2f', 'event_created_at': '2023-10-13T03:02:26.525034+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_user_intent', 'action_params': {}, 'action_result_key': None, 'action_uid': 'acb89753-92fd-40b1-8647-fec3e6ebfa9d', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Event :: StartInternalSystemAction {'uid': 'b95d8f28-41fd-4884-b15a-5731a54d4b2f', 'event_created_at': '2023-10-13T03:02:26.525034+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_user_intent', 'action_params': {}, 'action_result_key': None, 'action_uid': 'acb89753-92fd-40b1-8647-fec3e6ebfa9d', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Executing action :: generate_user_intent
INFO:nemoguardrails.actions.action_dispatcher:Executing registered action: generate_user_intent
INFO:nemoguardrails.actions.llm.generation:Phase 1: Generating user intent
Batches: 100%|██████████| 1/1 [00:00<00:00, 28.30it/s]
INFO:nemoguardrails.logging.callbacks:Invocation Params :: {'model_name': ''model'', 'temperature': 0.0, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}, '_type': 'openai', 'stop': None}
INFO:nemoguardrails.logging.callbacks:Prompt :: """
Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it truthfully says it does not know.
"""

# This is how a conversation between a user and the bot can go:
user "Hello there!"
  express greeting
bot express greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "Tell me a bit about the history of NVIDIA."
  ask general question
bot response for general question
  "NVIDIA is a technology company that specializes in designing and manufacturing graphics processing units (GPUs) and other computer hardware. The company was founded in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem."
user "tell me more"
  request more information
bot provide more information
  "Initially, the company focused on developing 3D graphics processing technology for the PC gaming market. In 1999, NVIDIA released the GeForce 256, the world's first GPU, which was a major breakthrough for the gaming industry. The company continued to innovate in the GPU space, releasing new products and expanding into other markets such as professional graphics, mobile devices, and artificial intelligence."
user "thanks"
  express appreciation
bot express appreciation and offer additional help
  "You're welcome. If you have any more questions or if there's anything else I can help you with, please don't hesitate to ask."

# This is how the user talks:
user "You are stupid"
  express insult

user "hi"
  express greeting

# This is the current conversation between the user and the bot:
user "Hello there!"
  express greeting
bot express greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "Hello"

INFO:openai:message='OpenAI API response' path=<...> processing_ms=None request_id=None response_code=200
INFO:nemoguardrails.logging.callbacks:Completion ::   express greeting
bot respond to greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based
INFO:nemoguardrails.logging.callbacks:Output Stats :: {'token_usage': {'total_tokens': 801, 'prompt_tokens': 544, 'completion_tokens': 257}, 'model_name': ''model''}
INFO:nemoguardrails.logging.callbacks:--- :: LLM call took 46.04 seconds
INFO:nemoguardrails.actions.llm.generation:Canonical form for user intent: express greeting
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'UserIntent', 'uid': '3e657e91-d4a8-4459-8c7a-3f6283efa1cc', 'event_created_at': '2023-10-13T03:03:12.607580+00:00', 'source_uid': 'NeMoGuardrails', 'intent': 'express greeting'}
INFO:nemoguardrails.flows.runtime:Event :: UserIntent {'uid': '3e657e91-d4a8-4459-8c7a-3f6283efa1cc', 'event_created_at': '2023-10-13T03:03:12.607580+00:00', 'source_uid': 'NeMoGuardrails', 'intent': 'express greeting'}
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'StartInternalSystemAction', 'uid': '7c646e20-fdf9-4522-9235-a5329117496c', 'event_created_at': '2023-10-13T03:03:12.608035+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_next_step', 'action_params': {}, 'action_result_key': None, 'action_uid': '48efa02d-a49d-4aa5-9030-c1aead011e5a', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Event :: StartInternalSystemAction {'uid': '7c646e20-fdf9-4522-9235-a5329117496c', 'event_created_at': '2023-10-13T03:03:12.608035+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_next_step', 'action_params': {}, 'action_result_key': None, 'action_uid': '48efa02d-a49d-4aa5-9030-c1aead011e5a', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Executing action :: generate_next_step
INFO:nemoguardrails.actions.action_dispatcher:Executing registered action: generate_next_step
INFO:nemoguardrails.actions.llm.generation:Phase 2 :: Generating next step ...
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.75it/s]
INFO:nemoguardrails.logging.callbacks:Invocation Params :: {'model_name': ''model'', 'temperature': 0.0, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}, '_type': 'openai', 'stop': None}
INFO:nemoguardrails.logging.callbacks:Prompt :: """
Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it truthfully says it does not know.
"""

# This is how a conversation between a user and the bot can go:
user express greeting
bot express greeting
user ask about capabilities
bot respond about capabilities
user ask general question
bot response for general question
user request more information
bot provide more information
user express appreciation
bot express appreciation and offer additional help

# This is how the bot thinks:
user express insult
bot express calmly willingness to help

# This is the current conversation between the user and the bot:
user express greeting
bot express greeting
user ask about capabilities
bot respond about capabilities
user express greeting

OpenAI
Params: {'model_name': ''model'', 'temperature': 0.0, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}}
INFO:openai:message='OpenAI API response' path=<...> processing_ms=None request_id=None response_code=200
INFO:nemoguardrails.logging.callbacks:Completion :: bot express greeting
user ask general question
bot response for general question
user request more information
bot provide more information
user express appreciation
bot express appreciation and offer additional help
user ask about capabilities
bot respond
INFO:nemoguardrails.logging.callbacks:Output Stats :: {'token_usage': {'total_tokens': 242, 'prompt_tokens': 197, 'completion_tokens': 45}, 'model_name': ''model''}
INFO:nemoguardrails.logging.callbacks:--- :: LLM call took 10.11 seconds
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'BotIntent', 'uid': 'e54c6d9d-0600-428e-9ae9-bc3bf5f002a5', 'event_created_at': '2023-10-13T03:03:22.745071+00:00', 'source_uid': 'NeMoGuardrails', 'intent': 'express greeting'}
INFO:nemoguardrails.flows.runtime:Event :: BotIntent {'uid': 'e54c6d9d-0600-428e-9ae9-bc3bf5f002a5', 'event_created_at': '2023-10-13T03:03:22.745071+00:00', 'source_uid': 'NeMoGuardrails', 'intent': 'express greeting'}
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'StartInternalSystemAction', 'uid': '9799827c-74f1-4648-ad82-760470291162', 'event_created_at': '2023-10-13T03:03:22.746736+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'retrieve_relevant_chunks', 'action_params': {}, 'action_result_key': None, 'action_uid': 'b0b05b1c-18d8-438a-ad19-66a7c780acdd', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Event :: StartInternalSystemAction {'uid': '9799827c-74f1-4648-ad82-760470291162', 'event_created_at': '2023-10-13T03:03:22.746736+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'retrieve_relevant_chunks', 'action_params': {}, 'action_result_key': None, 'action_uid': 'b0b05b1c-18d8-438a-ad19-66a7c780acdd', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Executing action :: retrieve_relevant_chunks
INFO:nemoguardrails.actions.action_dispatcher:Executing registered action: retrieve_relevant_chunks
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'InternalSystemActionFinished', 'uid': '8db22530-4c0c-49e8-a9c1-7f70a57b2347', 'event_created_at': '2023-10-13T03:03:22.749009+00:00', 'source_uid': 'NeMoGuardrails', 'action_uid': 'b0b05b1c-18d8-438a-ad19-66a7c780acdd', 'action_name': 'retrieve_relevant_chunks', 'action_params': {}, 'action_result_key': None, 'status': 'success', 'is_success': True, 'return_value': '', 'events': None, 'is_system_action': True, 'action_finished_at': '2023-10-13T03:03:22.749018+00:00'}
INFO:nemoguardrails.flows.runtime:Event :: InternalSystemActionFinished {'uid': '8db22530-4c0c-49e8-a9c1-7f70a57b2347', 'event_created_at': '2023-10-13T03:03:22.749009+00:00', 'source_uid': 'NeMoGuardrails', 'action_uid': 'b0b05b1c-18d8-438a-ad19-66a7c780acdd', 'action_name': 'retrieve_relevant_chunks', 'action_params': {}, 'action_result_key': None, 'status': 'success', 'is_success': True, 'return_value': '', 'events': None, 'is_system_action': True, 'action_finished_at': '2023-10-13T03:03:22.749018+00:00'}
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'StartInternalSystemAction', 'uid': '21205d17-49b8-4f8a-b692-8a279573d7d2', 'event_created_at': '2023-10-13T03:03:22.750023+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_bot_message', 'action_params': {}, 'action_result_key': None, 'action_uid': '1c1a03ae-1c1b-462d-b5e3-69909d04a56d', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Event :: StartInternalSystemAction {'uid': '21205d17-49b8-4f8a-b692-8a279573d7d2', 'event_created_at': '2023-10-13T03:03:22.750023+00:00', 'source_uid': 'NeMoGuardrails', 'action_name': 'generate_bot_message', 'action_params': {}, 'action_result_key': None, 'action_uid': '1c1a03ae-1c1b-462d-b5e3-69909d04a56d', 'is_system_action': True}
INFO:nemoguardrails.flows.runtime:Executing action :: generate_bot_message
INFO:nemoguardrails.actions.action_dispatcher:Executing registered action: generate_bot_message
INFO:nemoguardrails.actions.llm.generation:Phase 3 :: Generating bot message ...
INFO:nemoguardrails.logging.callbacks:Invocation Params :: {'model_name': ''model'', 'temperature': 0.7, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}, '_type': 'openai', 'stop': None}
INFO:nemoguardrails.logging.callbacks:Prompt :: """
Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it truthfully says it does not know.
"""

# This is how a conversation between a user and the bot can go:
user "Hello there!"
  express greeting
bot express greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "Tell me a bit about the history of NVIDIA."
  ask general question
bot response for general question
  "NVIDIA is a technology company that specializes in designing and manufacturing graphics processing units (GPUs) and other computer hardware. The company was founded in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem."
user "tell me more"
  request more information
bot provide more information
  "Initially, the company focused on developing 3D graphics processing technology for the PC gaming market. In 1999, NVIDIA released the GeForce 256, the world's first GPU, which was a major breakthrough for the gaming industry. The company continued to innovate in the GPU space, releasing new products and expanding into other markets such as professional graphics, mobile devices, and artificial intelligence."
user "thanks"
  express appreciation
bot express appreciation and offer additional help
  "You're welcome. If you have any more questions or if there's anything else I can help you with, please don't hesitate to ask."

# This is how the bot talks:

# This is the current conversation between the user and the bot:
user "Hello there!"
  express greeting
bot express greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "Hello"
  express greeting
bot express greeting

INFO:openai:message='OpenAI API response' path=<...> processing_ms=None request_id=None response_code=200
INFO:nemoguardrails.logging.callbacks:Completion ::   "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can
INFO:nemoguardrails.logging.callbacks:Output Stats :: {'token_usage': {'total_tokens': 789, 'prompt_tokens': 532, 'completion_tokens': 257}, 'model_name': ''model''}
INFO:nemoguardrails.logging.callbacks:--- :: LLM call took 46.39 seconds
INFO:nemoguardrails.actions.llm.generation:--- :: LLM Bot Message Generation call took 46.39 seconds
INFO:nemoguardrails.actions.llm.generation:Generated bot message: Hello! How can I assist you today?
INFO:nemoguardrails.flows.runtime:Processing event: {'type': 'StartUtteranceBotAction', 'uid': 'f9ac5a60-a79f-4d2b-b863-2fab7e16bc87', 'event_created_at': '2023-10-13T03:04:09.146768+00:00', 'source_uid': 'NeMoGuardrails', 'script': 'Hello! How can I assist you today?', 'action_info_modality': 'bot_speech', 'action_info_modality_policy': 'replace', 'action_uid': '2945a74b-5f95-4fd0-be81-155626e021c8'}
INFO:nemoguardrails.flows.runtime:Event :: StartUtteranceBotAction {'uid': 'f9ac5a60-a79f-4d2b-b863-2fab7e16bc87', 'event_created_at': '2023-10-13T03:04:09.146768+00:00', 'source_uid': 'NeMoGuardrails', 'script': 'Hello! How can I assist you today?', 'action_info_modality': 'bot_speech', 'action_info_modality_policy': 'replace', 'action_uid': '2945a74b-5f95-4fd0-be81-155626e021c8'}
INFO:nemoguardrails.rails.llm.llmrails:Conversation history so far:
user "Hello"
  express greeting
bot express greeting
  "Hello! How can I assist you today?"

INFO:nemoguardrails.rails.llm.llmrails:--- :: Total processing took 102.63 seconds.
INFO:nemoguardrails.rails.llm.llmrails:--- :: Stats: 3 total calls, 102.5281150341034 total time, 1832 total tokens, 1273 total prompt tokens, 559 total completion tokens
sidgan commented 1 year ago

So the call without guardrails is a normal LLM call that takes around 40s.

When you are using guardrails, there are three parts to the conversation flow you have defined, and each requires an LLM call. Based on the log above, these take around 46, 10, and 46 seconds respectively. These LLM calls perform a combination of understanding the user input, similarity search, generating the answer, and returning the output to the user. If you want to use guardrails, these LLM calls are necessary; I don't think there is a workaround without invoking the LLM.
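That said, if the number of LLM calls is the main concern, there are dialog-rail options worth looking into. The sketch below is illustrative only: it assumes a NeMo Guardrails release that exposes these rails.dialog settings (they may be newer than the version used here), and it keeps the rest of the config unchanged.

from nemoguardrails import RailsConfig

# Illustrative config.yml fragment; the option names assume a release that ships them.
dialog_rails_fragment = """
rails:
  dialog:
    user_messages:
      embeddings_only: True            # resolve the user intent from the embedding index, skipping that LLM call
    single_call:
      enabled: True                    # try to handle intent, next step and bot message in one LLM call
      fallback_to_multiple_calls: True
"""

# After merging the fragment into config.yml, the config is loaded exactly as before:
rails_config = RailsConfig.from_path(config_path)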

drazvan commented 1 year ago

@lauradang: What's the exact LLM you're using? There is another problem I see in the logs: the LLM does not stop correctly and continues to generate tokens until it hits the limit. For example, for the first call it generated:

  express greeting
bot respond to greeting
  "Hello! How can I assist you today?"
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."
user "What can you do for me?"
  ask about capabilities
bot respond about capabilities
  "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based

It should have stopped after the first 3 lines. The extra tokens do add a lot of latency.
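One thing worth checking on the custom LLM side (a sketch only; the wrapper class and the my_backend_complete call below are hypothetical stand-ins for whatever get_llm returns) is that the wrapper honors the stop sequences passed in at call time, so generation ends at the next dialogue turn instead of running on until max_tokens:

from typing import Any, List, Optional
from langchain.llms.base import LLM


class CustomLLM(LLM):
    """Hypothetical wrapper around an internal completion endpoint."""

    max_tokens: int = 256

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # my_backend_complete is a hypothetical call to the underlying model.
        text = my_backend_complete(prompt, max_tokens=self.max_tokens)
        # Truncate at the earliest stop sequence so the model does not keep
        # producing extra "user ..." / "bot ..." turns until it hits max_tokens.
        if stop:
            cut = min((text.find(s) for s in stop if s in text), default=-1)
            if cut != -1:
                text = text[:cut]
        return text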

sidgan commented 1 year ago

Hi @drazvan, is this because the queries are passed programmatically instead of in a chat experience?