aws-samples / aws-genai-llm-chatbot

A modular and comprehensive solution to deploy a Multi-LLM and Multi-RAG powered chatbot (Amazon Bedrock, Anthropic, HuggingFace, OpenAI, Meta, AI21, Cohere, Mistral) using AWS CDK on AWS
https://aws-samples.github.io/aws-genai-llm-chatbot/
MIT No Attribution

feat: Add token usage to Bedrock Claude + Migrated chain for this model #564

Closed charles-marion closed 2 months ago

charles-marion commented 2 months ago

Issue #, if available:

#502 #495 #230

Description of changes:

To add usage tracking to Bedrock models, I migrated the LangChain chain for the Claude model (ConversationChain is deprecated).

Instead, it now uses RunnableWithMessageHistory with ChatBedrockConverse, which relies on the Bedrock Converse API; that API is consistent across models and provides token usage in the response.
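For reference, a minimal sketch of the new pattern (not the PR's exact code; the model id, prompt text, and in-memory history store below are placeholders, whereas the chatbot itself persists history in DynamoDB):

from langchain_aws import ChatBedrockConverse
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

# ChatBedrockConverse calls the Bedrock Converse API under the hood.
llm = ChatBedrockConverse(model="anthropic.claude-3-sonnet-20240229-v1:0")

prompt = ChatPromptTemplate.from_messages([
    ("system", "The following is a friendly conversation between a human and an AI."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])

# Placeholder in-memory session store for the sake of a runnable example.
sessions = {}

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in sessions:
        sessions[session_id] = InMemoryChatMessageHistory()
    return sessions[session_id]

chain = RunnableWithMessageHistory(
    prompt | llm,
    get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

response = chain.invoke(
    {"input": "test"},
    config={"configurable": {"session_id": "session-1"}},
)

# The Converse API reports usage on the response message, e.g.
# {"input_tokens": 42, "output_tokens": 17, "total_tokens": 59}
print(response.usage_metadata)

Because the Converse API is consistent across Bedrock models, the same usage metadata is available regardless of which underlying model is selected.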

Changes

Testing

Future Change

Note: This change modifies the prompts to match the new LangChain patterns. For example:

Before

The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: test
AI: I'm afraid I don't have enough context to answer that question. Could you please provide more details?
Human: test
AI: I apologize,...

After

System: The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.
Human: test
AI: I'm afraid I don't have enough context to answer your question. Could you please provide more details?
Human: test
AI: I don't have enough information to answer your question. The context provided mentions an Integ Test flower that is yellow, but does not include a direct question.

Screenshots: message metadata showing token usage, and the usage dashboard.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

gbone-restore commented 2 months ago

With the older ConversationRetrievalChain, I limited how much history I would pass into a model. In my organization, we are seeing chat histories grow across a variety of topics, which can cause inaccurate rephrasing of questions.

I subclassed ConversationBufferMemory to give a rolling window of conversation history that is a smaller subset of the entire history.

eg:

from langchain.memory import ConversationBufferMemory
from typing import Dict, List, Any
from pydantic import Field

class WindowedConversationBufferMemory(ConversationBufferMemory):
    k: int = Field(default=2, description="Number of recent conversations to keep")

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        # Save the full context to the underlying storage (DynamoDB)
        super().save_context(inputs, outputs)

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Load the full history from the underlying storage
        result = super().load_memory_variables(inputs)

        # If there's no history, return an empty list or dict
        if self.memory_key not in result or not result[self.memory_key]:
            return {self.memory_key: [] if self.return_messages else ""}

        # Windowing: Only return the last k conversations
        if self.return_messages:
            result[self.memory_key] = result[self.memory_key][-2*self.k:]
        else:
            conversations = result[self.memory_key].split('\n\nHuman: ')
            recent_conversations = conversations[-min(self.k, len(conversations)):]
            result[self.memory_key] = '\n\nHuman: '.join(recent_conversations).strip()

        return result

I want to do something similar with RunnableWithMessageHistory but I'm still getting up to speed on this new API. Do you think that limiting the message history to a smaller slice of data is an important feature?

charles-marion commented 2 months ago


The memory used by RunnableWithMessageHistory in this change is this class https://github.com/aws-samples/aws-genai-llm-chatbot/blob/9de3e559a4d744aab2091290e008cd620d9cb5a2/lib/shared/layers/python-sdk/python/genai_core/langchain/chat_message_history.py#L48

To implement it, I would just add a "max messages returned" parameter here: https://github.com/aws-samples/aws-genai-llm-chatbot/blob/9de3e559a4d744aab2091290e008cd620d9cb5a2/lib/model-interfaces/langchain/functions/request-handler/adapters/base/base.py#L111 (because you still want to store and return the full history when viewing the session).

This would keep it independent of the chain.
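A rough sketch of the idea (hypothetical class and parameter names, not the project's actual implementation): wrap the stored history and hand only the most recent messages to the chain, while still persisting everything.

from typing import List

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage


class WindowedChatMessageHistory(BaseChatMessageHistory):
    """Exposes only the most recent messages to the chain while the full
    conversation stays in the underlying (e.g. DynamoDB-backed) history."""

    def __init__(self, backing_history: BaseChatMessageHistory, max_messages: int = 4):
        self.backing_history = backing_history
        self.max_messages = max_messages

    @property
    def messages(self) -> List[BaseMessage]:
        # Only the last N messages are sent to the model.
        return self.backing_history.messages[-self.max_messages:]

    def add_message(self, message: BaseMessage) -> None:
        # The full history is still stored so the session view stays complete.
        self.backing_history.add_message(message)

    def clear(self) -> None:
        self.backing_history.clear()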

Do you think that limiting the message history to a smaller slice of data is an important feature? I do agree, since it would reduce the number of tokens used, but it would need to be configurable somewhere.