aws-samples / aws-genai-llm-chatbot

A modular and comprehensive solution to deploy a Multi-LLM and Multi-RAG powered chatbot (Amazon Bedrock, Anthropic, HuggingFace, OpenAI, Meta, AI21, Cohere, Mistral) using AWS CDK on AWS
https://aws-samples.github.io/aws-genai-llm-chatbot/
MIT No Attribution

feat: Add token usage to Bedrock Claude + Migrated chain for this model #564

Closed charles-marion closed 2 months ago

charles-marion commented 2 months ago

Issue #, if available:

#502 #495 #230

Description of changes:

To add usage tracking to Bedrock models, I migrated the LangChain chain for the Claude model (ConversationChain is deprecated).

Instead, it now uses RunnableWithMessageHistory with ChatBedrockConverse, which relies on the Bedrock Converse API; that API is consistent across models and provides token usage in the response.
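For reference, a minimal sketch of the new pattern (not the PR's exact code; the model id, prompt text, and in-memory history store below are placeholders, whereas the chatbot itself persists history in DynamoDB):

from langchain_aws import ChatBedrockConverse
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

# ChatBedrockConverse calls the Bedrock Converse API under the hood.
llm = ChatBedrockConverse(model="anthropic.claude-3-sonnet-20240229-v1:0")

prompt = ChatPromptTemplate.from_messages([
    ("system", "The following is a friendly conversation between a human and an AI."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])

# Placeholder in-memory session store for the sake of a runnable example.
sessions = {}

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in sessions:
        sessions[session_id] = InMemoryChatMessageHistory()
    return sessions[session_id]

chain = RunnableWithMessageHistory(
    prompt | llm,
    get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

response = chain.invoke(
    {"input": "test"},
    config={"configurable": {"session_id": "session-1"}},
)

# The Converse API reports usage on the response message, e.g.
# {"input_tokens": 42, "output_tokens": 17, "total_tokens": 59}
print(response.usage_metadata)

Because the Converse API is consistent across Bedrock models, the same usage metadata is available regardless of which underlying model is selected.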

Changes

Testing

Future Change

Note: This change modifies the prompts to match the new LangChain patterns. For example:

Before

The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: test
AI: I'm afraid I don't have enough context to answer that question. Could you please provide more details?
Human: test
AI: I apologize,...

After

System: The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.
Human: test
AI: I'm afraid I don't have enough context to answer your question. Could you please provide more details?
Human: test
AI: I don't have enough information to answer your question. The context provided mentions an Integ Test flower that is yellow, but does not include a direct question.

Screenshots: message metadata showing token usage, and the usage dashboard.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

gbone-restore commented 2 months ago

With the older ConversationRetrievalChain, I limited how much history I would pass into a model. In my organization, we are seeing chat histories grow across a variety of topics, which can cause inaccurate rephrasing of questions.

I subclassed ConversationBufferMemory to give a rolling window of conversation history that is a smaller subset of the entire history.

eg:

from langchain.memory import ConversationBufferMemory
from typing import Dict, List, Any
from pydantic import Field

class WindowedConversationBufferMemory(ConversationBufferMemory):
    k: int = Field(default=2, description="Number of recent conversations to keep")

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        # Save the full context to the underlying storage (DynamoDB)
        super().save_context(inputs, outputs)

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Load the full history from the underlying storage
        result = super().load_memory_variables(inputs)

        # If there's no history, return an empty list or dict
        if self.memory_key not in result or not result[self.memory_key]:
            return {self.memory_key: [] if self.return_messages else ""}

        # Windowing: Only return the last k conversations
        if self.return_messages:
            result[self.memory_key] = result[self.memory_key][-2*self.k:]
        else:
            conversations = result[self.memory_key].split('\n\nHuman: ')
            recent_conversations = conversations[-min(self.k, len(conversations)):]
            result[self.memory_key] = '\n\nHuman: '.join(recent_conversations).strip()

        return result

I want to do something similar with RunnableWithMessageHistory but I'm still getting up to speed on this new API. Do you think that limiting the message history to a smaller slice of data is an important feature?

charles-marion commented 2 months ago


The memory used by RunnableWithMessageHistory in this change is this class https://github.com/aws-samples/aws-genai-llm-chatbot/blob/9de3e559a4d744aab2091290e008cd620d9cb5a2/lib/shared/layers/python-sdk/python/genai_core/langchain/chat_message_history.py#L48

To implement it, I would just add a "max messages returned" parameter here: https://github.com/aws-samples/aws-genai-llm-chatbot/blob/9de3e559a4d744aab2091290e008cd620d9cb5a2/lib/model-interfaces/langchain/functions/request-handler/adapters/base/base.py#L111 (because you still want to store and return the full history when viewing the session).

This would keep it independent of the chain.
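A rough sketch of the idea (hypothetical class and parameter names, not the project's actual implementation): wrap the stored history and hand only the most recent messages to the chain, while still persisting everything.

from typing import List

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage


class WindowedChatMessageHistory(BaseChatMessageHistory):
    """Exposes only the most recent messages to the chain while the full
    conversation stays in the underlying (e.g. DynamoDB-backed) history."""

    def __init__(self, backing_history: BaseChatMessageHistory, max_messages: int = 4):
        self.backing_history = backing_history
        self.max_messages = max_messages

    @property
    def messages(self) -> List[BaseMessage]:
        # Only the last N messages are sent to the model.
        return self.backing_history.messages[-self.max_messages:]

    def add_message(self, message: BaseMessage) -> None:
        # The full history is still stored so the session view stays complete.
        self.backing_history.add_message(message)

    def clear(self) -> None:
        self.backing_history.clear()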

Do you think that limiting the message history to a smaller slice of data is an important feature? I do agree, since it would reduce the number of tokens used, but it would need to be configurable somewhere.