aws-samples / bedrock-claude-chat

AWS-native chatbot using Bedrock + Claude (+Mistral)
MIT No Attribution

[Feature Request / Question] Why use the full input for RAG with Cohere instead of sending a condensed version? #340


jeremylatorre commented 4 weeks ago

Describe the solution you'd like

When using a custom bot, the maximum context window is effectively reduced to 2048 tokens because the embedding step is fully managed by Cohere. Why not use Claude 3 to condense the input question down to a maximum of 2048 tokens, and then send the condensed question to Cohere? This would avoid client-side errors when the user submits a large input (a rough sketch is below).

I've already discussed this with an AWS Solutions Architect in France, who agreed with the approach.
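
For illustration only, here is a minimal sketch of the proposed condensation step, assuming the backend talks to Bedrock through boto3's `converse` API. The `condense_query` helper, its prompt, and the rough characters-per-token estimate are hypothetical and not part of the project; the 2048-token limit and the use of a Claude 3 model come from this request.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# The 2048-token limit is the figure mentioned in this issue; the model ID and
# the ~4 characters/token estimate below are illustrative assumptions.
MAX_EMBED_TOKENS = 2048
CONDENSER_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"


def condense_query(query: str) -> str:
    """Use a Claude 3 model to shorten an over-long question before it is
    passed to the Cohere-managed embedding/RAG step."""
    if len(query) // 4 <= MAX_EMBED_TOKENS:
        # Rough token estimate; short inputs skip the extra model call.
        return query

    prompt = (
        "Rewrite the following question so that it keeps every essential detail "
        f"but fits within roughly {MAX_EMBED_TOKENS} tokens:\n\n{query}"
    )
    response = bedrock.converse(
        modelId=CONDENSER_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": MAX_EMBED_TOKENS, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]
```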

Why the solution is needed

Avoid errors in RAG when users submit inputs larger than the context window.

Additional context

Implementation feasibility

Are you willing to discuss the solution with us, decide on the approach, and assist with the implementation?

jeremylatorre commented 3 weeks ago

I think we could use Haiku to limit the cost of rephrasing when using RAG.
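
On Bedrock that would just mean pointing the condensation call at the Haiku model ID (the constant name is from the hypothetical sketch above):

```python
# Claude 3 Haiku is the cheapest/fastest Claude 3 option for simple rephrasing.
CONDENSER_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
```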

statefb commented 3 weeks ago

I agree, but it will add latency to the response. This feature should be optional.
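
As a sketch of the opt-in behaviour, assuming the hypothetical `condense_query` helper above and a new setting that the project does not currently define:

```python
import os

# Hypothetical toggle, not an existing project setting: condensation is only
# applied when explicitly enabled, so the default path keeps its current latency.
ENABLE_QUERY_CONDENSATION = (
    os.environ.get("ENABLE_QUERY_CONDENSATION", "false").lower() == "true"
)


def build_rag_query(user_input: str) -> str:
    if ENABLE_QUERY_CONDENSATION:
        return condense_query(user_input)  # extra Bedrock round trip
    return user_input
```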