Closed lambda-science closed 1 month ago
After more testing, the Cohere Bedrock implementation is completely broken. In pure boto3, using the example from the AWS docs, whatever I put for `{"truncate": "END"}`, the answer is always:

```
A client error occurred: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 2127, please reformat your input and try again.
```

So I'm not sure it's Haystack's fault or that we can do anything about it.
Fix your **** Amazon.
Sample of the buggy code below. Replace these lines with your own profile and region:

```python
session = boto3.Session(profile_name='<YOUR_AWS_PROFILE>')
bedrock = session.client(service_name='bedrock-runtime', region_name="<YOUR_AWS_REGION>")
```
```python
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Shows how to generate text embeddings using the Cohere Embed English model.
"""
import json
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


def generate_text_embeddings(model_id, body):
    """
    Generate text embeddings by using the Cohere Embed model.

    Args:
        model_id (str): The model ID to use.
        body (str): The request body to use.

    Returns:
        dict: The response from the model.
    """
    logger.info(
        "Generating text embeddings with the Cohere Embed model %s", model_id)

    accept = '*/*'
    content_type = 'application/json'

    session = boto3.Session(profile_name='<YOUR_AWS_PROFILE>')
    bedrock = session.client(service_name='bedrock-runtime', region_name="<YOUR_AWS_REGION>")

    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    logger.info("Successfully generated text with Cohere model %s", model_id)
    return response


def main():
    """
    Entrypoint for Cohere Embed example.
    """
    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    model_id = 'cohere.embed-english-v3'
    # 56 space-separated repetitions: 2127 characters, just over the 2048 limit
    text1 = " ".join(["more than the limit (2048 characters)"] * 56)
    input_type = "search_document"
    embedding_types = ["int8", "float"]

    try:
        body = json.dumps({
            "texts": [text1],
            "input_type": input_type,
            "truncate": "NONE",
            "embedding_types": embedding_types
        })

        response = generate_text_embeddings(model_id=model_id, body=body)
        response_body = json.loads(response.get('body').read())

        print(f"ID: {response_body.get('id')}")
        print(f"Response type: {response_body.get('response_type')}")

        print("Embeddings")
        for i, embedding in enumerate(response_body.get('embeddings')):
            print(f"\tEmbedding {i}")
            print(*embedding)

        print("Texts")
        for i, text in enumerate(response_body.get('texts')):
            print(f"\tText {i}: {text}")

    except ClientError as err:
        message = err.response["Error"]["Message"]
        logger.error("A client error occurred: %s", message)
        print("A client error occurred: " + format(message))
    else:
        print(f"Finished generating text embeddings with Cohere model {model_id}.")


if __name__ == "__main__":
    main()
```
@lambda-science I have investigated the issue.

Unrelated to the error: the `__init__` method of the component contains `**kwargs`, so you should initialize it as follows:

```python
text_embedder = AmazonBedrockTextEmbedder(
    model="cohere.embed-multilingual-v3",
    batch_size=64,
    aws_region_name=Secret.from_token("eu-west-3"),
    input_type="search_document",
    truncate="END",
)
```
Other than that, I can reproduce the error, and I also looked at the Bedrock docs and the Cohere docs.

In short, the context length of `cohere.embed-multilingual-v3.0` is 512 tokens, while the Bedrock API accepts a maximum of 2048 characters. The `truncate` parameter works for texts longer than 512 tokens but shorter than 2048 characters; above that limit, the API returns an error.
To reproduce:

```python
text_embedder = ...

# this works (1999 characters)
text = " ".join("A" * 1000)
res = text_embedder.run(text=text)
print("RES:", res)

# this gives an error (9999 characters)
text = " ".join("A" * 5000)
res = text_embedder.run(text=text)
print("RES:", res)
```
So, if the text is longer than 2048 characters, the user has to truncate it themselves.
Feel free to close if everything is clear.
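Until that changes on the AWS side, a client-side workaround is to cut each text before it reaches the embedder. A minimal sketch (the 2048-character cap comes from the Bedrock error above; `truncate_for_bedrock` is a hypothetical helper name, not part of Haystack or boto3):

```python
# Sketch: cut texts to Bedrock's 2048-character request cap before embedding.
# The helper name is illustrative; the cap value comes from the error message
# in this issue ("expected maxLength: 2048").
BEDROCK_MAX_CHARS = 2048


def truncate_for_bedrock(text: str, max_chars: int = BEDROCK_MAX_CHARS) -> str:
    """Keep only the first max_chars characters, mimicking truncate="END"."""
    return text if len(text) <= max_chars else text[:max_chars]


# The over-limit text from this issue: 56 repetitions, 2127 characters.
long_text = " ".join(["more than the limit (2048 characters)"] * 56)
safe_text = truncate_for_bedrock(long_text)
print(len(long_text), len(safe_text))  # → 2127 2048
```

Doing this before calling `AmazonBedrockTextEmbedder.run` avoids the `ValidationException` regardless of the `truncate` setting.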
Sorry for the late answer. Yeah, while investigating I noticed the same thing. It's a bit of a stupid implementation on AWS's side, but it's not Haystack-related, so I'm closing :)
Describe the bug
When calling the Cohere embedding model with a value for the "truncate" parameter, the parameter does not work as expected: if the maximum character limit is exceeded, the call still raises an error, even with truncate = "END" or "START".
To Reproduce
```
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 2128, please reformat your input and try again.
```
I specified `"truncate": "END"`, so it shouldn't crash, per the official documentation (https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html): "If you specify START or END, the model discards the input until the remaining input is exactly the maximum input token length for the model." See also: https://docs.cohere.com/reference/embed
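For reference, the documented request body for Cohere Embed on Bedrock carries `truncate` alongside `texts` and `input_type`. A minimal sketch of building such a body (values mirror the ones used in this issue; no request is actually sent):

```python
import json

# Sketch of the Cohere Embed request body as documented for Bedrock.
# "END" is supposed to discard input past the model's maximum token length.
body = json.dumps({
    "texts": ["some document text"],
    "input_type": "search_document",
    "truncate": "END",
})
print(json.loads(body)["truncate"])  # → END
```

This is the same shape of body that `bedrock.invoke_model(body=body, modelId=...)` receives, which is why the `ValidationException` above points at `#/texts/0`.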
In the code, that parameter is taken into consideration, but it's not working? TextEmbedder: https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/src/haystack_integrations/components/embedders/amazon_bedrock/text_embedder.py#L144-L146 DocumentEmbedder: https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/src/haystack_integrations/components/embedders/amazon_bedrock/document_embedder.py#L168-L176
Describe your environment (please complete the following information):