deepset-ai / haystack-core-integrations

Additional packages (components, document stores, and the like) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

Bedrock: Cohere Embed truncate parameter not working. #912

Closed lambda-science closed 1 month ago

lambda-science commented 1 month ago

Describe the bug When calling the Cohere embedding model with a value for the "truncate" parameter, the parameter does not work as expected. If the maximum character limit is exceeded, the call still raises an error, even with truncate="END" or truncate="START".

To Reproduce

from haystack.utils.auth import Secret
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockTextEmbedder
text_embedder = AmazonBedrockTextEmbedder(
    model="cohere.embed-multilingual-v3", 
    batch_size=64, 
    aws_region_name=Secret.from_token("eu-west-3"),
    kwargs={"input_type": "search_document", "truncate": "END"})
text_embedder.run("more than the limit (2048 characters) " * 55)  # > 2048 characters: crashes

botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 2128, please reformat your input and try again.

I specified "truncate": "END", so it shouldn't crash, per the official documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html

truncate – Specifies how the API handles inputs longer than the maximum token length. Use one of the following:

NONE – (Default) Returns an error when the input exceeds the maximum input token length.

START – Discards the start of the input.

END – Discards the end of the input.

If you specify START or END, the model discards the input until the remaining input is exactly the maximum input token length for the model. Here also: https://docs.cohere.com/reference/embed
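For reference, with raw boto3 the truncate option travels in the JSON request body for Cohere Embed models. A minimal sketch of how that body is built, with field names per the AWS docs quoted above (the input text is a placeholder):

```python
import json

# Minimal sketch of the Bedrock request body for Cohere Embed models.
# "truncate" accepts "NONE" (default), "START", or "END".
body = json.dumps({
    "texts": ["some input text"],
    "input_type": "search_document",
    "truncate": "END",
})
print(body)
```

This string would then be passed as `body=` to `invoke_model`, as in the full boto3 example later in this thread.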

In the code, that parameter is taken into account, but it doesn't seem to work. TextEmbedder: https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/src/haystack_integrations/components/embedders/amazon_bedrock/text_embedder.py#L144-L146 DocumentEmbedder: https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/src/haystack_integrations/components/embedders/amazon_bedrock/document_embedder.py#L168-L176

Describe your environment (please complete the following information):

lambda-science commented 1 month ago

After more testing, the Cohere Bedrock implementation appears completely broken. In pure boto3, using the example from the AWS docs, whatever I put for {"truncate": "END"}, the answer is always: A client error occurred: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 2127, please reformat your input and try again.

So I'm not sure it's Haystack's fault, or that we can do anything about it.

Fix your **** Amazon.

lambda-science commented 1 month ago

Sample of the buggy code. Replace these lines:

    session = boto3.Session(profile_name='<YOUR_AWS_PROFILE>')
    bedrock = session.client(service_name='bedrock-runtime', region_name="<YOUR_AWS_REGION>")
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Shows how to generate text embeddings using the Cohere Embed English model.
"""
import json
import logging
import boto3

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def generate_text_embeddings(model_id, body):
    """
    Generate text embedding by using the Cohere Embed model.
    Args:
        model_id (str): The model ID to use.
        body (str) : The request body to use.
    Returns:
        dict: The response from the model.
    """

    logger.info(
        "Generating text embeddings with the Cohere Embed model %s", model_id)

    accept = '*/*'
    content_type = 'application/json'
    session = boto3.Session(profile_name='<YOUR_AWS_PROFILE>')
    bedrock = session.client(service_name='bedrock-runtime', region_name="<YOUR_AWS_REGION>")

    response = bedrock.invoke_model(
        body=body,
        modelId=model_id,
        accept=accept,
        contentType=content_type
    )

    logger.info("Successfully generated text with Cohere model %s", model_id)

    return response

def main():
    """
    Entrypoint for Cohere Embed example.
    """

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    model_id = 'cohere.embed-english-v3'
    text1 = "more than the limit (2048 characters) " * 55  # > 2048 characters
    input_type = "search_document"
    embedding_types = ["int8", "float"]

    try:

        body = json.dumps({
            "texts": [text1],
            "input_type": input_type,
            "truncate": "END",  # fails even with "END" or "START"
            "embedding_types": embedding_types}
        )
        response = generate_text_embeddings(model_id=model_id,
                                            body=body)

        response_body = json.loads(response.get('body').read())

        print(f"ID: {response_body.get('id')}")
        print(f"Response type: {response_body.get('response_type')}")

        print("Embeddings")
        for i, embedding in enumerate(response_body.get('embeddings')):
            print(f"\tEmbedding {i}")
            print(*embedding)

        print("Texts")
        for i, text in enumerate(response_body.get('texts')):
            print(f"\tText {i}: {text}")

    except ClientError as err:
        message = err.response["Error"]["Message"]
        logger.error("A client error occurred: %s", message)
        print("A client error occurred: " +
              format(message))
    else:
        print(
            f"Finished generating text embeddings with Cohere model {model_id}.")

if __name__ == "__main__":
    main()
anakin87 commented 1 month ago

@lambda-science I have investigated the issue.

Unrelated to the error: the __init__ method of the component contains **kwargs, so you should initialize it as follows:

text_embedder = AmazonBedrockTextEmbedder(
    model="cohere.embed-multilingual-v3",
    batch_size=64,
    aws_region_name=Secret.from_token("eu-west-3"),
    input_type="search_document",
    truncate="END")

Other than that, I can reproduce the error and I also looked at Bedrock docs and Cohere docs.

In short, the context length of cohere.embed-multilingual-v3.0 is 512 tokens, while the Bedrock API accepts a maximum of 2048 characters. The truncate parameter works for texts longer than 512 tokens but shorter than 2048 characters; above that limit, the API returns an error.

To reproduce:

text_embedder = ...

# this works
text = " ".join("A" * 1000)
res = text_embedder.run(text=text)
print("RES:", res)

# this gives an error
text = " ".join("A" * 5000)
res = text_embedder.run(text=text)
print("RES:", res)

So, if the text is longer than 2048 characters, the user has to truncate it themselves.
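Until this is handled upstream, one workaround is to clip the text client-side before calling the embedder. A minimal sketch (the helper name is mine, not part of Haystack or Bedrock):

```python
def clip_to_bedrock_limit(text: str, max_chars: int = 2048) -> str:
    """Clip text to Bedrock's 2048-character request limit, keeping the
    start and discarding the end (the same effect as truncate="END")."""
    return text[:max_chars]
```

You would then call `text_embedder.run(clip_to_bedrock_limit(text))`. Note that within the 2048-character window, the model's own 512-token truncation (via the truncate parameter) still applies.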

Feel free to close if everything is clear.

lambda-science commented 1 month ago

Sorry for the late answer. Yeah, while investigating I noticed the same thing. It's an odd implementation from AWS, but it's not Haystack-related, so I'm closing :)