Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Embeddings vector dimensions mismatch indexer error #1825

Open DuboisABB opened 1 month ago

DuboisABB commented 1 month ago

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting

Minimal steps to reproduce

Set .env variables as follows:

    AZURE_OPENAI_EMB_DEPLOYMENT="text-embedding-3-large"
    AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY=350
    AZURE_OPENAI_EMB_DEPLOYMENT_VERSION=1
    AZURE_OPENAI_EMB_DIMENSIONS=1536
    USE_FEATURE_INT_VECTORIZATION="true"

Then do azd up

Any log messages given by the failure

When the indexer tries to run, it fails with this:

There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '3072'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.
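For context on where the 3072 comes from: text-embedding-3-large returns 3072-dimensional vectors unless a dimensions value is passed explicitly on the embeddings call. A minimal sketch with the openai package (assuming key auth and a deployment named text-embedding-3-large; not the repo's code) shows the behavior:

    import os
    from openai import AzureOpenAI

    # Sketch only: assumes key auth and a deployment named "text-embedding-3-large".
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    native = client.embeddings.create(model="text-embedding-3-large", input="hello")
    shortened = client.embeddings.create(model="text-embedding-3-large", input="hello", dimensions=1536)

    print(len(native.data[0].embedding))     # 3072: the model's native size
    print(len(shortened.data[0].embedding))  # 1536: only when dimensions is passed explicitly

So unless the skillset passes dimensions explicitly, the skill emits 3072-length vectors while the index field expects 1536, which matches the error above.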

When inspecting the code for gptkbindex-skillset in the portal, I notice this bit of code:

{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "#2",
  "description": "Skill to generate embeddings via Azure OpenAI",
  "context": "/document/pages/*",
  "resourceUri": "https://cog-trnz2cbjn4ofs.openai.azure.com",
  "apiKey": null,
  "deploymentId": "text-embedding-3-large",
  "dimensions": null,
  "modelName": null

So dimensions and modelName are null. Additionally, there is this warning in a banner above the code:

This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality.

If I manually change the skillset code in the portal to this, it works:

      "dimensions": 1536,
      "modelName": "text-embedding-3-large",

I tried to change the code in integratedvectorizerstrategy.py to this:

        import os
        embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
        embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

        embedding_skill = AzureOpenAIEmbeddingSkill(
            description="Skill to generate embeddings via Azure OpenAI",
            context="/document/pages/*",
            resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
            deployment_id=self.embeddings.open_ai_deployment,
            dimensions=embeddingDimensions,
            model_name=embeddingModelName,
            inputs=[
                InputFieldMappingEntry(name="text", source="/document/pages/*"),
            ],
            outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
        )

However, for some reason, this doesn't change the code for the skillset that I see in the portal, even if I delete the skillset completely to make sure that it gets regenerated.
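For what it's worth, one way to confirm what actually got deployed, independent of the portal view, is to pull the skillset back with the SDK. A rough diagnostic sketch (endpoint and names are placeholders):

    # Diagnostic sketch: fetch the deployed skillset and inspect the embedding skill
    # as the installed SDK parses it. Endpoint and skillset name are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient

    client = SearchIndexerClient(
        endpoint="https://<your-search-service>.search.windows.net",
        credential=DefaultAzureCredential(),
    )
    skillset = client.get_skillset("gptkbindex-skillset")
    for skill in skillset.skills:
        print(type(skill).__name__, getattr(skill, "dimensions", None), getattr(skill, "model_name", None))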

Expected/desired behavior

No indexer error.

OS and Version?

Windows 11

azd version?

azd version 1.9.5 (commit cd2b7af9995d358aab33c782614f801ac1997dde)

Versions

I merged the last commit from 2024-07-16 (main #1789) into my local fork. So I do have some local code modifications but AFAIK, none that would affect this.

DuboisABB commented 1 month ago

OK, so I just read in the docs that integrated vectorization is incompatible with the newer embedding models: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/deploy_features.md#enabling-authentication

However, MS docs seem to indicate that it's indeed compatible: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal-import-vectors?tabs=sample-data-storage%2Cmodel-aoai

So I guess that this bug report is turning into a feature request.

DuboisABB commented 1 month ago

On further investigation, it looks like the AzureOpenAIEmbeddingSkill class doesn't support dimensions or model_name in .venv\Lib\site-packages\azure\search\documents\indexes\_generated\models\_models_py3.py

However, the documentation for that class indicates that it should support those parameters: https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.azureopenaiembeddingskill?view=azure-python-preview

So for some reason we are using an old SDK. I'm at the limit of my knowledge at this point; I have no idea how to use the latest SDK.
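For anyone else hitting this, here is a quick way to see which SDK is installed and whether its generated model exposes the new fields. This is only a sketch; the _attribute_map check assumes the msrest-generated model layout that _models_py3.py uses:

    # Sketch: which azure-search-documents is installed, and does its
    # AzureOpenAIEmbeddingSkill model know about "dimensions" / "model_name"?
    from importlib.metadata import version
    from azure.search.documents.indexes.models import AzureOpenAIEmbeddingSkill

    print(version("azure-search-documents"))
    # _attribute_map is an assumption about the generated (msrest-style) model layout.
    print("dimensions" in AzureOpenAIEmbeddingSkill._attribute_map)
    print("model_name" in AzureOpenAIEmbeddingSkill._attribute_map)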

DuboisABB commented 1 month ago

One more update... I figured out how to get the latest SDK. I changed this line in requirements.txt:

    azure-search-documents==11.6.0b4

Then azd up correctly updates _models_py3.py with the updated AzureOpenAIEmbeddingSkill class, and this modified code seems to work (I just added the dimensions and model_name parameters):

        import os
        embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
        embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

        embedding_skill = AzureOpenAIEmbeddingSkill(
            description="Skill to generate embeddings via Azure OpenAI",
            context="/document/pages/*",
            resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
            deployment_id=self.embeddings.open_ai_deployment,
            dimensions=embeddingDimensions,
            model_name=embeddingModelName,
            inputs=[
                InputFieldMappingEntry(name="text", source="/document/pages/*"),
            ],
            outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
        )

However, I still get this error at the prepdocs.py step:

File "C:\Programming\bochat9.venv\Lib\site-packages\azure\search\documents\indexes_generated\aio\operations_indexes_operations.py", line 192, in create raise HttpResponseError(response=response, model=error) azure.core.exceptions.HttpResponseError: () The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'. Code: Message: The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'.

DuboisABB commented 1 month ago

One more required change: we need to add model_name to the AzureOpenAIParameters passed to AzureOpenAIVectorizer:

        await search_manager.create_index(
            vectorizers=[
                AzureOpenAIVectorizer(
                    name=f"{self.search_info.index_name}-vectorizer",
                    kind="azureOpenAI",
                    azure_open_ai_parameters=AzureOpenAIParameters(
                        resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
                        deployment_id=self.embeddings.open_ai_deployment,
                        model_name=embeddingModelName,  # Added this line to be compatible with API version '2024-05-01-preview'
                    ),
                ),
            ]
        )

I had also forgotten that I changed this bit in strategy.py:

    def create_search_indexer_client(self) -> SearchIndexerClient:
        return SearchIndexerClient(endpoint=self.endpoint, credential=self.credential, api_version="2024-05-01-preview")

Now prepdocs.py runs without errors.
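As a sanity check, fetching the index back shows that the vectorizer now carries the model name. A sketch (endpoint and index name are placeholders):

    # Sketch: fetch the index back and confirm the vectorizer carries the model name.
    # Endpoint and index name are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexClient

    client = SearchIndexClient(
        endpoint="https://<your-search-service>.search.windows.net",
        credential=DefaultAzureCredential(),
    )
    index = client.get_index("gptkbindex")
    for v in index.vector_search.vectorizers:
        print(v.name, v.azure_open_ai_parameters.model_name)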

advanced-flow commented 3 weeks ago

I deployed the app with the new "text-embedding-3-large" supporting 3072 dimensions and had no problems with it.

It is not only important that the skillset of the indexer is set up for the correct dimensions, but also that the "embedding" field of the index is set up for these 3072 dimensions.
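Roughly, the embedding field has to end up defined like this (a sketch with the SDK models, not the repo's exact code; the profile name is a placeholder):

    # Sketch only: the index field and the skillset/vectorizer must agree on the
    # dimension count, here 3072 for text-embedding-3-large's native size.
    from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

    embedding_field = SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,
        vector_search_profile_name="embedding_config",  # placeholder profile name
    )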

It should work if you set the env variable AZURE_OPENAI_EMB_DIMENSIONS=3072 before running azd up.