Storage container names don't match storage_name parameter - is this causing indexing to fail?

brian-mayer commented 1 week ago

Describe the bug I've deployed the infrastructure, and everything appears to have deployed successfully. I am able to walk through the Jupyter Quickstart notebook and use the API to upload the recommended sample UTF-8 text documents. Indexing seems to start, according to the API message, but stops at 6.25% or 12.5%. No indexes ever show up on the Azure AI Search instance.

To Reproduce Steps to reproduce the behavior:

Deploy accelerator solution
Use Jupyter notebook Quickstart to walk through API calls

Upload sample UTF-8 files successfully into the blob container. However, the containers are named with random identifier strings rather than the specified storage_name parameter - example: 345yu37291db2aa8ced66f43edw5f6n7
Try to start an indexing job using notebook API call
Indexing job initiates but fails - either at 6.25% or 12.5%

The API returns the following when queried for status: { "status_code": 200, "index_name": "wiki-articles-index", "storage_name": "wiki-articles-storage", "status": "failed", "percent_complete": 12.5, "progress": "2 out of 16 workflows completed successfully." }
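For readers comparing their own status responses, here is a minimal sketch that turns the payload above into a one-line summary. The helper name `summarize_status` is hypothetical and not part of the accelerator's API; it assumes only the field names and the "N out of M workflows" wording shown in the response above. On that reading, 12.5% corresponds to 2 of 16 workflows, and 6.25% would correspond to 1 of 16.

```python
import re


def summarize_status(payload: dict) -> str:
    """Summarize an index-status payload (hypothetical helper, not an accelerator API)."""
    status = payload.get("status", "unknown")
    # Assumes the progress string keeps the "N out of M workflows" wording from the report.
    match = re.match(r"(\d+) out of (\d+) workflows", payload.get("progress", ""))
    if not match:
        return f"{status}: progress unavailable"
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return f"{status}: progress unavailable"
    percent = 100 * done / total
    return f"{status}: {done}/{total} workflows ({percent:g}%)"


# The payload reported in this issue.
sample = {
    "status_code": 200,
    "index_name": "wiki-articles-index",
    "storage_name": "wiki-articles-storage",
    "status": "failed",
    "percent_complete": 12.5,
    "progress": "2 out of 16 workflows completed successfully.",
}
print(summarize_status(sample))
```

A job that stops this early has failed in one of the first workflows, so the per-job log is the place to look rather than the container naming.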

Expected behavior I expect the index to be built so that I can query it.

Desktop (please complete the following information):

OS: MacOS
Version 14.4.1

Additional context I've tried restarting the graphrag AKS containers and stripping the input down to a single file. Neither changed the outcome: no indexing appears to happen. Is this related to the container names not matching the storage_name parameter input in the Jupyter Quickstart cell?

jgbradley1 commented 5 days ago

Hello @brian-mayer! The storage_name will not match the actual name of the blob container. For a better security posture, we first sanitize the name provided by an API end user by computing a hash, and we use that hash as the actual blob container name. To be exact, the hash calculation from a user-provided storage_name string is done in this function.
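To illustrate the idea, here is a minimal sketch of mapping a user-provided name to a hash-based container name. It is an assumption, not the accelerator's confirmed implementation: it uses SHA-256 truncated to 128 bits, and the function name `container_name_for` is hypothetical. The linked function is the authoritative version. The result is 32 lowercase hex characters, which satisfies Azure's container naming rules (3 to 63 characters; lowercase letters, numbers, and hyphens).

```python
import hashlib


def container_name_for(user_name: str) -> str:
    """Derive a container name from a user-provided name.

    Assumption: SHA-256 truncated to 16 bytes (128 bits). The accelerator's
    real function may use a different hash or length.
    """
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    return digest[:16].hex()  # 16 bytes -> 32 lowercase hex characters


name = container_name_for("wiki-articles-storage")
print(name, len(name))
```

Because the mapping is deterministic, the same storage_name always resolves to the same container, so a mismatch between the two names is expected and is not in itself a cause of indexing failures.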

jgbradley1 commented 3 days ago

To assist with debugging, there are two places you can look for additional logging. In the Azure Storage instance that gets deployed within the resource group at deployment time, there will be a blob container named reports. It holds a continuously running log of the FastAPI application, so if there are errors, you might see them logged there. Also, within the blob container that is associated with the hash of the index_name you tried to build, there is a reports directory that contains a log file associated with the indexing job. That file will contain all output from running the indexing job. If you tried to run the same indexing job multiple times, there will be a separate log file per attempt.
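As a rough sketch of how to scan those log files without opening each one by hand, the helper below walks the blobs under a prefix and collects lines that look like errors. The function names, the `reports/` prefix, and the keyword list are assumptions for illustration; the actual log layout and format may differ. It relies only on `list_blobs(name_starts_with=...)` and `download_blob(...).readall()` from the `azure-storage-blob` `ContainerClient`.

```python
def find_error_lines(container, prefix="reports/",
                     keywords=("error", "exception", "traceback")):
    """Return {blob_name: [matching lines]} for log blobs under `prefix`.

    `container` is expected to behave like azure.storage.blob.ContainerClient.
    The prefix and keywords are assumptions, not documented accelerator values.
    """
    hits = {}
    for blob in container.list_blobs(name_starts_with=prefix):
        raw = container.download_blob(blob.name).readall()
        text = raw.decode("utf-8", errors="replace")
        matched = [line for line in text.splitlines()
                   if any(k in line.lower() for k in keywords)]
        if matched:
            hits[blob.name] = matched
    return hits


def open_container(connection_string, container_name):
    """Open a container client; requires the azure-storage-blob package."""
    from azure.storage.blob import ContainerClient  # imported lazily on purpose
    return ContainerClient.from_connection_string(connection_string, container_name)
```

For the per-job log, pass the hashed container name for your index to `open_container`. For the FastAPI application log, open the container named reports and pass `prefix=""`, since in that case the container itself is the log location.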

We are looking into hooking these logs up to App Insights so you don’t have to hunt for these log files manually. The code to support App Insights is in the codebase, but it has not been fully retested since some recent changes we made, so we have not yet turned this form of logging back on by default.

We will look into it soon and try to get better logging enabled by default again.

Azure-Samples / graphrag-accelerator

Storage container names don't match storage_name parameter - is this causing indexing to fail? #47