Azure-Samples / graphrag-accelerator

One-click deploy of a Knowledge Graph powered RAG (GraphRAG) in Azure
https://github.com/microsoft/graphrag
MIT License
1.65k stars · 250 forks

[BUG] - "Indexing failed at 12.5 %" #139

Open doruit opened 1 month ago

doruit commented 1 month ago

Describe the bug The indexing job gets stuck. After this message:

<Response [200]>
{"status":"Indexing operation scheduled"}

I'm checking the status every now and then; after a while I get this:

{
    'status_code': 200,
    'index_name': 'index-2',
    'storage_name': 'testdata1',
    'status': 'failed',
    'percent_complete': 12.5,
    'progress': '2 out of 16 workflows completed successfully.',
}
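As an aside, checking the status by hand can be scripted instead of re-running the notebook cell. A minimal sketch, assuming the API exposes a GET status route shaped like the responses above; the gateway URL, route, and header name are placeholders to substitute from your own deployment:

```python
import json
import time
import urllib.request

APIM_URL = "https://<apim-gateway>.azure-api.net"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<key>"}    # placeholder

TERMINAL_STATES = {"complete", "failed"}

def is_terminal(status: dict) -> bool:
    """A job is done once it reports either terminal state."""
    return status.get("status") in TERMINAL_STATES

def wait_for_index(index_name: str, poll_seconds: int = 60) -> dict:
    """Poll the (assumed) status route until the job completes or fails."""
    while True:
        req = urllib.request.Request(
            f"{APIM_URL}/index/status/{index_name}", headers=HEADERS
        )
        with urllib.request.urlopen(req) as resp:
            status = json.load(resp)
        print(f"{status['percent_complete']:.1f}% - {status['status']}")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```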

To Reproduce Steps to reproduce the behavior:

  1. Follow the deployment guide
  2. Download a small set of Wikipedia articles
  3. Install all dependencies for the Quickstart notebook "1-Quickstart.ipynb"
  4. Run the notebook
  5. Validate that all steps up to the indexing job run successfully
  6. At the "Build an Index" step the response is "{"status":"Indexing operation scheduled"}", however the index does not seem to be created
  7. At the "Check status of an indexing job" step, the job gets stuck at 'percent_complete': 12.5
  8. I checked the AI Search service to see whether the index was created at some point, but it was not created at all

Expected behavior I expect the indexing job to finish successfully.

Screenshots n/a

Desktop (please complete the following information):

Additional context n/a

timothymeyers commented 4 weeks ago

Any luck @doruit? Did you happen to try running again?

When you kick off an indexing run, a Kubernetes job is spun up (within about 5 minutes). If you ran deploy.sh, you should be able to run

watch kubectl get jobs -n graphrag

and wait for the indexing job to appear. Then

kubectl logs job/<indexing job name> -n graphrag -f

to watch the logs and monitor progress. You'll possibly see some 503 and 429 errors, which is normal: the indexer runs out of tokens and has to wait for the rate limiter to let it back in. (There's ongoing work to clean this up.)

But if your indexer dies for some reason, you'll be able to see what happened when it did.
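For readers unfamiliar with the 429/503 dance: the usual remedy is to retry with an exponential backoff once the rate limiter refuses a request. A minimal sketch of that general pattern (not the accelerator's actual retry code):

```python
import random
import time

RETRYABLE = {429, 503}  # rate-limited / service unavailable

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ... capped at `cap`."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

def call_with_retries(send, max_attempts: int = 6):
    """Call `send()` (returning (status_code, body)) until the status
    is not retryable, sleeping between attempts."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("rate limited on every attempt")
```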

doruit commented 3 weeks ago

@timothymeyers, I just did a fresh deployment to rule out some possible causes.

I've checked the storage account; it seems that the files are uploaded to a container with a random name, whereas I expected the name I declared in the notebook:

file_directory = "testdata"
storage_name = "testdata"
index_name = "index1"

However, the files end up in a container with a number as its name instead:

image

Is this expected?

rnpramasamyai commented 3 weeks ago

@doruit Please check the logs of your indexing pod and you will get an idea of what went wrong.

timothymeyers commented 3 weeks ago

However, the files are uploaded in a container with a number as its name instead. Is this expected?

Hi @doruit - yes this is the expected behavior. The names that you give are hashed to improve the overall security posture.

Did you run into the same issues during indexing with your new deployment? Did you happen to try inspecting the index pod logs like I mentioned?
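To make the hashing point above concrete: the container name is derived deterministically from the name you supply, so the same input always maps to the same container. This sketch illustrates the idea only; the accelerator's actual scheme (algorithm, truncation, prefix) may differ:

```python
import hashlib

def container_name_for(storage_name: str) -> str:
    """Illustrative only: derive an opaque, deterministic container name
    from a user-supplied storage name via SHA-256 (truncated for brevity).
    The accelerator's real hashing scheme may differ."""
    return hashlib.sha256(storage_name.encode("utf-8")).hexdigest()[:24]
```

Deterministic naming means re-uploading to the same storage_name targets the same container, while the opaque name avoids leaking user-chosen labels.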

doruit commented 3 weeks ago

Hi @timothymeyers, earlier I saw in the indexing pod logs that the token limit was reached many times. That seems strange to me, as I'm using the following TPM settings:

image

That should be sufficient, right? I have also turned off dynamic quota allocation.

When looking at the jobs monitor, it says no jobs are running:

image

When checking the job status from the notebook at the same time it says:

image
rnpramasamyai commented 3 weeks ago

@doruit, could you please add the api_key property under each LLM node in the pipeline-settings.yaml file?
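For readers following along: the suggestion is to set the key explicitly on each LLM block rather than relying on the default credential path. A hedged sketch of what that could look like; the node names below follow GraphRAG's general settings conventions, not necessarily this repo's exact pipeline-settings.yaml layout:

```yaml
# Illustrative fragment only -- node names may differ in the actual file.
llm:
  type: azure_openai_chat
  api_key: <your-azure-openai-key>    # property being added
  api_base: https://<your-endpoint>.openai.azure.com
  api_version: 2024-02-15-preview
embeddings:
  llm:
    type: azure_openai_embedding
    api_key: <your-azure-openai-key>  # property being added
```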

doruit commented 3 weeks ago

@rnpramasamyai, I've added the api_key property:

image

After this I re-ran the Quickstart notebook to build a new index:

image

But now the indexing manager does not seem to start an indexing job at all.

Should I remove the graphrag namespace and run the deployment again?

rnpramasamyai commented 3 weeks ago

@doruit Please run the deployment script again.

doruit commented 3 weeks ago

The deployment was successful; however, indexing is still not working. Should the API version match the value from the deployment documentation, or the API version that is shown in the Playground > View Code window?

image image
rnpramasamyai commented 3 weeks ago

@doruit Please always check the pod's logs if indexing is not working, and post those logs.

doruit commented 3 weeks ago

I did a full deployment again, checked all parameters, and ran the notebook again from the start. After running the "Build an Index" step I get this message:

{
    'status_code': 200,
    'index_name': 'index7',
    'storage_name': 'testdata',
    'status': 'scheduled',
    'percent_complete': 0.0,
    'progress': '',
}

At the same time I'm watching the logs, waiting for the indexing job to come by, but I only get messages from the graphrag index manager every 5 minutes:

Every 2.0s: kubectl get jobs -...  SandboxHost-638599057829007509: Thu Aug 22 11:25:28 2024

NAME                              COMPLETIONS   DURATION   AGE
graphrag-index-manager-28738765   1/1           25s        28s

This is my parameters file:

{
  "GRAPHRAG_API_BASE": "https://aoai-graphrag-tst-francecentral.openai.azure.com",
  "GRAPHRAG_API_VERSION": "2024-02-15-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "text-embedding-ada-002",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gpt-4o",
  "GRAPHRAG_LLM_MODEL": "gpt-4o",
  "LOCATION": "francecentral",
  "RESOURCE_GROUP": "rg-graphrag-tst-04"
}

I'm not sure where to look now, as the indexing job does not start at all anymore. What region, LLM model version, API version, etc. should I use as a reference?
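An aside for anyone debugging a similar setup: a quick way to catch a missing or misspelled key in a parameters file like the one above is to validate it before deploying. The key list here is taken from the file shown; treat it as an assumption if your deployment version expects different keys:

```python
import json

# Keys as used in the parameters file above; adjust for your version.
REQUIRED_KEYS = {
    "GRAPHRAG_API_BASE",
    "GRAPHRAG_API_VERSION",
    "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME",
    "GRAPHRAG_EMBEDDING_MODEL",
    "GRAPHRAG_LLM_DEPLOYMENT_NAME",
    "GRAPHRAG_LLM_MODEL",
    "LOCATION",
    "RESOURCE_GROUP",
}

def missing_keys(params_json: str) -> set:
    """Return the required keys absent from a parameters JSON string."""
    return REQUIRED_KEYS - set(json.loads(params_json))
```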

rnpramasamyai commented 3 weeks ago

@doruit Indexing will take time to complete.

doruit commented 3 weeks ago

@rnpramasamyai, I've waited for an hour, but it seems it will not start, nor can I find any clue as to where to look for errors. In the job log I only see this message every 5 minutes:

image

What else can I check or rule out?

doruit commented 3 weeks ago

@rnpramasamyai @timothymeyers I switched from a CSP tenant to my MSDN tenant/subscription, did the full deployment, and it now seems to work:

HTTP/1.1 200 OK
content-length: 172
content-type: application/json
date: Fri, 23 Aug 2024 11:53:38 GMT
request-context: appId=cid-v1:xxxxxxxxxxxxxx
vary: Origin

{
    "status_code": 200,
    "index_name": "index1",
    "storage_name": "testdata",
    "status": "complete",
    "percent_complete": 100.0,
    "progress": "16 out of 16 workflows completed successfully."
}

I checked whether quota or an Azure policy caused the issue in the CSP tenant/subscription; however, I could not find any logs so far to rule everything out.

There is only one policy that might impact the creation of a VM/VMSS. That policy requires VMs to have managed disks, which they all have, so I guess it won't block anything. Another policy blocks creating classic resources.

However, the good news is that with the alternative method the deployment was successful.

eai-douglaswross commented 2 weeks ago

Firstly: thank you for this repo, and thanks for trying to help us punters understand what you have written.

I have the same issue: indexing stops at 2 of 16 workflows (12.5%). I do not have an MSDN tenant, but we do not have any policies specifically added to our tenant either; it is very new and out of the box.

The pod log command does not seem to work. I tried both names while the job was running:

graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/graphrag-index-manager-28746945 -n graphrag -f
Indexing job for 'indtestdata' already running. Will not schedule another. Exiting...
graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/indtestdata -n graphrag -f
error: error from server (NotFound): jobs.batch "indtestdata" not found in namespace "graphrag"

Can I suggest / request - as it may make everyone's job a little easier:

  1. Enable Azure AI Search access from the portal when in a DEV deployment mode: could you set a variable in the deployment, such as deployment_type=<dev/prod>, so that the blocking of Azure Portal access to the AI Search index is turned off in dev mode and locked down for a prod deployment?
  2. Add the option to deploy a VM into the private network as part of the infra deployment, so that we can use this method: https://learn.microsoft.com/en-gb/azure/search/service-create-private-endpoint#use-the-azure-portal-to-access-a-private-search-service
  3. Provide some instructions for manually putting a VM in the private network via the Azure portal, so we can remote into it and access the Azure Portal functionality as suggested in that link

i.e. it is very difficult to see what is going on and to understand what is going wrong.

Lastly, when you add a comment like:

@doruit, could you please add the api_key property under each LLM node in the following file: pipeline-settings.yaml?

For the rest of us trying to follow along, would you mind quickly telling us why you are suggesting it, so that we can also understand why it might fix the issue?

doruit commented 2 weeks ago

I still don't know what caused the process to get stuck. It was not due to an Azure policy or the api_key in pipeline-settings.yaml. Perhaps the model and API version caused the issue: other issue threads mention that if the vector size is slightly different from what is expected, indexing will fail.

@timothymeyers, in deployment.md it looks like the API version is fixed to "2023-03-15-preview". Is that correct? Or should the documentation instruct the developer to get the right API version from the deployed model (e.g. via the portal)?