Azure-Samples / graphrag-accelerator

One-click deploy of a Knowledge Graph powered RAG (GraphRAG) in Azure
https://github.com/microsoft/graphrag
MIT License

[BUG] Indexing job gets stuck (more details below) #55

Open c0derm4n opened 1 week ago

c0derm4n commented 1 week ago

Describe the bug: Running Quickstart.ipynb - Start new indexing job. When I checked the progress yesterday, it was at 6.25%. It has been 24 hours now, and it is still at 6.25%!

{
    'status_code': 200,
    'index_name': 'graph_rag_index_0705_v3',
    'storage_name': 'graph_rag_storage_0705_v3',
    'status': 'running',
    'percent_complete': 6.25,
    'progress': "Workflow 'create_base_extracted_entities' started.",
}
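For reference, a status like the one above can be polled with a small sketch like this - the gateway URL, endpoint path, and subscription-key header are assumptions modeled on the Quickstart notebook, not verified values, so adjust them to your deployment:

import time
import requests

APIM_URL = "https://<my-apim>.azure-api.net"          # hypothetical gateway URL
HEADERS = {"Ocp-Apim-Subscription-Key": "<api-key>"}  # hypothetical auth header

def poll_index_status(index_name: str, interval_s: int = 60) -> None:
    # Poll the (assumed) status endpoint until the job leaves the 'running' state.
    while True:
        resp = requests.get(f"{APIM_URL}/index/status/{index_name}", headers=HEADERS)
        body = resp.json()
        print(body.get("percent_complete"), "-", body.get("progress"))
        if body.get("status") != "running":  # assumed terminal states: complete/failed
            break
        time.sleep(interval_s)

poll_index_status("graph_rag_index_0705_v3")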

Expected behavior: Indexing should reach 100% completion.

Screenshots: [image]

jgbradley1 commented 1 week ago

How much data did you upload before starting an indexing job? Did you use the wikipedia sample data download script to get a couple of files or provide your own data?

c0derm4n commented 1 week ago

I have uploaded 700 papers in txt format (parsed from PDF), totaling 17 MB. @jgbradley1

c0derm4n commented 1 week ago

The largest individual file is 100 KB.

dandinu commented 1 week ago

Interestingly enough, my job has also been stuck for about 2 hours now. It's a single txt file of about 3 kB. I get this over and over:

{
    'status_code': 200,
    'index_name': 'findata',
    'storage_name': 'findata',
    'status': 'running',
    'percent_complete': 12.5,
    'progress': '2 out of 16 workflows completed successfully.',
}

Is there any way to debug this in the Azure Console?

jgbradley1 commented 1 week ago

In the solution accelerator, a common reason why jobs can appear to run a long time comes down to the TPM/RPM quota of the Azure OpenAI instance you're using and the retry logic that is configured.

I would first like to direct your attention to this config file in the accelerator. This config file is very similar to the config file used by the graphrag library. A complete description of the config fields is documented here. In the solution accelerator, we set some config fields dynamically and some are hardcoded. The TPM/RPM is one of the hardcoded values.

If you have deployed the accelerator with an AOAI model deployment whose TPM/RPM quota is much lower than the hardcoded values, that could be causing the graphrag library to hit the rate limit thresholds of your AOAI model very quickly. The accelerator has two levels of retry logic implemented:

  1. inside the graphrag library itself, rate limiting and retry logic are configured with some default values.
  2. in the AKS deployment, when a user kicks off an indexing job, the indexing job is actually run as a k8s Job. If the k8s job fails (i.e. due to the graphrag library erroring out after too many retries), AKS will attempt to restart the job a maximum number of times (defined here).

One way to debug the indexing job while it's running is to use kubectl. Assuming you ran the deploy.sh script following the deployment guide, then kubectl will already be logged in to the AKS instance (see this guide if not).

Now you can run kubectl get pods, which will print out at least three pod names that all start with "graphrag-*". If you have an active indexing job, you should also see another k8s pod named indexing-job-<hash>. You can run kubectl logs indexing-job-<hash> to see the entire log output of the indexing job, which may provide more detailed information explaining the errors you are experiencing.
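Concretely, the debugging flow above boils down to a few standard kubectl commands (nothing here is accelerator-specific; replace <hash> with the suffix shown by kubectl get pods):

# List pods in the current namespace; look for graphrag-* pods and an
# indexing-job-* pod if a job is active
kubectl get pods

# Dump the full log output of the indexing job
kubectl logs indexing-job-<hash>

# Follow the log live while the job runs
kubectl logs -f indexing-job-<hash>

# If the pod is failing or restarting, inspect events and restart counts
kubectl describe pod indexing-job-<hash>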

This accelerator is meant to be a reference architecture, which means you may need to modify it slightly to fit the needs of your environment and intended usage. I would suggest testing your deployment on just a few files at first (i.e. not all 700 files) to get a feel for how long indexing will take given your TPM/RPM quota. Do be aware that step 2 of the graphrag indexing pipeline is entity extraction - this is the most time-intensive step of the entire pipeline, accounting for 80-90% of the total indexing time. If multiple indexing jobs are kicked off by an API user without an appropriate amount of TPM/RPM quota allocated, too many simultaneous jobs can easily overload your AOAI model rate limits (implementing some sort of indexing job management queue is on the list of things we'd like to tackle, which may help solve this problem).
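As a rough way to reason about this before committing all 700 files, you can back-of-envelope the indexing time from your TPM quota. A sketch in Python - every number below is an illustrative assumption, not a measured value:

# Back-of-envelope estimate of entity-extraction time at a given TPM quota.
# All numbers are illustrative assumptions; substitute your own stats.
num_files = 700
calls_per_file = 5        # assumed LLM calls per chunked file
tokens_per_call = 2_000   # assumed prompt + completion tokens per call
tpm_quota = 80_000        # your AOAI deployment's tokens-per-minute quota

total_tokens = num_files * calls_per_file * tokens_per_call
minutes_at_quota = total_tokens / tpm_quota
print(f"~{total_tokens:,} tokens => ~{minutes_at_quota:.0f} min at {tpm_quota:,} TPM")
# ~7,000,000 tokens => ~88 min at 80,000 TPM (best case, ignoring retries)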

Note: once you start modifying the graphrag pipeline settings to fit your needs, you can rerun the deploy.sh script on the same resource group as your original deployment. The deployment script was written so that rerunning it against an existing graphrag resource group deployment should not cause problems. It will only apply the changes you make (and takes around 5 minutes to run).
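For example, after editing the pipeline settings (the -p flag and the infra directory here are assumed from the deployment guide; verify against your copy of the repo):

# Rerun the deployment against the same parameters file; only changed
# settings are applied. Invocation assumed from the deployment guide.
cd infra
bash deploy.sh -p deploy.parameters.json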

dandinu commented 1 week ago

Thanks for the very useful guide. It seems I may have misconfigured the deployment, since I see this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 69, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 373, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 167, in handle_async_request
    raise UnsupportedProtocol(
httpcore.UnsupportedProtocol: Request URL is missing an 'http://' or 'https://' protocol.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1537, in _request
    response = await self._client.send(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1661, in send
    response = await self._send_handling_auth(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1763, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 372, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.UnsupportedProtocol: Request URL is missing an 'http://' or 'https://' protocol.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/claims/claim_extractor.py", line 121, in __call__
    claims = await self._process_document(prompt_args, text, doc_index)
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/claims/claim_extractor.py", line 165, in _process_document
    response = await self._llm(
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/caching_llm.py", line 104, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 177, in __call__
    result, start = await execute_with_retry()
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry
    async for attempt in retryer:
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 165, in execute_with_retry
    return await do_attempt(), start
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 147, in do_attempt
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 49, in __call__
    return await self._invoke(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 53, in _invoke
    output = await self._execute_llm(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 55, in _execute_llm
    completion = await self.client.chat.completions.create(
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 1289, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1805, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1503, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1571, in _request
    raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.
[ERROR] 2024-07-07 18:33:30,195 - Claim Extraction Error
[ERROR] 2024-07-07 18:33:30,931 - Error Invoking LLM
[ERROR] 2024-07-07 18:33:31,379 - Error Invoking LLM
[ERROR] 2024-07-07 18:33:32,159 - Error Invoking LLM

It seems like the problem is this:

httpcore.UnsupportedProtocol: Request URL is missing an 'http://' or 'https://' protocol.

But I do not understand where this is coming from, since my deploy.parameters.json looks like this:

{
  "GRAPHRAG_API_BASE": "graphragdaneast2",
  "GRAPHRAG_API_VERSION": "2024-02-15-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "embed",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gpt4",
  "GRAPHRAG_LLM_MODEL": "gpt-4",
  "LOCATION": "eastus2",
  "RESOURCE_GROUP": "GraphRagTest"
}

What am I missing?

dandinu commented 1 week ago

I just realised that my problem was putting the wrong format for GRAPHRAG_API_BASE, so I changed that to the actual Azure OpenAI Endpoint URL. That was a bit dumb. Testing again.

c0derm4n commented 1 week ago

Have you solved your problem?

jgbradley1 commented 1 week ago

The API endpoint is expected to be provided with the following format:

GRAPHRAG_API_BASE=https://<myname>.openai.azure.com

In the documentation, I think we can provide more clarification/examples for each of the deployment variables so that it is clearer in the future.
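Putting that together, a correctly formatted deploy.parameters.json would look something like this (placeholder values; structurally it differs from the failing example above only in GRAPHRAG_API_BASE being a full https:// URL):

{
  "GRAPHRAG_API_BASE": "https://<myname>.openai.azure.com",
  "GRAPHRAG_API_VERSION": "2024-02-15-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "<embedding-deployment-name>",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "<llm-deployment-name>",
  "GRAPHRAG_LLM_MODEL": "gpt-4",
  "LOCATION": "eastus2",
  "RESOURCE_GROUP": "<resource-group-name>"
}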

c0derm4n commented 1 week ago

This is the log after I ran kubectl logs indexing-job-e8eca148d5bc2b0004e7dc49db249490-qgl5q:

openai.RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Rate limit is exceeded. Try again in 5 seconds.'}}
[ERROR] 2024-07-08 01:46:56,830 - Claim Extraction Error
WARNING:graphrag.llm.base.rate_limiting_llm:Process failed to invoke LLM 1/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 0 seconds. Follow recommendation? True
[ERROR] 2024-07-08 01:46:57,288 - Error Invoking LLM
WARNING:graphrag.llm.base.rate_limiting_llm:Process failed to invoke LLM 3/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 0 seconds. Follow recommendation? True
[ERROR] 2024-07-08 01:46:57,300 - Error Invoking LLM
WARNING:graphrag.llm.base.rate_limiting_llm:Process failed to invoke LLM 10/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 0 seconds. Follow recommendation? True
[ERROR] 2024-07-08 01:46:57,336 - Error Invoking LLM
ERROR:graphrag.index.graph.extractors.claims.claim_extractor:error extracting claim
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/claims/claim_extractor.py", line 121, in __call__
    claims = await self._process_document(prompt_args, text, doc_index)
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/claims/claim_extractor.py", line 165, in _process_document
    response = await self._llm(
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/caching_llm.py", line 104, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 177, in __call__
    result, start = await execute_with_retry()
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry
    async for attempt in retryer:
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 165, in execute_with_retry
    return await do_attempt(), start
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 151, in do_attempt
    await sleep_for(sleep_time)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 147, in do_attempt
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 49, in __call__
    return await self._invoke(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 53, in _invoke
    output = await self._execute_llm(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 55, in _execute_llm
    completion = await self.client.chat.completions.create(
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 1289, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1805, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1503, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1599, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Rate limit is exceeded. Try again in 5 seconds.'}}
[ERROR] 2024-07-08 01:46:57,347 - Claim Extraction Error
WARNING:graphrag.llm.base.rate_limiting_llm:Process failed to invoke LLM 1/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 0 seconds. Follow recommendation? True
[ERROR] 2024-07-08 01:46:57,829 - Error Invoking LLM
WARNING:graphrag.llm.base.rate_limiting_llm:Process failed to invoke LLM 2/10 attempts. Cause: rate limit exceeded, will retry. Recommended sleep for 0 seconds. Follow recommendation? True
[ERROR] 2024-07-08 01:46:58,662 - Error Invoking LLM

c0derm4n commented 1 week ago

And these are my deploy parameters:

{
  "GRAPHRAG_API_BASE": "https://ada002-eus2.openai.azure.com/",
  "GRAPHRAG_API_VERSION": "2023-12-01-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "ada002-eus2",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "tcl-ai-France",
  "GRAPHRAG_LLM_MODEL": "gpt-4",
  "LOCATION": "East US 2",
  "RESOURCE_GROUP": "graph_rag2"
}

May I ask if the parameter GRAPHRAG_API_BASE is set correctly? I don't quite understand the meaning of GRAPHRAG_API_BASE, so I used the endpoint of my embedding model.

eyast commented 1 week ago

hi @c0derm4n - yes, that's the correct syntax for GRAPHRAG_API_BASE (see Josh's reply above). The error you shared seems to indicate that you are hitting a rate limit. You can modify the rate used by the accelerator by modifying this file.

On Azure, you can increase the quota for each model by following the instructions here: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota?tabs=rest

I advise you to first make the modifications on Azure and keep the local YAML file intact. Only once you have confirmed you have access to more TPM should you increase the values in the YAML file. Otherwise, if the Azure TPM quota is lower than the value in your config file, you will hit rate limit errors (HTTP 429).
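For reference, the rate-limit knobs in the pipeline settings look roughly like this (field names follow the graphrag library's LLM config; the values shown are illustrative and should stay at or below your actual Azure quota):

llm:
  type: azure_openai_chat
  model: gpt-4
  # Keep these at or below your AOAI deployment's quota; the numbers
  # shown are illustrative, not recommendations.
  tokens_per_minute: 80000
  requests_per_minute: 480
  max_retries: 10
  concurrent_requests: 25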

chiara89 commented 1 week ago

Hello, I am having the same issue of the indexing job getting stuck. I am using the wikipedia articles provided. Here are my deployment parameters:

{
  "GRAPHRAG_API_BASE": "https://checklistcreation.openai.azure.com/",
  "GRAPHRAG_API_VERSION": "2024-05-13",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "Embedding",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gtp4-0",
  "GRAPHRAG_LLM_MODEL": "gpt-4o",
  "LOCATION": "UK South",
  "RESOURCE_GROUP": "Checklist_creation"
}

Here is the log:

[ERROR] 2024-07-11 08:25:49,090 - Entity Extraction Error
[ERROR] 2024-07-11 08:25:49,538 - Error Invoking LLM
ERROR:root:error extracting graph
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 118, in __call__
    result = await self._process_document(text, prompt_variables)
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 146, in _process_document
    response = await self._llm(
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/caching_llm.py", line 104, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 177, in __call__
    result, start = await execute_with_retry()
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry
    async for attempt in retryer:
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 165, in execute_with_retry
    return await do_attempt(), start
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 147, in do_attempt
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 49, in __call__
    return await self._invoke(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 53, in _invoke
    output = await self._execute_llm(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 55, in _execute_llm
    completion = await self.client.chat.completions.create(
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 1289, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1805, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1503, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1599, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}

It seems it can't find the model, but the deployment parameters should be correct. Any ideas on how to solve the problem?

Yushi-Saito commented 4 days ago

Hi! Is "gtp" right? "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gtp4-0",