Azure-Samples / graphrag-accelerator

One-click deploy of a Knowledge Graph powered RAG (GraphRAG) in Azure
https://github.com/microsoft/graphrag
MIT License

[BUG] My indexing no longer makes any progress #154

Open fangnster opened 3 weeks ago

fangnster commented 3 weeks ago

My sample Wikipedia articles are being indexed, but the job always stays at 0.0% completed. How do I fix it?

Screenshot of the pod logs as follows:

kubectl logs job/graphrag-index-manager-28738255 -n graphrag -f

Scheduling job for index: testindex
[ERROR] 2024-08-22 02:58:32,367 - Index job manager encountered error scheduling indexing job
Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3da54996-302b-4b53-8550-eda0a9ca4ee3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4394828c-45ff-46b1-99c3-43de3fef08f8', 'X-Kubernetes-Pf-Prioritylevel-Uid': '95614f89-7a01-4064-bb56-9f052b3cb22f', 'Date': 'Thu, 22 Aug 2024 02:58:30 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5","group":"batch","kind":"jobs"},"code":409}

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3da54996-302b-4b53-8550-eda0a9ca4ee3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4394828c-45ff-46b1-99c3-43de3fef08f8', 'X-Kubernetes-Pf-Prioritylevel-Uid': '95614f89-7a01-4064-bb56-9f052b3cb22f', 'Date': 'Thu, 22 Aug 2024 02:58:30 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5","group":"batch","kind":"jobs"},"code":409}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 120, in <module>
    main()
  File "/backend/manage-indexing-jobs.py", line 116, in main
    schedule_indexing_job(index_to_schedule)
  File "/backend/manage-indexing-jobs.py", line 55, in schedule_indexing_job
    pipeline_job["status"] = PipelineJobState.FAILED
TypeError: 'PipelineJob' object does not support item assignment

rnpramasamyai commented 3 weeks ago

Error: "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" already exists

Please create a text file, add your own content to it, and index it with a new index name.
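A minimal sketch of that flow in Python, following the pattern of the Quickstart notebook. The endpoint paths, parameter names, and the APIM subscription-key header below are assumptions about your deployment, so double-check them against the notebook and your APIM:

import requests

# Assumed values -- replace with your APIM gateway URL and subscription key.
ENDPOINT = "https://<your-apim-gateway>"          # hypothetical placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<key>"}  # assumed APIM key header

storage_name = "my-new-storage"  # a brand-new blob container name
index_name = "my-new-index"      # a brand-new index name

# 1. Upload a small text file with your own content (assumed route: POST /data).
with open("sample.txt", "rb") as f:
    resp = requests.post(
        f"{ENDPOINT}/data",
        files=[("files", ("sample.txt", f))],
        params={"storage_name": storage_name},  # parameter name is an assumption
        headers=HEADERS,
    )
    resp.raise_for_status()

# 2. Kick off indexing of that data under the new index name (assumed route: POST /index).
resp = requests.post(
    f"{ENDPOINT}/index",
    params={"storage_name": storage_name, "index_name": index_name},  # assumed names
    headers=HEADERS,
)
print(resp.status_code, resp.json())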

fangnster commented 3 weeks ago

I added another new index name in the script and checked the status of the indexing job; it is still 0.0% completed. After watching the jobs with "watch kubectl get jobs -n graphrag", "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" is always there and no new indexing job is ever created. How do I kill it or start up a new one?

Error: "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" already exists

Please create a text file, add your own content to it, and index it with a new index name.

rnpramasamyai commented 3 weeks ago

@fangnster Did you change index name and storage name in the 1-Quickstart.ipynb? image

fangnster commented 3 weeks ago

@fangnster Did you change index name and storage name in the 1-Quickstart.ipynb? image

Yes, I changed to a new index name and a new storage name.

rnpramasamyai commented 3 weeks ago

@fangnster There may already be an index job running. Please check the status of the indexing job and whether the indexing pod is running.

fangnster commented 3 weeks ago

@fangnster There may already be an index job running. Please check the status of the indexing job and whether the indexing pod is running.

This job has been running for several days, and its status has stayed at 0.0% completed the whole time. How can I fix it?

image

This screenshot is the same regardless of whether the index name and storage name were changed.

rnpramasamyai commented 3 weeks ago

@fangnster Please stop or delete the index. There are APIs available for deleting both the index and the storage. Please check your APIM.
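For example, something along these lines with requests. Treat the routes and the APIM key header below as assumptions and check them against the API definitions in your APIM:

import requests

ENDPOINT = "https://<your-apim-gateway>"          # hypothetical placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<key>"}  # assumed APIM key header

index_name = "testindex"
storage_name = "<your-storage-container>"  # hypothetical placeholder

# Delete the stuck index (assumed route: DELETE /index/{index_name}).
resp = requests.delete(f"{ENDPOINT}/index/{index_name}", headers=HEADERS)
print("delete index:", resp.status_code, resp.text)

# Optionally delete the uploaded data as well (assumed route: DELETE /data/{storage_name}).
resp = requests.delete(f"{ENDPOINT}/data/{storage_name}", headers=HEADERS)
print("delete data:", resp.status_code, resp.text)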

fangnster commented 3 weeks ago

@fangnster Please stop or delete the index. There are APIs available for deleting both the index and the storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, the manager job (named "graphrag-index-manager-****") starts up one after another: each run is killed automatically after 5 minutes and then another one starts. That is when the "reason: Conflict" error shown in the original post appears in the logs.

screenshot as follows: image

rnpramasamyai commented 3 weeks ago

@fangnster Please use the instructions below to retrieve logs from the pods. image

fangnster commented 2 weeks ago

@fangnster Please use the instructions below to retrieve logs from the pods. image

After studying those commands, I restarted a new indexing job with new storage and index file names, and then the same "reason: Conflict" error occurred. While observing the progress of that new indexing job, I found the script "indexing-job-manage-template.yaml": image

Can I increase the 5-minute schedule to a longer interval, such as 15 minutes, so that the previous indexing job has time to finish processing? Could you tell me the reason for the 5-minute setting?

timothymeyers commented 2 weeks ago

When you initiate an indexing job, a record of it is put into CosmosDB for the job and it is listed in a state of "Scheduled."

The K8s CronJob runs every 5 mins and checks CosmosDB for Scheduled indexing jobs, and then initiates actual indexing processes for them in order. It uses a k8s Job deployment for an indexing pod to be spun up (the indexing-<id> pod).
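To make that concrete, the manager's cycle is roughly the following. This is a simplified sketch, not the exact backend source: the CosmosDB lookup is stubbed with a hypothetical helper, and only the 409 case is handled. The Conflict in your log is exactly this create call failing because a Job with that name already exists from an earlier run:

from kubernetes import client, config
from kubernetes.client.exceptions import ApiException


def get_scheduled_jobs():
    """Hypothetical stand-in for the CosmosDB query that returns indexing jobs
    still in the "Scheduled" state, as (index_name, job_manifest) pairs."""
    return []


def schedule_indexing_job(index_name, job_manifest):
    """Create the k8s Job that spins up the indexing pod for one index."""
    batch_v1 = client.BatchV1Api()
    try:
        batch_v1.create_namespaced_job(namespace="graphrag", body=job_manifest)
    except ApiException as e:
        if e.status == 409:
            # A Job with this name already exists (still running, stuck, or not
            # yet cleaned up), so the create call conflicts.
            print(f"Job for index '{index_name}' already exists; skipping.")
        else:
            raise


def main():
    config.load_incluster_config()  # the CronJob pod runs inside the AKS cluster
    for index_name, manifest in get_scheduled_jobs():
        schedule_indexing_job(index_name, manifest)


if __name__ == "__main__":
    main()

The TypeError at the bottom of your log ('PipelineJob' object does not support item assignment) is the manager's own error path failing after that 409, which is why the manager job exits with an error instead of just reporting the conflict.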

fangnster commented 2 weeks ago

When you initiate an indexing job, a record of it is put into CosmosDB for the job and it is listed in a state of "Scheduled."

The K8s CronJob runs every 5 mins and checks CosmosDB for Scheduled indexing jobs, and then initiates actual indexing processes for them in order. It uses a k8s Job deployment for an indexing pod to be spun up (the indexing-<id> pod).

Could you tell me how to change the cronjob interval from 5 minutes to a longer one?

timothymeyers commented 2 weeks ago

you can edit the template for the cron job by doing kubectl edit cj/graphrag-index-manager

and looking for the schedule: "*/5 * * * *" line. Change the number to a different number of minutes, and save the manifest.

Note that if you want to change it permanently between deployments, you'd change it in this file, and redeploy the backend container to Azure Container Registry.
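If you'd rather script the change than edit interactively, here is a small sketch with the kubernetes Python client (assuming a client version that exposes batch/v1 CronJobs; the CronJob name and namespace are the ones from your logs):

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

# Patch only the schedule field of the index-manager CronJob, e.g. to every 15 minutes.
client.BatchV1Api().patch_namespaced_cron_job(
    name="graphrag-index-manager",
    namespace="graphrag",
    body={"spec": {"schedule": "*/15 * * * *"}},
)

This only changes the live cluster object; as noted above, a permanent change still means editing the template file and redeploying the backend container.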

fangnster commented 2 weeks ago

you can edit the template for the cron job by doing kubectl edit cj/graphrag-index-manager

and looking for the schedule: "*/5 * * * *" line. Change the number to a different number of minutes, and save the manifest.

Note that if you want to change it permanently between deployments, you'd change it in this file, and redeploy the backend container to Azure Container Registry.

image I tried to resolve the error with search_engine, but every attempt failed, as shown in the screenshot above.

In addition, I redeployed this file with the schedule changed to "*/15 * * * *", and the deployment succeeded. However, when I check the cronjob with "kubectl describe cronjob", it still shows the former "*/5 * * * *" schedule.

MeroZemory commented 2 weeks ago

@fangnster Please stop or delete the index. There are APIs available for deleting both the index and the storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, the manager job (named "graphrag-index-manager-****") starts up one after another: each run is killed automatically after 5 minutes and then another one starts. That is when the "reason: Conflict" error shown in the original post appears in the logs.

screenshot as follows: image

I've been stuck in the scheduled 0.0% state for a long time too, but the cronjob that manages the indexing jobs (created from indexing-job-manager-template.yaml) runs every 5 minutes, which doesn't seem to have anything to do with the indexing processing time (you said it runs/shuts down every 5 minutes, but I understand that only the run is every 5 minutes).

My guess is that you're simply getting that error because the AKS job (indexing-job-*) didn't complete and no new indexing job was created, but the one that was already created is still running.

If there's a problem, it's probably the part where the indexing job doesn't complete and hangs. (I haven't had indexing complete in over 30 minutes either, but I'm not sure if it's in progress or hanging).
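I haven't tried this myself, but if the already-created Job really is hung, you could clear it so the manager can recreate it on its next run. A sketch with the kubernetes Python client (a plain kubectl delete job would do the same); the job name is the one from the log in the original post:

from kubernetes import client, config

config.load_kube_config()

# Delete the stuck indexing Job; "Foreground" propagation also removes the pod it created.
client.BatchV1Api().delete_namespaced_job(
    name="indexing-job-33b5e67636ee5ae3432d87c2cc8408d5",  # job name taken from the log
    namespace="graphrag",
    propagation_policy="Foreground",
)

Note this only clears the Kubernetes side; if the job record in CosmosDB is still marked as scheduled or running, that state may also need attention.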

fangnster commented 2 weeks ago

@fangnster Please stop or delete the index. There are APIs available for deleting both the index and the storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, the manager job (named "graphrag-index-manager-****") starts up one after another: each run is killed automatically after 5 minutes and then another one starts. That is when the "reason: Conflict" error shown in the original post appears in the logs. screenshot as follows: image

I've been stuck in the scheduled 0.0% state for a long time too, but the cronjob that manages the indexing jobs (created from indexing-job-manager-template.yaml) runs every 5 minutes, which doesn't seem to have anything to do with the indexing processing time (you said it runs/shuts down every 5 minutes, but I understand that only the run is every 5 minutes).

My guess is that you're simply getting that error because the AKS job (indexing-job-*) didn't complete and no new indexing job was created, but the one that was already created is still running.

If there's a problem, it's probably the part where the indexing job doesn't complete and hangs. (I haven't had indexing complete in over 30 minutes either, but I'm not sure if it's in progress or hanging).

I found that the pod of the indexing job was mostly hanging, while graphrag-index-manager starts up every 5 minutes and logs the error that includes "already exists".

mb-porini commented 2 weeks ago

Hi! I hope to find a little help from the community. I'm trying to get the indexer from {scheduled, completion_percent: 0.0} to "running", but I cannot find the correct API call on the backend. I'm following the Quickstart.ipynb using the Wikipedia dataset.

Am I missing anything?

Thanks!

timothymeyers commented 2 weeks ago

@mb-porini - if the index status is showing as {scheduled} then the backend API has done what it is going to do. There is a k8s CronJob that spins up every 5 minutes to check for indexing jobs in a {scheduled} state and will kick off a k8s Job to start the process. See if you can look at the Pod logs for recently completed CronJobs or Jobs. The indexer-<hash> pod is the one that would update the indexing status to {running}.

@fangnster - If the indexer-<hash> pod fails unexpectedly, you can get into a situation where the index state in CosmosDB is out of sync (it will say 'running' when it should be 'failed'). You can either (1) delete the index using the API (AdvancedStart notebook has code for this), (2) rerun the index with a new index name, (3) go into CosmosDB and manually update the status to read 'failed' and then restart the indexing job via the API again. (1) and (2) are considerably more straightforward than (3), I think. Also, for future reference, kubectl edit failed because you don't have vi or vim installed.
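For (3), it's a one-field update on the indexing job's CosmosDB document. A rough sketch with the azure-cosmos SDK; the account URL, key, database name, container name, partition key, and status value below are assumptions about this deployment, so verify them in the portal's Data Explorer first:

from azure.cosmos import CosmosClient

# Assumed connection details -- take these from the CosmosDB account the accelerator deployed.
client = CosmosClient(url="https://<cosmos-account>.documents.azure.com:443/", credential="<key>")

# Assumed database/container names; confirm them in Data Explorer.
container = client.get_database_client("graphrag").get_container_client("jobs")

index_name = "testindex"
doc = container.read_item(item=index_name, partition_key=index_name)  # assumed id/partition key
doc["status"] = "failed"  # assumed field name and value -- mirror what other job documents use
container.upsert_item(doc)

Once the status reads 'failed', restarting the index via the API should let the CronJob schedule it cleanly again.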

fangnster commented 2 weeks ago

@mb-porini - if the index status is showing as {scheduled} then the backend API has done what it is going to do. There is a k8s CronJob that spins up every 5 minutes to check for indexing jobs in a {scheduled} state and will kick off a k8s Job to start the process. See if you can look at the Pod logs for recently completed CronJobs or Jobs. The indexer-<hash> pod is the one that would update the indexing status to {running}.

@fangnster - If the indexer-<hash> pod fails unexpectedly, you can get into a situation where the index state in CosmosDB is out of sync (it will say 'running' when it should be 'failed'). You can either (1) delete the index using the API (AdvancedStart notebook has code for this), (2) rerun the index with a new index name, (3) go into CosmosDB and manually update the status to read 'failed' and then restart the indexing job via the API again. (1) and (2) are considerably more straightforward than (3), I think. Also, for future reference, kubectl edit failed because you don't have vi or vim installed.

Hi, thanks for your kind and repeated responses to my question. I have tried (1) and (2) several times, but the error is the same as before, as shown in the screenshot below:

image

KubaBir commented 2 weeks ago

I have the same problem. The index-manager kicks off, starts an indexing job, and later crashes due to the duplicate name (trying to queue the same job again?). The job remains in a pending state and the manager in CrashLoopBackOff.

I deployed the repo from the VS Code container and am using gpt-4o-mini. image

Here is the whole log from index-manager

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '100823ba-e15d-47a0-a92f-d08284067eba', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '563bf5c9-3718-45a1-837b-473fd3f842d9', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e1bfb844-2c87-4f34-9901-7895c0dc1c2b', 'Date': 'Fri, 30 Aug 2024 13:50:41 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-3e23e8160039594a33894f6564e1b134\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-3e23e8160039594a33894f6564e1b134","group":"batch","kind":"jobs"},"code":409}

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '100823ba-e15d-47a0-a92f-d08284067eba', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '563bf5c9-3718-45a1-837b-473fd3f842d9', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e1bfb844-2c87-4f34-9901-7895c0dc1c2b', 'Date': 'Fri, 30 Aug 2024 13:50:41 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-3e23e8160039594a33894f6564e1b134\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-3e23e8160039594a33894f6564e1b134","group":"batch","kind":"jobs"},"code":409}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 120, in <module>
    main()
  File "/backend/manage-indexing-jobs.py", line 116, in main
    schedule_indexing_job(index_to_schedule)
  File "/backend/manage-indexing-jobs.py", line 55, in schedule_indexing_job
    pipeline_job["status"] = PipelineJobState.FAILED
TypeError: 'PipelineJob' object does not support item assignment

I also tried removing the index via the API (it is removed along with the storage container) and changing the names of both the storage container and the index.