Azure-Samples / graphrag-accelerator

One-click deploy of a Knowledge Graph powered RAG (GraphRAG) in Azure
https://github.com/microsoft/graphrag
MIT License

[BUG] my indexing never makes any progress #154

Open fangnster opened 3 months ago

fangnster commented 3 months ago

My sample Wikipedia articles are being indexed, but the job always shows 0.0% completed. How do I fix it?

Screenshot of pod logs as follows:

kubectl logs job/graphrag-index-manager-28738255 -n graphrag -f

Scheduling job for index: testindex
[ERROR] 2024-08-22 02:58:32,367 - Index job manager encountered error scheduling indexing job
Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3da54996-302b-4b53-8550-eda0a9ca4ee3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4394828c-45ff-46b1-99c3-43de3fef08f8', 'X-Kubernetes-Pf-Prioritylevel-Uid': '95614f89-7a01-4064-bb56-9f052b3cb22f', 'Date': 'Thu, 22 Aug 2024 02:58:30 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5","group":"batch","kind":"jobs"},"code":409}

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3da54996-302b-4b53-8550-eda0a9ca4ee3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4394828c-45ff-46b1-99c3-43de3fef08f8', 'X-Kubernetes-Pf-Prioritylevel-Uid': '95614f89-7a01-4064-bb56-9f052b3cb22f', 'Date': 'Thu, 22 Aug 2024 02:58:30 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-33b5e67636ee5ae3432d87c2cc8408d5","group":"batch","kind":"jobs"},"code":409}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 120, in <module>
    main()
  File "/backend/manage-indexing-jobs.py", line 116, in main
    schedule_indexing_job(index_to_schedule)
  File "/backend/manage-indexing-jobs.py", line 55, in schedule_indexing_job
    pipeline_job["status"] = PipelineJobState.FAILED
TypeError: 'PipelineJob' object does not support item assignment
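From the 409 Conflict above, it looks like a Kubernetes Job named indexing-job-33b5e67636ee5ae3432d87c2cc8408d5 already exists in the graphrag namespace, so the manager cannot create it again. A rough way to confirm and clear the leftover Job (assuming its state can be discarded):

kubectl get jobs -n graphrag
# delete the conflicting Job; the index-manager CronJob may recreate it on its next run if the index is still scheduled
kubectl delete job indexing-job-33b5e67636ee5ae3432d87c2cc8408d5 -n graphrag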

rnpramasamyai commented 3 months ago

Error: "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" already exists

Please create a text file, add your own content to it, and index it with a new index name.

fangnster commented 3 months ago

I added another new index name in the script and checked the status of the indexing job; it is still 0.0% completed. After watching with "watch kubectl get jobs -n graphrag", "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" always exists and no new indexing job is ever created. How do I kill it or start up a new job?

Error: "indexing-job-33b5e67636ee5ae3432d87c2cc8408d5" already exists

Please create a text file, add your own content to it, and index it with a new index name.

rnpramasamyai commented 3 months ago

@fangnster Did you change the index name and storage name in 1-Quickstart.ipynb? (screenshot)

fangnster commented 3 months ago

@fangnster Did you change the index name and storage name in 1-Quickstart.ipynb? (screenshot)

Yes, a new index name and a new storage name have both been set.

rnpramasamyai commented 3 months ago

@fangnster There may already be an indexing job running. Please check the status of the indexing job and whether the indexing pod is running.

fangnster commented 3 months ago

@fangnster There may already be an indexing job running. Please check the status of the indexing job and whether the indexing pod is running.

This job has been running for several days, and its status has stayed at 0.0% completed the whole time. How do I fix it?

(screenshot)

The screenshot looks the same regardless of whether the index name and storage name are changed.

rnpramasamyai commented 3 months ago

@fangnster Please stop or delete the index. There are many APIs available for deleting the index and storage. Please check your APIM.
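As a rough sketch of what that can look like from the command line (the endpoint paths and placeholder names below are assumptions based on the notebooks and may differ in your deployment; check your APIM for the exact routes):

# hypothetical routes - verify against your APIM / the notebooks
APIM_URL="https://<your-apim-gateway>.azure-api.net"
KEY="<your-apim-subscription-key>"

# delete the index
curl -X DELETE "$APIM_URL/index/<index-name>" -H "Ocp-Apim-Subscription-Key: $KEY"

# delete the storage container that holds the uploaded files
curl -X DELETE "$APIM_URL/data/<storage-name>" -H "Ocp-Apim-Subscription-Key: $KEY"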

fangnster commented 3 months ago

@fangnster Please stop or delete the index. There are many APIs available for deleting the index and storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, manager jobs named "graphrag-index-manager-****" start up one by one: each is killed automatically after 5 minutes, and then another one starts. As a result, the "reason: Conflict" error shown in the first comment appears in the logs.

Screenshot as follows: (screenshot)

rnpramasamyai commented 3 months ago

@fangnster Please use the instructions below to retrieve logs from the pods. (screenshot)

fangnster commented 3 months ago

@fangnster Please use the instructions below to retrieve logs from the pods. (screenshot)

After studying these commands, I restarted a new indexing job with new storage and index file names, and the same "reason: Conflict" error occurred. While watching the progress of that new indexing run, I found a script, "indexing-job-manage-template.yaml", as follows: (screenshot)

Can I change the 5-minute schedule to a longer interval, such as 15 minutes, so that the previous indexing job has time to finish completely? Could you tell me the reason for the 5-minute setting?

timothymeyers commented 3 months ago

When you initiate an indexing job, a record of it is put into CosmosDB for the job and it is listed in a state of "Scheduled."

The K8s CronJob runs every 5 mins and checks CosmosDB for Scheduled indexing jobs, and then initiates actual indexing processes for them in order. It uses a k8s Job deployment for an indexing pod to be spun up (the indexing-<id> pod).
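A minimal way to watch that machinery from the cluster side (assuming the graphrag namespace used elsewhere in this thread):

# the manager CronJob and the Jobs it has created so far
kubectl get cronjob -n graphrag
kubectl get jobs -n graphrag

# the indexing pod and its logs for a specific run
kubectl get pods -n graphrag
kubectl logs job/indexing-job-<id> -n graphrag -f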

fangnster commented 3 months ago

When you initiate an indexing job, a record of it is put into CosmosDB for the job and it is listed in a state of "Scheduled."

The K8s CronJob runs every 5 mins and checks CosmosDB for Scheduled indexing jobs, and then initiates actual indexing processes for them in order. It uses a k8s Job deployment for an indexing pod to be spun up (the indexing-<id> pod).

Could you tell me how to change the CronJob interval from 5 minutes to a longer one?

timothymeyers commented 3 months ago

you can edit the template for the cron job by doing kubectl edit cj/graphrag-index-manager

and looking for the schedule: "*/5 * * * *" line. Change the number to a different number of minutes, and save the manifest.

Note that if you want to change it permanently between deployments, you'd change it in this file, and redeploy the backend container to Azure Container Registry.
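If an interactive editor is not available, the same change can be made non-interactively; a sketch, assuming the graphrag namespace (this edits only the live CronJob and is lost when the backend is redeployed):

kubectl patch cronjob graphrag-index-manager -n graphrag -p '{"spec": {"schedule": "*/15 * * * *"}}'

# verify the new schedule
kubectl get cronjob graphrag-index-manager -n graphrag -o jsonpath='{.spec.schedule}'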

fangnster commented 3 months ago

you can edit the template for the cron job by doing kubectl edit cj/graphrag-index-manager

and looking for the schedule: "*/5 * * * *" line. Change the number to a different number of minutes, and save the manifest.

Note that if you want to change it permanently between deployments, you'd change it in this file, and redeploy the backend container to Azure Container Registry.

(screenshot) I tried to resolve the error via search_engine, but every attempt failed, as shown in the screenshot above.

In addition, I redeployed this file with the schedule changed to "*/15 * * * *", and the deployment succeeded. However, when I check the CronJob with "kubectl describe cronjob", it still shows the former "*/5 * * * *" schedule.

MeroZemory commented 3 months ago

@fangnster Please stop or delete the index. There are many APIs available for deleting the index and storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, manager jobs named "graphrag-index-manager-****" start up one by one: each is killed automatically after 5 minutes, and then another one starts. As a result, the "reason: Conflict" error shown in the first comment appears in the logs.

Screenshot as follows: (screenshot)

I've been stuck in the scheduled 0.0% state for a long time too, but the cronjob that manages the indexing jobs (created from indexing-job-manager-template.yaml) runs every 5 minutes, which doesn't seem to have anything to do with the indexing processing time (you said it runs/shuts down every 5 minutes, but I understand that only the run is every 5 minutes).

My guess is that you're simply getting that error because the AKS job (indexing-job-*) didn't complete and no new indexing job was created, but the one that was already created is still running.

If there's a problem, it's probably the part where the indexing job doesn't complete and hangs. (I haven't had indexing complete in over 30 minutes either, but I'm not sure if it's in progress or hanging).
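One way to tell "still running" apart from "never started" is to look at the indexing pod itself rather than the Job (a sketch, assuming the graphrag namespace):

kubectl get pods -n graphrag
# if the pod is Pending, the Events section usually explains why (e.g. FailedScheduling, insufficient CPU/memory)
kubectl describe pod <indexing-pod-name> -n graphrag
# if the pod is Running, follow its logs to see whether the pipeline is actually progressing
kubectl logs <indexing-pod-name> -n graphrag -f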

fangnster commented 3 months ago

@fangnster Please stop or delete the index. There are many APIs available for deleting the index and storage. Please check your APIM.

I have deleted the previous index and storage files and restarted with new file names. Watching the indexing, manager jobs named "graphrag-index-manager-****" start up one by one: each is killed automatically after 5 minutes, and then another one starts. As a result, the "reason: Conflict" error shown in the first comment appears in the logs. Screenshot as follows: (screenshot)

I've been stuck in the scheduled 0.0% state for a long time too, but the cronjob that manages the indexing jobs (created from indexing-job-manager-template.yaml) runs every 5 minutes, which doesn't seem to have anything to do with the indexing processing time (you said it runs/shuts down every 5 minutes, but I understand that only the run is every 5 minutes).

My guess is that you're simply getting that error because the AKS job (indexing-job-*) didn't complete and no new indexing job was created, but the one that was already created is still running.

If there's a problem, it's probably the part where the indexing job doesn't complete and hangs. (I haven't had indexing complete in over 30 minutes either, but I'm not sure if it's in progress or hanging).

I found that the indexing job's pod was mostly hanging, while graphrag-index-manager started every 5 minutes and logged the "already exists" error.

mb-porini commented 3 months ago

Hi! I hope to find a little help from the community. I'm trying to get the indexer from {scheduled, completion_percent: 0.0} to "running", but I cannot find the correct API call on the backend. I'm following the Quickstart.ipynb with the Wikipedia dataset.

Am I missing anything?

Thanks!

timothymeyers commented 2 months ago

@mb-porini - if the index status is showing as {scheduled} then the backend API has done what it is going to do. There is a k8s CronJob that spins up every 5 minutes to check for indexing jobs in a {scheduled} state and will kick off a k8s Job to start the process. See if you can look at the Pod logs for recently completed CronJobs or Jobs. The indexer-<hash> pod is the one that would update the indexing status to {running}.

@fangnster - If the indexer-<hash> pod fails unexpectedly, you can get into a situation where the index state in CosmosDB is out of sync (it will say 'running' when it should be 'failed'). You can either (1) delete the index using the API (AdvancedStart notebook has code for this), (2) rerun the index with a new index name, (3) go into CosmosDB and manually update the status to read 'failed' and then restart the indexing job via the API again. (1) and (2) are considerably more straightforward than (3), I think. Also, for future reference, kubectl edit failed because you don't have vi or vim installed.
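On the kubectl edit point, a small workaround sketch for when vi/vim is missing from the shell you are using (nano here is an assumption; point KUBE_EDITOR at whatever editor is installed):

KUBE_EDITOR="nano" kubectl edit cronjob/graphrag-index-manager -n graphrag

A kubectl patch of the CronJob, as shown earlier in the thread, also avoids needing an editor at all.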

fangnster commented 2 months ago

@mb-porini - if the index status is showing as {scheduled} then the backend API has done what it is going to do. There is a k8s CronJob that spins up every 5 minutes to check for indexing jobs in a {scheduled} state and will kick off a k8s Job to start the process. See if you can look at the Pod logs for recently completed CronJobs or Jobs. The indexer-<hash> pod is the one that would update the indexing status to {running}.

@fangnster - If the indexer-<hash> pod fails unexpectedly, you can get into a situation where the index state in CosmosDB is out of sync (it will say 'running' when it should be 'failed'). You can either (1) delete the index using the API (AdvancedStart notebook has code for this), (2) rerun the index with a new index name, (3) go into CosmosDB and manually update the status to read 'failed' and then restart the indexing job via the API again. (1) and (2) are considerably more straightforward than (3), I think. Also, for future reference, kubectl edit failed because you don't have vi or vim installed.

Hi, thanks for your kind and detailed responses to my question. I have tried options (1) and (2) several times, but the error is the same as before, shown in the screenshot below:

(screenshot)

KubaBir commented 2 months ago

I have the same problem. The index-manager kicks off, starts an indexing-job, and later crashes due to the duplicate name (trying to queue the same job again?). The job remains in a Pending state and the manager in CrashLoopBackOff.

I deployed the repo from the VS Code container and am using gpt-4o-mini. (screenshot)

Here is the whole log from the index-manager:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '100823ba-e15d-47a0-a92f-d08284067eba', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '563bf5c9-3718-45a1-837b-473fd3f842d9', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e1bfb844-2c87-4f34-9901-7895c0dc1c2b', 'Date': 'Fri, 30 Aug 2024 13:50:41 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-3e23e8160039594a33894f6564e1b134\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-3e23e8160039594a33894f6564e1b134","group":"batch","kind":"jobs"},"code":409}

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '100823ba-e15d-47a0-a92f-d08284067eba', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '563bf5c9-3718-45a1-837b-473fd3f842d9', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e1bfb844-2c87-4f34-9901-7895c0dc1c2b', 'Date': 'Fri, 30 Aug 2024 13:50:41 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-3e23e8160039594a33894f6564e1b134\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-3e23e8160039594a33894f6564e1b134","group":"batch","kind":"jobs"},"code":409}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 120, in <module>
    main()
  File "/backend/manage-indexing-jobs.py", line 116, in main
    schedule_indexing_job(index_to_schedule)
  File "/backend/manage-indexing-jobs.py", line 55, in schedule_indexing_job
    pipeline_job["status"] = PipelineJobState.FAILED
TypeError: 'PipelineJob' object does not support item assignment

I also tried removing the index via the API (it's removed along with the storage container) and changing the names of both the storage container and the index.

alopezcruz commented 1 week ago

Did anyone find a solution for this? I'm stuck in the same situation: the job has stayed at 0.0% completed for the last two days with no errors. Thank you in advance.

mb-porini commented 6 days ago

Hi @alopezcruz,

To be clear, I made a big mistake. During the configuration of the environment I decided to use a different compute size because of my subscription limits. I figured out a little later that the YAML configuration file specifies some minimum requirements that have to be met. My problem was solved by completing the corrected configuration; after that it worked perfectly.
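A quick sketch for checking whether the nodes you picked actually satisfy the indexing pod's resource requests (the graphrag namespace and pod name placeholder are assumptions):

# allocatable CPU and memory on each node
kubectl describe nodes | grep -A 6 Allocatable
# what the indexing pod is requesting
kubectl get pod <indexing-pod-name> -n graphrag -o jsonpath='{.spec.containers[*].resources}'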

Thanks for asking

alopezcruz commented 6 days ago

hi @mb-porini ,

Thank you for the reply. Can you guide me to the 'YAML configuration file'? Where is this file in the repo?

Best,

mb-porini commented 5 days ago

Well, there are quite a few YAML files, so I suggest modifying them only if you are comfortable handling them. Moreover, I'm quite sure the main file was this one. Good luck.

timothymeyers commented 5 days ago

Check your Kubernetes cluster and see if you have an indexing pod running. Check the pod log streams for any errors.

How big of a dataset are you working with?


alopezcruz commented 5 days ago

Check your Kubernetes cluster and see if you have an indexing pod running. Check the pod log streams for any errors. How big of a dataset are you working with?

Hi @timothymeyers , @mb-porini

Thank you for the reply, appreciated. Below are the logs from the index job manager. Also, @timothymeyers, the dataset being used is the 'California' sample from Wikipedia:

Scheduling job for index: TestCalifornia
[ERROR] 2024-11-21 15:15:13,384 - Index job manager encountered error scheduling indexing job
Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '66821845-5b25-4c2e-9789-28a9a888d6ed', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '92193793-5a4f-4190-b374-38aba384c2b0', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'f6b2ffef-9aad-4623-835b-fc9919c36491', 'Date': 'Thu, 21 Nov 2024 15:15:12 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-6c416233b5012abcfea8417472538ad1\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-6c416233b5012abcfea8417472538ad1","group":"batch","kind":"jobs"},"code":409}

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 43, in schedule_indexing_job
    batch_v1.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '66821845-5b25-4c2e-9789-28a9a888d6ed', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '92193793-5a4f-4190-b374-38aba384c2b0', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'f6b2ffef-9aad-4623-835b-fc9919c36491', 'Date': 'Thu, 21 Nov 2024 15:15:12 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"indexing-job-6c416233b5012abcfea8417472538ad1\" already exists","reason":"AlreadyExists","details":{"name":"indexing-job-6c416233b5012abcfea8417472538ad1","group":"batch","kind":"jobs"},"code":409}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/backend/manage-indexing-jobs.py", line 120, in <module>
    main()
  File "/backend/manage-indexing-jobs.py", line 116, in main
    schedule_indexing_job(index_to_schedule)
  File "/backend/manage-indexing-jobs.py", line 55, in schedule_indexing_job
    pipeline_job["status"] = PipelineJobState.FAILED
TypeError: 'PipelineJob' object does not support item assignment

Please let me know your thoughts. Thank you again for your time.

alopezcruz commented 4 days ago

Hi @timothymeyers, @mb-porini,

I did figure this out, you advise put me on the right track, I leave below what I found in case others face similar issue. In Azure RG while auditing the Kubernetes service resource, node pools (index, graph rag, agent), I noticed some scaling warnings (related to index node), so going into 'Node pools' -> 'index node' and opening the 'scale node pool' noticed the machine or VM (Standard E8s v5 east us) have 0 quotas available and off course not able scale this node at all ( manually or auto scaling ,you should test adding a minimum node count). Finally, after requesting the quotas VM (Standard E8s v5 east us) and quotas were applied, everything when smooth and all indexing jobs were completed to 100%. May be this can be included in the installation or documentation, warning about quotas for specific nodes, etc., my installation was smooth but never gave me a warning or stop me about this. Anyway, thank you again for your support. Best,