Azure-Samples / graphrag-accelerator

One-click deploy of a Knowledge Graph powered RAG (GraphRAG) in Azure
https://github.com/microsoft/graphrag
MIT License

[BUG] Checking for GraphRAG availability.................... Failed: error #120

Closed pwine123 closed 1 month ago

pwine123 commented 1 month ago

Describe the bug: During deployment, I get the error "Checking for GraphRAG availability.................... Failed".

To Reproduce: I looked at the AKS pod logs. Both graphrag-query and graphrag-index had warnings with the following messages: "NotTriggerScaleUp: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 Insufficient cpu"

"FailedScheduling: 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.."
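To make the scheduler message easier to read, here is a minimal Python sketch that splits a FailedScheduling message into a node count per reason. The function name `parse_failed_scheduling` is hypothetical, and the parsing assumes the message has the same shape as the one quoted above; it is not an official Kubernetes tool.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into node counts per reason.

    Assumes the shape "<available>/<total> nodes are available: <n> <reason>, ..."
    optionally followed by a "preemption: ..." part, as in the message above.
    """
    header = re.match(r"\s*(\d+)/(\d+) nodes are available:\s*(.*)", message, re.S)
    if header is None:
        raise ValueError("not a FailedScheduling message")
    available = int(header.group(1))
    total = int(header.group(2))
    # Keep only the scheduling reasons; drop the trailing preemption details.
    rest = header.group(3).split(" preemption:")[0].rstrip(". ")
    reasons = {}
    for part in rest.split(", "):
        m = re.match(r"(\d+)\s+(.*)", part.strip())
        if m:
            reasons[m.group(2)] = int(m.group(1))
    return {"available": available, "total": total, "reasons": reasons}


if __name__ == "__main__":
    msg = (
        "0/2 nodes are available: 1 Insufficient cpu, "
        "1 node(s) didn't match Pod's node affinity/selector. "
        "preemption: 0/2 nodes are available: "
        "1 No preemption victims found for incoming pod, "
        "1 Preemption is not helpful for scheduling.."
    )
    print(parse_failed_scheduling(msg))
```

Read this way, the message says that of the two nodes, one was excluded by the pod's node affinity/selector and the other did not have enough unreserved CPU.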

FYI: because of a deployment error I encountered earlier, I had to update aks.bicep to use "standard_d4s_v3" on lines 33 and 41.

I'm not sure whether this is causing the insufficient CPU error. I tried raising the max node counts in aks.bicep, but I still run into the same insufficient CPU error.

Expected behavior: The deployment should succeed.

timothymeyers commented 1 month ago

Hey @pwine123 - I don't think the d4s_v3 VMs have enough CPU/memory given the resource requests in the Helm chart for the GraphRAG deployment (see this file, under both of the resources sections). While the d4s_v3 VMs appear to meet the minimum, they only just meet it, and I'm guessing other things running on the nodes are keeping the pods from acquiring the resources they require.

Perhaps look for another e16as VM SKU for the graphrag node pool? Or lower the resource requests in values.yml (though I'm not sure how that will affect performance).
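The reasoning above can be checked with simple arithmetic: a pod is schedulable on a node only if its CPU request fits into the node's allocatable CPU minus what other pods have already requested. A rough Python sketch follows. The helper names (`parse_cpu`, `fits`) are hypothetical, and every number in the example is an illustrative assumption rather than a value taken from the Helm chart or from a real cluster; the only firm facts are that a d4s_v3 has 4 vCPUs and an e16as has 16.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("500m" or "4") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(allocatable: str, already_requested: list[str], pod_request: str) -> bool:
    """Return True if pod_request fits into the node's remaining CPU."""
    free = parse_cpu(allocatable) - sum(parse_cpu(q) for q in already_requested)
    return parse_cpu(pod_request) <= free


if __name__ == "__main__":
    # Assumed requests from system pods already on the node (illustrative).
    system_pods = ["100m", "250m"]
    # Assumed pod request of 4 CPUs (illustrative; check the chart's values).
    pod_request = "4"

    # 4-vCPU node: allocatable is below 4000m once the node reserves CPU
    # for itself ("3860m" is an assumed figure).
    print(fits("3860m", system_pods, pod_request))   # False

    # 16-vCPU node ("15820m" is an assumed allocatable figure).
    print(fits("15820m", system_pods, pod_request))  # True
```

Under these assumptions, a request equal to the node's full vCPU count can never be scheduled on that node, which would explain why raising the max node count alone did not help.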

For what it's worth, I only ever had to adjust the first "System Virtual Machine" SKU to get AKS to deploy, but everyone's quotas may vary.

rnpramasamyai commented 1 month ago

@pwine123 You are experiencing an insufficient CPU issue. Please request an increase in CPU resources.

pwine123 commented 1 month ago

After moving the graphrag node pool to an e16as VM SKU, the deployment worked. Thanks @timothymeyers and @rnpramasamyai.