Azure / kaito

Kubernetes AI Toolchain Operator
MIT License

Impossible to deploy Phi-3 models in Azure Kubernetes Service #523

Closed: rliberoff closed this issue 1 week ago

rliberoff commented 1 month ago

Describe the bug

I'm trying to deploy a Phi-3 model in AKS, but every time I try to deploy the workspace, I get the following error:

kaito-rag/workspace-phi-3-medium-4k-instruct failed to run apply: error when creating "C:\\Users\\RLIBER~1.PHA\\AppData\\Local\\Temp\\275832899kubectl_manifest.yaml": Internal error occurred: failed calling webhook "validation.workspace.kaito.sh": failed to call webhook: Post "https://workspace-webhook-svc.kube-system.svc:9443/validate/workspace.kaito.sh?timeout=10s": EOF

The YAML definition is as follows:

    apiVersion: kaito.sh/v1alpha1
    kind: Workspace
    metadata:
      name: workspace-phi-3-medium-4k-instruct
      namespace: kaito-rag
      annotations:
        kaito.sh/enablelb: "False"
    resource:
      count: 1
      instanceType: "Standard_NC12s_v3"
      labelSelector:
        matchLabels:
          apps: phi-3-medium-4k-instruct
    inference:
      preset:
        name: "phi-3-medium-4k-instruct"

I'm using France Central as my Azure Region.

Please help!

Thank you!

Steps To Reproduce

I'm using Terraform to deploy the AKS cluster. Everything deploys as expected, but once I execute the following command:

kubectl apply -f .\kaito_workspace_phi-3-medium-4k.yaml

I get the following error immediately:

Error from server (InternalError): error when creating ".\\kaito_workspace_phi-3-medium-4k.yaml": Internal error occurred: failed calling webhook "validation.workspace.kaito.sh": failed to call webhook: Post "https://workspace-webhook-svc.kube-system.svc:9443/validate/workspace.kaito.sh?timeout=10s": EOF

Expected behavior

N/A

Logs

N/A

Environment

Additional context

N/A

ishaansehgal99 commented 1 month ago

OK, I just ran this locally and didn't have any issues. Let's confirm a couple of things, @rliberoff:

  1. The correct image is being used. You should git checkout the v0.3.0 tag on the repo to be sure; that is the official release version.

e.g.

git fetch --tags
git checkout tags/v0.3.0  

The right kaito image for phi-3 support is mcr.microsoft.com/aks/kaito/workspace:0.3.0. Is this image being used? You can check this by running kubectl describe on the kaito-workspace pod (it should be in the kaito-workspace namespace).

It should show: Successfully pulled image "mcr.microsoft.com/aks/kaito/workspace:0.3.0"

  2. Check the logs of your workspace pod and share them here if you see errors (see the commands sketched below).
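
For reference, the checks could look roughly like this; the pod name is a placeholder and the namespace may differ depending on how kaito was installed:

```shell
# List the kaito workspace controller pods (namespace may differ in your setup)
kubectl get pods -n kaito-workspace

# Confirm which image the controller pod is running
kubectl describe pod <workspace-pod-name> -n kaito-workspace | grep "Image:"

# Check the controller logs for errors
kubectl logs <workspace-pod-name> -n kaito-workspace
```
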
Fei-Guo commented 1 month ago

Can you paste the workspace controller log? It should tell whether the webhook server has a problem or not.
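
A couple of quick checks on the webhook itself may also help; the names below are taken from the error message and are a sketch only, not verified against this cluster:

```shell
# Is the webhook service present in kube-system?
kubectl get svc workspace-webhook-svc -n kube-system

# Is the validating webhook configuration registered?
kubectl get validatingwebhookconfigurations | grep -i kaito
```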

rliberoff commented 1 month ago

Hi @ishaansehgal99 and @Fei-Guo ,

Thank you for your answer. Please allow me a few days to get this information. Thank you!

rliberoff commented 1 month ago

Hi @ishaansehgal99 and @Fei-Guo,

I was able to reproduce the error again using the AZ CLI version 2.62.0 with the aks-preview extension enabled in PowerShell on Windows 11, and following the steps documented here → https://learn.microsoft.com/en-us/azure/aks/ai-toolchain-operator

BTW, I have the AIToolchainOperatorPreview feature flag enabled in the subscription:

```json
{
    "id": "/subscriptions/…/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/AIToolchainOperatorPreview",
    "name": "Microsoft.ContainerService/AIToolchainOperatorPreview",
    "properties": {
        "state": "Registered"
    },
"type": "Microsoft.Features/providers/features"
}
```
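
(For reference, that JSON is the kind of output returned by a feature-registration check, for example:)

```shell
az feature show --namespace Microsoft.ContainerService --name AIToolchainOperatorPreview
```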

The region I'm deploying the resources to is France Central.

Here are the steps I have just performed to reproduce the error:

  1. Created the variables:
$AZURE_SUBSCRIPTION_ID="c93…354"
$AZURE_RESOURCE_GROUP="rg-relv-test-kaito"
$AZURE_LOCATION="francecentral"
$CLUSTER_NAME="aks-relv-test-kaito"
  2. Create an Azure resource group:
az group create --name $AZURE_RESOURCE_GROUP --location $AZURE_LOCATION
  3. Create an AKS cluster with the AI toolchain operator add-on enabled:
az aks create --location $AZURE_LOCATION --resource-group $AZURE_RESOURCE_GROUP --name $CLUSTER_NAME --enable-oidc-issuer --enable-ai-toolchain-operator --generate-ssh-keys
  4. Configure kubectl to connect to the new cluster:
az aks get-credentials --resource-group $AZURE_RESOURCE_GROUP --name $CLUSTER_NAME
  5. Then verified the connection to my cluster:
kubectl get nodes

Output:
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-20454229-vmss000000   Ready    agent   3m26s   v1.28.10
aks-nodepool1-20454229-vmss000001   Ready    agent   3m29s   v1.28.10
aks-nodepool1-20454229-vmss000002   Ready    agent   3m6s    v1.28.10
  6. Export environment variables for the MC resource group, principal ID, KAITO identity name, and the AKS OIDC issuer URL:
$MC_RESOURCE_GROUP=$(az aks show --resource-group $AZURE_RESOURCE_GROUP --name $CLUSTER_NAME --query nodeResourceGroup -o tsv)
$PRINCIPAL_ID=$(az identity show --name "ai-toolchain-operator-$CLUSTER_NAME" --resource-group "$MC_RESOURCE_GROUP" --query 'principalId' -o tsv)
$KAITO_IDENTITY_NAME="ai-toolchain-operator-$CLUSTER_NAME"
$AKS_OIDC_ISSUER=$(az aks show --resource-group "$AZURE_RESOURCE_GROUP" --name "$CLUSTER_NAME" --query "oidcIssuerProfile.issuerUrl" -o tsv)
  7. Create a new role assignment for the service principal:
az role assignment create --role "Contributor" --assignee "$PRINCIPAL_ID" --scope "/subscriptions/c93…354/resourcegroups/$AZURE_RESOURCE_GROUP"
  8. Create the federated identity credential between the managed identity, AKS OIDC issuer, and subject:
az identity federated-credential create --name "kaito-federated-identity" --identity-name "$KAITO_IDENTITY_NAME" -g "$MC_RESOURCE_GROUP" --issuer "$AKS_OIDC_ISSUER" --subject system:serviceaccount:"kube-system:kaito-gpu-provisioner" --audience api://AzureADTokenExchange
  9. Restarted the kaito-gpu-provisioner deployment:
kubectl rollout restart deployment/kaito-gpu-provisioner -n kube-system

Output:
deployment.apps/kaito-gpu-provisioner restarted
  10. Then verified that the KAITO deployments are running:
kubectl get deployment -n kube-system | grep kaito

Output:
kaito-gpu-provisioner   1/1     1            1           8m24s
kaito-workspace         1/1     1            1           8m24s
  11. This is the last step, and the one that produced an error:
kubectl apply -f https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_phi_3.yaml

The error is:

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/Azure/kaito/main/examples/inference/kaito_workspace_phi_3.yaml": Internal error occurred: failed calling webhook "validation.workspace.kaito.sh": failed to call webhook: Post "https://workspace-webhook-svc.kube-system.svc:9443/validate/workspace.kaito.sh?timeout=10s": EOF

Looking into the AKS cluster, and following the advice from @ishaansehgal99, I checked the kaito-workspace pod, and it seems it is using mcr.microsoft.com/aks/kaito/workspace:0.2.2 instead of mcr.microsoft.com/aks/kaito/workspace:0.3.0.

(screenshot of the kaito-workspace pod description omitted)
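
A way to double-check the image version of the managed add-on from the CLI (assuming the add-on's kaito-workspace deployment lives in kube-system, as the grep output above shows):

```shell
kubectl get deployment kaito-workspace -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
```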

So it seems the issue is with AKS not being updated (perhaps in the francecentral region) with the latest release of Kaito.

Do you know if there is a way to upgrade the workspace to the 0.3.0 version?

And if so, could you please guide me with the necessary steps? Sadly I'm a newbie in Kubernetes and AKS.

Your help and assistance is much appreciated.

Thank you!

rliberoff commented 1 month ago

So, no matter what I do to force the AKS deployment to use mcr.microsoft.com/aks/kaito/workspace:0.3.0, it eventually goes back to mcr.microsoft.com/aks/kaito/workspace:0.2.2 😢

Is there a way to tell AKS to use mcr.microsoft.com/aks/kaito/workspace:0.3.0 instead of mcr.microsoft.com/aks/kaito/workspace:0.2.2?

Thank you.

Fei-Guo commented 1 month ago

@rliberoff, you are using the AKS-managed kaito addon. We have not released 0.3.0 in the AKS addon yet. If you want to use phi3, please use the upstream chart installation guide and install an upstream version for now: https://github.com/Azure/kaito/blob/main/docs/installation.md
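
A rough sketch of the upstream installation, assuming the chart lives under charts/kaito/workspace in a checkout of the repo (docs/installation.md is the authoritative reference):

```shell
# Clone the repo at the 0.3.0 release tag
git clone https://github.com/Azure/kaito.git
cd kaito
git checkout tags/v0.3.0

# Install the workspace controller from the local chart
helm install kaito-workspace ./charts/kaito/workspace --namespace kaito-workspace --create-namespace
```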

rliberoff commented 1 month ago

Hi @Fei-Guo,

I see. OK, I need to see how to translate those instructions into a Terraform script. At the end of the day, we are deploying the whole solution we're building with Terraform.

Thank you.

rliberoff commented 1 month ago

Hi @Fei-Guo and @ishaansehgal99 ,

I was able to deploy Kaito version 0.3.0 using Terraform, based on the documentation you suggested (👉🏻 https://github.com/Azure/kaito/blob/main/docs/installation.md).

But I'm not able to create a Phi-3 medium model.

My YAML is as follows:

    apiVersion: kaito.sh/v1alpha1
    kind: Workspace
    metadata:
      name: workspace-phi-3-medium
      namespace: "kaito-rag"
      annotations:
        kaito.sh/enablelb: "False"
    resource:
      count: 1
      instanceType: "Standard_NC12s_v3"
      labelSelector:
        matchLabels:
          apps: phi-3
    inference:
      preset:
        name: "phi-3-medium-4k-instruct"

What am I doing wrong?

Is Phi-3 medium supported?

I thought it was, based on the code here: https://github.com/Azure/kaito/blob/f259329a4e1cff3d1f5a7846c89733619e2e9d4a/presets/models/phi3/model.go#L34-L37

Thank you!

Fei-Guo commented 1 month ago

@rliberoff, what was the error? Is the inference deployment created or not? If it is created, please share the container log of the inference pod. If the deployment is not created, please check the workspace status; in particular, was a GPU node (Standard_NC12s_v3) created successfully?
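
For example (a rough sketch; the workspace name and namespace are taken from your YAML above):

```shell
# Workspace status, including resource provisioning and inference readiness
kubectl get workspace workspace-phi-3-medium -n kaito-rag
kubectl describe workspace workspace-phi-3-medium -n kaito-rag

# Was the Standard_NC12s_v3 GPU node created?
kubectl get nodes -o wide
```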

rliberoff commented 1 month ago

BTW, the gpu-provisioner deployed using Helm as described in the documentation is also not starting. In AKS it looks like this:

(screenshot from the AKS portal omitted)

Running kubectl describe on the pod shows this information:

Name:                 gpu-provisioner-57d5c4959b-lkwkx
Namespace:            gpu-provisioner
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      gpu-provisioner
Node:                 aks-user-38844619-vmss000000/10.241.0.4
Start Time:           Thu, 25 Jul 2024 18:54:52 +0200
Labels:               app.kubernetes.io/instance=gpu-provisioner
                      app.kubernetes.io/name=gpu-provisioner
                      azure.workload.identity/use=true
                      pod-template-hash=57d5c4959b
Annotations:          checksum/settings: 50517b08c8328802043c3fdcb348c2a8847cc84c62f86242db7fe59824f0ba83
                      kubectl.kubernetes.io/restartedAt: 2024-07-25T18:54:51+02:00
Status:               Running
IP:                   172.0.3.167
IPs:
  IP:           172.0.3.167
Controlled By:  ReplicaSet/gpu-provisioner-57d5c4959b
Containers:
  controller:
    Container ID:   containerd://d6d8cd513dae110dd7770e86638a5dba1cd36460169dd7e1f9d9776f8c1456f5
    Image:          mcr.microsoft.com/aks/kaito/gpu-provisioner:0.2.0
    Image ID:       mcr.microsoft.com/aks/kaito/gpu-provisioner@sha256:1204a7e948e9a5efbe14561e14ed6fb0bc5936aaf787e870bd6416da5b584874
    Port:           8081/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 25 Jul 2024 18:57:55 +0200
      Finished:     Thu, 25 Jul 2024 18:57:58 +0200
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:  500m
    Requests:
      cpu:      200m
    Liveness:   http-get http://:http/healthz delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/readyz delay=5s timeout=30s period=10s #success=1 #failure=3
    Environment:
      CONFIG_LOGGING_NAME:         gpu-provisioner-config-logging
      SYSTEM_NAMESPACE:            gpu-provisioner (v1:metadata.namespace)
      ARM_SUBSCRIPTION_ID:         c93dfe1e-224e-4aad-a8b6-6624b4537354
      LOCATION:                    francecentral
      AZURE_CLUSTER_NAME:          aks-kaito-rag-47ed0
      AZURE_NODE_RESOURCE_GROUP:   MC_rg-kaito-rag-47ed0_aks-kaito-rag-47ed0_francecentral
      ARM_RESOURCE_GROUP:          rg-kaito-rag-47ed0
      LEADER_ELECT:                false
      E2E_TEST_MODE:               false
      AZURE_CLIENT_ID:             f99ece97-9543-4263-93cf-abc904a1ee9e
      AZURE_TENANT_ID:             80c...-...-...-...-...4cd
      AZURE_FEDERATED_TOKEN_FILE:  /var/run/secrets/azure/tokens/azure-identity-token
      AZURE_AUTHORITY_HOST:        https://login.microsoftonline.com/
    Mounts:
      /var/run/secrets/azure/tokens from azure-identity-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5d9cl (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-5d9cl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  azure-identity-token:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3600
QoS Class:                    Burstable
Node-Selectors:               kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=gpu-provisioner,app.kubernetes.io/name=gpu-provisioner
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m51s                  default-scheduler  Successfully assigned gpu-provisioner/gpu-provisioner-57d5c4959b-lkwkx to aks-user-38844619-vmss000000
  Normal   Pulling    4m51s                  kubelet            Pulling image "mcr.microsoft.com/aks/kaito/gpu-provisioner:0.2.0"
  Normal   Pulled     4m49s                  kubelet            Successfully pulled image "mcr.microsoft.com/aks/kaito/gpu-provisioner:0.2.0" in 1.113s (1.113s including waiting)
  Warning  BackOff    3m25s (x9 over 4m42s)  kubelet            Back-off restarting failed container controller in pod gpu-provisioner-57d5c4959b-lkwkx_gpu-provisioner(bc9a5c1e-ec33-4419-a6a8-a4e976f0b8f4)
  Normal   Created    3m11s (x5 over 4m49s)  kubelet            Created container controller
  Normal   Started    3m11s (x5 over 4m49s)  kubelet            Started container controller
  Normal   Pulled     3m11s (x4 over 4m45s)  kubelet            Container image "mcr.microsoft.com/aks/kaito/gpu-provisioner:0.2.0" already present on machine

I'd really appreciate any help to get this running on AKS.

Thank you.

Fei-Guo commented 1 month ago

Have you set up the workload identity? Note that before finishing this step, the gpu-provisioner controller pod will constantly fail with the following message in the log.....

Can you show the log of the gpu-provisioner pod?
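
For example, using the namespace shown in your describe output above (the deployment name is inferred from the ReplicaSet name and may differ):

```shell
kubectl logs deployment/gpu-provisioner -n gpu-provisioner
```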

rliberoff commented 1 month ago

Hi @Fei-Guo,

Yes, just a few minutes ago I was able to access the log of the gpu-provisioner, and there was an issue with the federated identity. I'm now trying to fix it.

On the other hand, could you please tell me if Phi-3 Medium is currently supported in 0.3.0? It seems it is not 🫤

Thank you.

Fei-Guo commented 1 month ago

https://github.com/Azure/kaito/blob/66f57116fa9827a13c023d57b300928ecd2ce640/presets/models/supported_models.yaml#L115 It is supported; we should have a model image ready for phi3-medium. I found the model name is missing in the doc https://github.com/Azure/kaito/tree/main/presets/models/phi3 and will fix it.

rliberoff commented 1 month ago

Hi @Fei-Guo,

So, finally I was able to deploy the phi-3 medium 😃

Sadly, it takes forever to answer. In fact, I was unable to get an answer from it. I think I'm using a quite capable VM (the Standard_NC12s_v3), and yet a curl with a simple question such as "Tell me about Tuscany and its cities." never gets a response.

On the other hand, the phi-3 mini did work. I think I will try to adjust the prompt to get the expected response format from the phi-3 mini model and forget about the phi-3 medium.

Thanks a lot for your help!

rliberoff commented 1 month ago

Hi @Fei-Guo and @ishaansehgal99,

The following is the kubectl describe output of the pod containing the phi-3-medium model.

The thing is that everything deploys successfully, but the model never answers a question.

This is the description:

Name:             kaito-workspace-phi-3-medium-4k-instruct-5d685658b9-dzqfw
Namespace:        kaito-rag
Priority:         0
Service Account:  default
Node:             aks-ws3e3bbb692-12520279-vmss000000/10.240.0.7
Start Time:       Mon, 29 Jul 2024 19:57:18 +0200
Labels:           kaito.sh/workspace=kaito-workspace-phi-3-medium-4k-instruct
                  pod-template-hash=5d685658b9
Annotations:      <none>
Status:           Running
IP:               172.0.4.208
IPs:
  IP:           172.0.4.208
Controlled By:  ReplicaSet/kaito-workspace-phi-3-medium-4k-instruct-5d685658b9
Containers:
  kaito-workspace-phi-3-medium-4k-instruct:
    Container ID:  containerd://60c28966a1cfb3fa109acecd67d638aa21ae36d93a523b61b71b365f36ae25be
    Image:         mcr.microsoft.com/aks/kaito/kaito-phi-3-medium-4k-instruct:0.0.1
    Image ID:      mcr.microsoft.com/aks/kaito/kaito-phi-3-medium-4k-instruct@sha256:c106975f8e09a03d32118c7fff6f690ab705587dcdb83b2aad41d3c9ed30b740
    Port:          5000/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
      accelerate launch --gpu_ids=all --num_processes=1 --num_machines=1 --machine_rank=0 inference_api.py --torch_dtype=auto --pipeline=text-generation --trust_remote_code
    State:          Running
      Started:      Mon, 29 Jul 2024 20:03:14 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Liveness:          http-get http://:5000/healthz delay=600s timeout=1s period=10s #success=1 #failure=3
    Readiness:         http-get http://:5000/healthz delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fm5ql (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-fm5ql:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
                             sku=gpu:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  44m   default-scheduler  Successfully assigned kaito-rag/kaito-workspace-phi-3-medium-4k-instruct-5d685658b9-dzqfw to aks-ws3e3bbb692-12520279-vmss000000
  Normal  Pulling    44m   kubelet            Pulling image "mcr.microsoft.com/aks/kaito/kaito-phi-3-medium-4k-instruct:0.0.1"
  Normal  Pulled     38m   kubelet            Successfully pulled image "mcr.microsoft.com/aks/kaito/kaito-phi-3-medium-4k-instruct:0.0.1" in 5m55.332s (5m55.332s including waiting)
  Normal  Created    38m   kubelet            Created container kaito-workspace-phi-3-medium-4k-instruct
  Normal  Started    38m   kubelet            Started container kaito-workspace-phi-3-medium-4k-instruct

To test, I'm doing a port-forward and then the following curl:

curl -X POST http://localhost:5000/chat -H "accept: application/json" -H "Content-Type: application/json" -d '{"prompt":"Tell me about Tuscany and its cities.", "return_full_text": false, "generate_kwargs": {"max_length":4096}}'
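
(The port-forward itself is something along these lines, using the pod name from the describe output above; the local port just needs to match the curl:)

```shell
kubectl port-forward pod/kaito-workspace-phi-3-medium-4k-instruct-5d685658b9-dzqfw 5000:5000 -n kaito-rag
```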

The VM size used for this model is a Standard_NC12s_v3.

The logs in the pod are mostly INFO: 10.240.0.7:48044 - "GET /healthz HTTP/1.1" 200 OK, but there are a few lines with something different:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 6/6 [00:03<00:00,  1.96it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
INFO:     Started server process [19]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
Model: Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 5120, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-39): 40 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (qkv_proj): Linear(in_features=5120, out_features=7680, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=5120, out_features=35840, bias=False)
          (down_proj): Linear(in_features=17920, out_features=5120, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=5120, out_features=32064, bias=False)
)
INFO:     10.240.0.7:59630 - "GET /healthz HTTP/1.1" 200 OK
INFO:     10.240.0.7:36058 - "GET /healthz HTTP/1.1" 200 OK
INFO:     10.240.0.7:40306 - "GET /healthz HTTP/1.1" 200 OK
...
INFO:     10.240.0.7:40798 - "GET /healthz HTTP/1.1" 200 OK
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
WARNING:transformers_modules.weights.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
INFO:     10.240.0.7:33918 - "GET /healthz HTTP/1.1" 200 OK
...

Why does this work with phi-3-mini but not with phi-3-medium? 🫤

Any help is appreciated. Thank you!

ishaansehgal99 commented 1 month ago

Hi @rliberoff, thanks for sharing.

I recommend reducing the max length to speed up requests for the medium model. Additionally, we're adjusting the deployment specs to utilize all available GPUs, which should drastically improve inference time. We'll release this fix as soon as possible.
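
For example, the same request with a smaller max_length (512 is just an illustrative value) would look like this:

```shell
curl -X POST http://localhost:5000/chat \
  -H "accept: application/json" -H "Content-Type: application/json" \
  -d '{"prompt":"Tell me about Tuscany and its cities.", "return_full_text": false, "generate_kwargs": {"max_length":512}}'
```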

Fei-Guo commented 1 month ago

To temporarily try out the fix manually, you can edit the inference workload template and change the resource request for GPU to 2 if you are using "Standard_NC12s_v3". You should see much better performance for inference.

rliberoff commented 1 month ago

Hi @Fei-Guo,

Thank you for the information. However, I don't understand what you mean by the inference workload template. I'm quite new to this.

Could you please provide or point to an example of one of these templates?

Thank you!

rliberoff commented 2 weeks ago

Hey guys,

Could you please tell me how I can adjust the deployment specs to utilize all available GPUs on the Standard_NC12s_v3?

Thank you!

ishaansehgal99 commented 2 weeks ago

Apologies for the delay. You can edit the deployment specification by running the following command:

kubectl edit deployment <deployment_name>

In the configuration, update the resource limits and requests from:

        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"

to

        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            nvidia.com/gpu: "2"
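
Alternatively, here is an untested sketch of the same change as a single kubectl patch; the deployment, namespace, and container names below come from the earlier describe output and must match your workspace:

```shell
# Strategic merge patch: containers are merged by name, so only the GPU resources change
kubectl patch deployment kaito-workspace-phi-3-medium-4k-instruct -n kaito-rag --type strategic -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "kaito-workspace-phi-3-medium-4k-instruct",
            "resources": {
              "limits":   { "nvidia.com/gpu": "2" },
              "requests": { "nvidia.com/gpu": "2" }
            }
          }
        ]
      }
    }
  }
}'
```
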
rliberoff commented 1 week ago

Hi @ishaansehgal99,

I'm so sorry, but I've been trying like crazy to make this work. I don't understand how I can change the values of nvidia.com/gpu from 1 to 2 without actually cloning this repo and making these changes manually.

Is there a way to set these configurations using the kaito.sh/v1alpha1 Workspace CRD? Or when deploying the Kaito workspace from the Helm chart?

I'm trying to set up a demo that uses Terraform to deploy Kaito, but using Phi-3 Medium (because Phi-3 Mini hallucinates a lot).

How can I tell it to use two GPUs?

To give you an idea, this is my current Terraform script for Kaito:

data "azurerm_subscription" "current" {
}

resource "azurerm_role_assignment" "kaito_provisioner_assigned_identity_contributor_role" {
  principal_id         = data.azurerm_user_assigned_identity.kaito_identity.principal_id
  scope                = var.aks_id
  role_definition_name = "Contributor"
}

resource "kubernetes_namespace" "kaito_namespace" {
  metadata {
    name = var.kaito_aks_namespace
  }
}

resource "azapi_update_resource" "enable_kaito" {
  count       = var.use_upstream_version ? 0 : 1
  type        = "Microsoft.ContainerService/managedClusters@2024-03-02-preview"
  resource_id = var.aks_id

  body = jsonencode({
    properties = {
      aiToolchainOperatorProfile = {
        enabled = true
      }
    }
  })
}

data "azurerm_user_assigned_identity" "kaito_identity" {
  name                = var.kaito_identity_name
  resource_group_name = var.kaito_identity_resource_group_name

  depends_on = [azapi_update_resource.enable_kaito]
}

resource "azurerm_federated_identity_credential" "kaito_federated_identity_credential" {
  name                = "id-federated-kaito"
  resource_group_name = data.azurerm_user_assigned_identity.kaito_identity.resource_group_name
  parent_id           = data.azurerm_user_assigned_identity.kaito_identity.id
  issuer              = var.aks_oidc_issuer_url
  audience            = ["api://AzureADTokenExchange"]
  subject             = var.use_upstream_version ? "system:serviceaccount:gpu-provisioner:gpu-provisioner" : "system:serviceaccount:kube-system:kaito-gpu-provisioner"
}

resource "helm_release" "kaito_workspace" {
  count            = var.use_upstream_version ? 1 : 0
  name             = "kaito-workspace"
  chart            = "${path.module}/charts/kaito/workspace/"
  namespace        = kubernetes_namespace.kaito_namespace.metadata.0.name
  create_namespace = false
}

resource "helm_release" "gpu_provisioner" {
  count = var.use_upstream_version ? 1 : 0
  name  = "kaito-gpu-provisioner"
  chart = "https://github.com/Azure/gpu-provisioner/raw/gh-pages/charts/gpu-provisioner-${var.gpu_provisioner_version}.tgz"
  wait  = true

  set {
    name  = "settings.azure.clusterName"
    value = var.aks_name
  }

  set {
    name  = "replicas"
    value = var.gpu_provisioner_replicas
  }

  set {
    name  = "controller.env[0].name"
    value = "ARM_SUBSCRIPTION_ID"
  }
  set {
    name  = "controller.env[0].value"
    value = data.azurerm_subscription.current.subscription_id
  }

  set {
    name  = "controller.env[1].name"
    value = "LOCATION"
  }
  set {
    name  = "controller.env[1].value"
    value = var.aks_location
  }

  set {
    name  = "controller.env[2].name"
    value = "AZURE_CLUSTER_NAME"
  }
  set {
    name  = "controller.env[2].value"
    value = var.aks_name
  }

  set {
    name  = "controller.env[3].name"
    value = "AZURE_NODE_RESOURCE_GROUP"
  }
  set {
    name  = "controller.env[3].value"
    value = var.aks_node_resource_group_name
  }

  set {
    name  = "controller.env[4].name"
    value = "ARM_RESOURCE_GROUP"
  }
  set {
    name  = "controller.env[4].value"
    value = var.resource_group_name
  }

  set {
    name  = "controller.env[5].name"
    value = "LEADER_ELECT"
  }
  set {
    name  = "controller.env[5].value"
    value = "false"
    type  = "string" # Forcefully set the type as `string` to avoid the error: `…cannot unmarshal bool into Go struct field EnvVar.spec.template.spec.containers.env.value of type string…`
  }

  set {
    name  = "controller.env[6].name"
    value = "E2E_TEST_MODE"
  }
  set {
    name  = "controller.env[6].value"
    value = "false"
    type  = "string" # Forcefully set the type as `string` to avoid the error: `…cannot unmarshal bool into Go struct field EnvVar.spec.template.spec.containers.env.value of type string…`
  }

  set {
    name  = "workloadIdentity.clientId"
    value = data.azurerm_user_assigned_identity.kaito_identity.client_id
  }

  set {
    name  = "workloadIdentity.tenantId"
    value = data.azurerm_user_assigned_identity.kaito_identity.tenant_id
  }
}

resource "kubectl_manifest" "kaito_ai_model" {
  yaml_body = <<-EOF
    apiVersion: kaito.sh/v1alpha1
    kind: Workspace
    metadata:
      name: kaito-${var.kaito_ai_model}
      namespace: ${kubernetes_namespace.kaito_namespace.metadata.0.name}
      annotations:
        kaito.sh/enablelb: "False"
    resource:
      count: 1
      instanceType: "${var.kaito_instance_type_vm_size}"
      labelSelector:
        matchLabels:
          apps: ${var.kaito_ai_model}
    inference:
      preset:
        name: "${var.kaito_ai_model}"
    EOF

  depends_on = [
    azapi_update_resource.enable_kaito,
    helm_release.kaito_workspace,
    helm_release.gpu_provisioner
  ]
}

resource "azurerm_network_security_rule" "kaito_ai_model_inference_network_security_rule" {
  name                        = "rule-${var.kaito_aks_namespace}-${var.kaito_inference_port}"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = 80
  source_address_prefix       = "Internet"
  destination_address_prefix  = "*"
  resource_group_name         = var.resource_group_name
  network_security_group_name = var.network_security_group_name
}

resource "kubernetes_ingress_v1" "kaito_ai_model_inference_endpoint_ingress" {
  wait_for_load_balancer = true

  metadata {
    name      = "ingress-kaito-${var.kaito_ai_model}"
    namespace = kubernetes_namespace.kaito_namespace.metadata.0.name
    annotations = {
      "kubernetes.io/ingress.class" = "addon-http-application-routing"
    }
  }

  spec {
    rule {
      http {
        path {
          path      = "/chat"
          path_type = "Prefix"
          backend {
            service {
              name = "kaito-${var.kaito_ai_model}"
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}

Thank you in advance!!!

Fei-Guo commented 1 week ago

@rliberoff

You need to change the deployment object created by the kaito controller and update the pod template there, using kubectl against the live k8s cluster. None of the Terraform scripts need to change.

Note: this is just a hacky workaround. The code fix has been checked in and will be available in the next kaito release.

rliberoff commented 1 week ago

Hi @Fei-Guo,

Thank you for the answer.

I guess I will try to make Phi-3 Mini work and wait for the next release.

I really appreciate your help and patience with this, guys! Thank you 😀

rliberoff commented 1 week ago

By the way, I’m leaving the link to the repo here in case anyone needs it or finds it interesting.

🖥️ https://github.com/rliberoff/kaito-rag