Closed qfai closed 1 week ago
I solved the problem by creating a Kubernetes secret with kubectl:
kubectl create secret generic dockerconfig --from-file=.dockerconfigjson=config.json
The content of config.json is:
{ "auths": { "https://index.docker.io/v1/": { "auth": "xxxxx" } } }
Here xxxxx is the base64-encoded username and password from my ACR token, generated with the following command:
echo -n 'username:password' | base64
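For anyone who prefers to generate the file rather than hand-edit it, here is a minimal Python sketch of the same encoding step. It is not part of the original fix: the helper name `build_docker_config` and the placeholder credentials are hypothetical, and the registry key is the one from the config.json above.

```python
import base64
import json


def build_docker_config(username: str, password: str,
                        registry: str = "https://index.docker.io/v1/") -> str:
    """Return JSON text in the same shape as the config.json shown above.

    Hypothetical helper; it mirrors `echo -n 'username:password' | base64`.
    """
    # Encode "username:password" with no trailing newline (like `echo -n`).
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": token}}}, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; substitute the real token values.
    print(build_docker_config("username", "password"))
```

Writing the printed output to config.json should give a file usable with the kubectl command above.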
Describe the bug
The fine-tuning pod fails to pull its image: Failed to pull image "curlimages/curl": failed to pull and unpack image "docker.io/curlimages/curl:latest": failed to resolve reference "docker.io/curlimages/curl:latest": failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://auth.docker.io/token?scope=repository%3Acurlimages%2Fcurl%3Apull&service=registry.docker.io: 401 Unauthorized
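The percent-encoded query string in the failing token URL can be decoded to see exactly what was requested. A minimal Python sketch, using only the standard library; the helper name `token_request_params` is hypothetical and the URL is copied from the error above.

```python
from urllib.parse import parse_qs, urlsplit

# Token URL copied from the error message above.
TOKEN_URL = ("https://auth.docker.io/token"
             "?scope=repository%3Acurlimages%2Fcurl%3Apull"
             "&service=registry.docker.io")


def token_request_params(url: str) -> dict:
    """Return the decoded query parameters of a registry token URL."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also percent-decodes
    # Each parameter appears once here, so keep only the first value.
    return {key: values[0] for key, values in query.items()}


params = token_request_params(TOKEN_URL)
print(params["scope"])
print(params["service"])
```

The decoded scope is a pull-only request for the curlimages/curl repository against registry.docker.io.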
Steps To Reproduce
kubectl apply -f .\examples\fine-tuning\kaito_workspace_tuning_phi_3.yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-phi-3
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      app: tuning-phi-3
tuning:
  preset:
    name: phi-3-mini-128k-instruct
  method: qlora
  input:
    urls:
Expected behavior
I think AKS should be able to pull the "curlimages/curl" image.
Logs
k describe pods workspace-tuning-phi-3-qcpqd
Name: workspace-tuning-phi-3-qcpqd
Namespace: default
Priority: 0
Service Account: default
Node: aks-gpu-38432077-vmss000000/10.224.0.5
Start Time: Sun, 29 Sep 2024 15:59:54 +0800
Labels: batch.kubernetes.io/controller-uid=3e777e07-a66c-475a-93c3-d8d88daea23c
batch.kubernetes.io/job-name=workspace-tuning-phi-3
controller-uid=3e777e07-a66c-475a-93c3-d8d88daea23c
job-name=workspace-tuning-phi-3
kaito.sh/workspace=workspace-tuning-phi-3
Annotations:
Status: Pending
IP: 10.244.2.162
IPs:
IP: 10.244.2.162
Controlled By: Job/workspace-tuning-phi-3
Init Containers:
data-downloader:
Container ID:
Image: curlimages/curl
Image ID:
Port:
Host Port:
Command:
sh
-c
Containers:
workspace-tuning-phi-3:
Container ID:
Image: mcr.microsoft.com/aks/kaito/kaito-phi-3-mini-128k-instruct:0.0.2
Image ID:
Port: 5000/TCP
Host Port: 0/TCP
Command:
/bin/sh
-c
python3 metrics_server.py & accelerate launch --num_processes=2 fine_tuning.py
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
nvidia.com/gpu: 2
Requests:
nvidia.com/gpu: 2
Environment:
DEFAULT_TARGET_MODULES: k_proj,q_proj,v_proj,o_proj,gate_proj,down_proj,up_proj
PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
Mounts:
/mnt/config from config-volume (rw)
/mnt/data from data-volume (rw)
/mnt/results from results-volume (rw)
/tmp/.docker/config from docker-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-66w8x (ro)
docker-sidecar:
Container ID:
Image: docker:dind
Image ID:
Port:
Host Port:
Command:
/bin/sh
-c
Args:
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: qlora-params-template
Optional: false
results-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:
data-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:
docker-config:
Type: Projected (a volume that contains injected data from multiple sources)
SecretName: finetuneacr
SecretOptionalName:
kube-api-access-66w8x:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: gpu:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
sku=gpu:NoSchedule
Events:
Type Reason Age From Message
Normal Scheduled 15m default-scheduler Successfully assigned default/workspace-tuning-phi-3-qcpqd to aks-gpu-38432077-vmss000000
Normal Pulling 13m (x4 over 15m) kubelet Pulling image "curlimages/curl"
Warning Failed 13m (x4 over 15m) kubelet Failed to pull image "curlimages/curl": failed to pull and unpack image "docker.io/curlimages/curl:latest": failed to resolve reference "docker.io/curlimages/curl:latest": failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://auth.docker.io/token?scope=repository%3Acurlimages%2Fcurl%3Apull&service=registry.docker.io: 401 Unauthorized
Warning Failed 13m (x4 over 15m) kubelet Error: ErrImagePull
Warning Failed 13m (x6 over 15m) kubelet Error: ImagePullBackOff
Normal BackOff 4s (x64 over 15m) kubelet Back-off pulling image "curlimages/curl"
Environment
I used a portal-created AKS cluster.
Kubernetes version (kubectl version):
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.8
OS (cat /etc/os-release): Windows 11
Additional context