Azure / kaito

Kubernetes AI Toolchain Operator
MIT License

Failed to pull image "curlimages/curl" when finetune with qlora method #609

Closed qfai closed 1 week ago

qfai commented 1 week ago

Describe the bug

The fine-tuning pod fails to pull the image for its init container:

Failed to pull image "curlimages/curl": failed to pull and unpack image "docker.io/curlimages/curl:latest": failed to resolve reference "docker.io/curlimages/curl:latest": failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://auth.docker.io/token?scope=repository%3Acurlimages%2Fcurl%3Apull&service=registry.docker.io: 401 Unauthorized

Steps To Reproduce

kubectl apply -f .\examples\fine-tuning\kaito_workspace_tuning_phi_3.yaml

  apiVersion: kaito.sh/v1alpha1
  kind: Workspace
  metadata:
    name: workspace-tuning-phi-3
  resource:
    instanceType: "Standard_NC12s_v3"
    labelSelector:
      matchLabels:
        app: tuning-phi-3
  tuning:
    preset:
      name: phi-3-mini-128k-instruct
    method: qlora
    input:
      urls:

Expected behavior

AKS should be able to pull the "curlimages/curl" image.

Logs

k describe pods workspace-tuning-phi-3-qcpqd

Name:             workspace-tuning-phi-3-qcpqd
Namespace:        default
Priority:         0
Service Account:  default
Node:             aks-gpu-38432077-vmss000000/10.224.0.5
Start Time:       Sun, 29 Sep 2024 15:59:54 +0800
Labels:           batch.kubernetes.io/controller-uid=3e777e07-a66c-475a-93c3-d8d88daea23c
                  batch.kubernetes.io/job-name=workspace-tuning-phi-3
                  controller-uid=3e777e07-a66c-475a-93c3-d8d88daea23c
                  job-name=workspace-tuning-phi-3
                  kaito.sh/workspace=workspace-tuning-phi-3
Annotations:
Status:           Pending
IP:               10.244.2.162
IPs:
  IP:             10.244.2.162
Controlled By:    Job/workspace-tuning-phi-3
Init Containers:
  data-downloader:
    Container ID:
    Image:         curlimages/curl
    Image ID:
    Port:
    Host Port:
    Command:
      sh
      -c

                    if [ -z "$DATA_URLS" ]; then
                      echo "No URLs provided in DATA_URLS."
                      exit 1
                    fi
                    for url in $DATA_URLS; do
                      filename=$(basename "$url" | sed 's/[?=&]/_/g')
                      echo "Downloading $url to $DATA_VOLUME_PATH/$filename"
                      retry_count=0
                      while [ $retry_count -lt 3 ]; do
                        http_status=$(curl -sSL -w "%{http_code}" -o "$DATA_VOLUME_PATH/$filename" "$url")
                        curl_exit_status=$?  # Save the exit status of curl immediately
                        if [ "$http_status" -eq 200 ] && [ -s "$DATA_VOLUME_PATH/$filename" ] && [ $curl_exit_status -eq 0 ]; then
                          echo "Successfully downloaded $url"
                          break
                        else
                          echo "Failed to download $url, HTTP status code: $http_status, retrying..."
                          retry_count=$((retry_count + 1))
                          rm -f "$DATA_VOLUME_PATH/$filename" # Remove incomplete file
                          sleep 2
                        fi
                      done
                      if [ $retry_count -eq 3 ]; then
                        echo "Failed to download $url after 3 attempts"
                        exit 1  # Exit with a non-zero status to indicate failure
                      fi
                    done
                    echo "All downloads completed successfully"

    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:
      DATA_URLS:         https://huggingface.co/datasets/philschmid/dolly-15k-oai-style/resolve/main/data/train-00000-of-00001-54e3756291ca09c6.parquet?download=true
      DATA_VOLUME_PATH:  /mnt/data
    Mounts:
      /mnt/data from data-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-66w8x (ro)

Containers:
  workspace-tuning-phi-3:
    Container ID:
    Image:         mcr.microsoft.com/aks/kaito/kaito-phi-3-mini-128k-instruct:0.0.2
    Image ID:
    Port:          5000/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
      python3 metrics_server.py & accelerate launch --num_processes=2 fine_tuning.py
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  2
    Requests:
      nvidia.com/gpu:  2
    Environment:
      DEFAULT_TARGET_MODULES:   k_proj,q_proj,v_proj,o_proj,gate_proj,down_proj,up_proj
      PYTORCH_CUDA_ALLOC_CONF:  expandable_segments:True
    Mounts:
      /mnt/config from config-volume (rw)
      /mnt/data from data-volume (rw)
      /mnt/results from results-volume (rw)
      /tmp/.docker/config from docker-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-66w8x (ro)
  docker-sidecar:
    Container ID:
    Image:         docker:dind
    Image ID:
    Port:
    Host Port:
    Command:
      /bin/sh
      -c
    Args:

  # Start the Docker daemon in the background with specific options for DinD
  dockerd &
  # Wait for the Docker daemon to be ready
  while ! docker info > /dev/null 2>&1; do
    echo "Waiting for Docker daemon to start..."
    sleep 1
  done
  echo 'Docker daemon started'

  PUSH_SUCCEEDED=false

  while true; do
    FILE_PATH=$(find /mnt/results -name 'fine_tuning_completed.txt')
    if [ ! -z "$FILE_PATH" ]; then
      if [ "$PUSH_SUCCEEDED" = false ]; then
        echo "FOUND TRAINING COMPLETED FILE at $FILE_PATH"

        PARENT_DIR=$(dirname "$FILE_PATH")
        echo "Parent directory is $PARENT_DIR"

        TEMP_CONTEXT=$(mktemp -d)
        cp "$PARENT_DIR/adapter_config.json" "$TEMP_CONTEXT/adapter_config.json"
        cp -r "$PARENT_DIR/adapter_model.safetensors" "$TEMP_CONTEXT/adapter_model.safetensors"

        # Create a minimal Dockerfile
        echo 'FROM busybox:latest
        RUN mkdir -p /data
        ADD adapter_config.json /data/
        ADD adapter_model.safetensors /data/' > "$TEMP_CONTEXT/Dockerfile"

      # Add symbolic link to read-only mounted config.json
        mkdir -p /root/.docker
      ln -s /tmp/.docker/config/config.json /root/.docker/config.json

        docker build -t kaitofinetune.azurecr.io/phifinetune:0.0.1 "$TEMP_CONTEXT"

        while true; do
          if docker push kaitofinetune.azurecr.io/phifinetune:0.0.1; then
            echo "Upload complete"
            # Cleanup: Remove the temporary directory
            rm -rf "$TEMP_CONTEXT"
            # Remove the file to prevent repeated builds
            rm "$FILE_PATH"
            PUSH_SUCCEEDED=true
            # Signal completion
            touch /tmp/upload_complete
            exit 0
          else
            echo "Push failed, retrying in 30 seconds..."
            sleep 30
          fi
        done
      fi
    fi
    sleep 10  # Check every 10 seconds
  done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/config from config-volume (rw)
      /mnt/data from data-volume (rw)
      /mnt/results from results-volume (rw)
      /tmp/.docker/config from docker-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-66w8x (ro)

Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      qlora-params-template
    Optional:  false
  results-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  data-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  docker-config:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          finetuneacr
    SecretOptionalName:
  kube-api-access-66w8x:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     gpu:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
                 sku=gpu:NoSchedule
Events:
  Type     Reason     Age                From               Message
  Normal   Scheduled  15m                default-scheduler  Successfully assigned default/workspace-tuning-phi-3-qcpqd to aks-gpu-38432077-vmss000000
  Normal   Pulling    13m (x4 over 15m)  kubelet            Pulling image "curlimages/curl"
  Warning  Failed     13m (x4 over 15m)  kubelet            Failed to pull image "curlimages/curl": failed to pull and unpack image "docker.io/curlimages/curl:latest": failed to resolve reference "docker.io/curlimages/curl:latest": failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://auth.docker.io/token?scope=repository%3Acurlimages%2Fcurl%3Apull&service=registry.docker.io: 401 Unauthorized
  Warning  Failed     13m (x4 over 15m)  kubelet            Error: ErrImagePull
  Warning  Failed     13m (x6 over 15m)  kubelet            Error: ImagePullBackOff
  Normal   BackOff    4s (x64 over 15m)  kubelet            Back-off pulling image "curlimages/curl"
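
For reference, the data-downloader init container never got as far as running its script, because the image pull failed first. Had it run, the filename step in that script would have turned the DATA_URLS entry into a local file name as sketched below. This is only an isolated copy of the `basename | sed` line from the logged script; the function name `url_to_filename` is made up for illustration and does not appear in kaito.

```shell
#!/bin/sh
# Hypothetical wrapper around the filename step of the logged script:
# take the last path segment of the URL, then replace '?', '=' and '&'
# with underscores so the query string becomes part of a safe file name.
url_to_filename() {
  basename "$1" | sed 's/[?=&]/_/g'
}

url_to_filename "https://huggingface.co/datasets/philschmid/dolly-15k-oai-style/resolve/main/data/train-00000-of-00001-54e3756291ca09c6.parquet?download=true"
# prints: train-00000-of-00001-54e3756291ca09c6.parquet_download_true
```

So the dataset would land at /mnt/data/train-00000-of-00001-54e3756291ca09c6.parquet_download_true once the image pull succeeds.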

Environment

An AKS cluster created through the Azure portal.

Additional context

qfai commented 1 week ago

I solved the problem by creating a Kubernetes secret:

  kubectl create secret generic dockerconfig --from-file=.dockerconfigjson=config.json

The content of config.json is:

  {
    "auths": {
      "https://index.docker.io/v1/": {
        "auth": "xxxxx"
      }
    }
  }

Here xxxxx is the base64-encoded username and password from my ACR token, generated with the following command:

  echo -n 'username:password' | base64
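
The two manual steps above (encoding the credentials, then pasting them into config.json) can be combined. The following is a minimal sketch, not something from the kaito repository: the function name `make_docker_config` is hypothetical, and `username` / `password` are the same placeholders used in the command above. It uses `printf` instead of `echo -n` because `echo -n` is not portable across all `sh` implementations, and strips newlines because some `base64` implementations wrap long output.

```shell
#!/bin/sh
# Hypothetical helper: print a Docker config.json body for one registry.
# Arguments: registry URL, username, password (placeholders here).
make_docker_config() {
  registry=$1
  user=$2
  password=$3
  # Encode "user:password" without a trailing newline, and remove any
  # line wrapping that base64 may add for long inputs.
  auth=$(printf '%s' "$user:$password" | base64 | tr -d '\n')
  printf '{ "auths": { "%s": { "auth": "%s" } } }\n' "$registry" "$auth"
}

make_docker_config "https://index.docker.io/v1/" "username" "password"
# prints: { "auths": { "https://index.docker.io/v1/": { "auth": "dXNlcm5hbWU6cGFzc3dvcmQ=" } } }
```

Redirecting the output to config.json gives the file that the `kubectl create secret` command above reads.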