coreweave / kubernetes-cloud

Getting Started with the CoreWeave Kubernetes GPU Cloud

Dataset permission errors from the tokenizer in finetune-workflow #129

Open parallelo opened 1 year ago

parallelo commented 1 year ago

Hi! I'm working on reproducing your Argo workflow for fine-tuning GPT-J.

I'm able to create a PVC, download the dataset into it, and submit the Argo workflow:

kubectl apply -f finetune-pvc.yaml
kubectl apply -f finetune-download-dataset.yaml
kubectl apply -f inference-role.yaml
argo submit finetune-workflow.yaml \
        -p run_name=example-gpt-j-6b \
        -p dataset=dataset \
        -p reorder=random \
        -p run_inference=true \
        -p inference_only=false \
        -p model=EleutherAI/gpt-j-6B \
        --serviceaccount inference
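
For reference, the claim ends up named finetune-data (that's the claimName that shows up in the pod specs below). A minimal sketch of what a manifest like finetune-pvc.yaml typically contains; the storage class and size here are placeholders, not necessarily the repo's actual values:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: finetune-data
spec:
  accessModes:
    - ReadWriteMany          # multiple workflow pods mount the same volume
  storageClassName: shared-hdd-ord1   # placeholder storage class
  resources:
    requests:
      storage: 500Gi         # placeholder size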

However, the tokenizer step of the workflow hits a filesystem permission error whenever it tries to read the dataset from the PVC:

2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/tokenizer.json... 1.4 MB completed.
2023/01/19 04:30:29 Resolving /finetune-data/models/EleutherAI/gpt-j-6B/config.json...
2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/config.json... 930 B completed.
2023/01/19 04:30:29 open /finetune-data/dataset/: permission denied
time="2023-01-19T04:30:30.343Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
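
(A quick way to check whether this is plain POSIX ownership on the volume, rather than anything Kubernetes-side, is to exec into any running pod that mounts the same PVC and inspect the directory. Pod name and mount path below are placeholders:)

# any running pod that mounts the finetune-data PVC; adjust the mount path to match
kubectl exec -it <pod-with-pvc> -- ls -la /finetune-data
# UID/GID the container runs as (if the image ships `id`)
kubectl exec -it <pod-with-pvc> -- id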

I tried updating the Role / RoleBinding to grant access to the PVC, but the error persists:

$ git diff inference-role.yaml
diff --git a/finetuner-workflow/inference-role.yaml b/finetuner-workflow/inference-role.yaml
index 7d99bd1..3d50526 100644
--- a/finetuner-workflow/inference-role.yaml
+++ b/finetuner-workflow/inference-role.yaml
@@ -21,6 +21,9 @@ rules:
       - revisions
     verbs:
       - '*'
+  - apiGroups: [""]
+    resources: ["persistentvolumeclaims"]
+    verbs: ["get", "watch", "list"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
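
(Thinking about it more: RBAC verbs like these only gate calls to the Kubernetes API, so they can't affect an open() against an already-mounted volume, and this rule probably isn't the fix. If the problem turns out to be POSIX ownership on the volume, a pod-level securityContext is the usual knob. The UID/GID below are guesses, and whether fsGroup is honored at all depends on the storage driver:)

securityContext:
  runAsUser: 1000   # placeholder: match the UID that owns the files on the PVC
  fsGroup: 1000     # placeholder: group applied to the mount, if the driver supports it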

The events for the relevant tokenizer pods show no warnings or errors about attaching the PVC.

Still troubleshooting... I must be missing some further permission somewhere. Please let me know if you have suggestions in the meantime. Thanks in advance!

parallelo commented 1 year ago

Still digging... seems like the mountPaths are goofed up in the Argo Workflow?

filebrowser pod (WORKS CORRECTLY):

    volumeMounts:
    - mountPath: /data/finetune-data
      name: finetune-data
  ...
  volumes:
  - name: finetune-data
    persistentVolumeClaim:
      claimName: finetune-data

finetune-model-tokenizer pod (READ PERMISSION ERROR):

    volumeMounts:
    - mountPath: /finetune-data
      name: finetune-data
  ...
  volumes:
  - name: finetune-data
    persistentVolumeClaim:
      claimName: finetune-data
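
(Both snippets above are from the live pod specs; for a quick side-by-side of just the mounts, something like this works, with placeholder pod names:)

kubectl get pod <filebrowser-pod> -o jsonpath='{.spec.containers[*].volumeMounts}'
kubectl get pod <tokenizer-pod> -o jsonpath='{.spec.containers[*].volumeMounts}'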

Edit: Previously referenced mainctrfs, but that was just the wait container. Now just looking into the mountPath values shown above: /data/finetune-data in the filebrowser pod vs. /finetune-data in the tokenizer pod.