Dataset permission errors from the tokenizer in finetune-workflow

Hi! I'm working on reproducing your Argo workflow for fine-tuning GPT-J.

I'm able to create a PVC, download the dataset into it, and submit the argo workflow.

kubectl apply -f finetune-pvc.yaml
kubectl apply -f finetune-download-dataset.yaml
kubectl apply -f inference-role.yaml
argo submit finetune-workflow.yaml \
        -p run_name=example-gpt-j-6b \
        -p dataset=dataset \
        -p reorder=random \
        -p run_inference=true \
        -p inference_only=false\
        -p model=EleutherAI/gpt-j-6B \
        --serviceaccount inference

However, whenever I try to read the dataset in the tokenizer step of the workflow, it hits a filesystem access error for the PVC:

2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/tokenizer.json... 1.4 MB completed.
2023/01/19 04:30:29 Resolving /finetune-data/models/EleutherAI/gpt-j-6B/config.json...
2023/01/19 04:30:29 Downloaded /finetune-data/models/EleutherAI/gpt-j-6B/config.json... 930 B completed.
2023/01/19 04:30:29 open /finetune-data/dataset/: permission denied
time="2023-01-19T04:30:30.343Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1

I tried updating the roles / role bindings for accessing the PVC but that still has issues:

$ git diff inference-role.yaml
diff --git a/finetuner-workflow/inference-role.yaml b/finetuner-workflow/inference-role.yaml
index 7d99bd1..3d50526 100644
--- a/finetuner-workflow/inference-role.yaml
+++ b/finetuner-workflow/inference-role.yaml
@@ -21,6 +21,9 @@ rules:
       - revisions
     verbs:
       - '*'
+  - apiGroups: [""]
+    resources: ["persistentvolumeclaims"]
+    verbs: ["get", "watch", "list"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding

The events listed for the relevant tokenizer pods do not show any warnings/errors for attaching to the PVC.

Still troubleshooting... must be missing some further permissions somewhere. Please let me know if you have suggestions in the meantime. Thanks in advance!

coreweave / kubernetes-cloud

Dataset permission errors from the tokenizer in finetune-workflow #129