coreweave / kubernetes-cloud

Getting Started with the CoreWeave Kubernetes GPU Cloud
http://www.coreweave.com

GPT-NeoX-20B Triton Tutorial Not Functional #148

Closed zaventh closed 1 year ago

zaventh commented 1 year ago

This is the same issue already discussed in #98, but it remains unresolved despite that issue being closed without an obvious resolution.

When following the guide at https://docs.coreweave.com/compass/examples/triton-inference-server-fastertransformer#deploying-the-kubernetes-resources, the following error results:


2023-02-17 06:35:10 (30.2 MB/s) - 'EleutherAI/global_step150000/mp_rank_06_model_states.pt' saved [16291/16291]

--2023-02-17 06:35:10--  https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/global_step150000/mp_rank_07_model_states.pt
Reusing existing connection to the-eye.eu:443.
HTTP request sent, awaiting response... 200 OK
Length: 16291 (16K) [application/octet-stream]
Saving to: 'EleutherAI/global_step150000/mp_rank_07_model_states.pt'

     0K .......... .....                                      100% 28.9M=0.001s

2023-02-17 06:35:10 (28.9 MB/s) - 'EleutherAI/global_step150000/mp_rank_07_model_states.pt' saved [16291/16291]

FINISHED --2023-02-17 06:35:10--
Total wall clock time: 26m 30s
Downloaded: 108 files, 38G in 26m 4s (25.1 MB/s)
Converting from 2 to 1 GPUs
Strategy: group 2 source gpu(s) into 1 out gpu(s).

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 163, in handle_layer
    tensor.tofile(save_dir / ("model." + output_name + ".bin"))
OSError: problem writing element 73388032 to file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 305, in <module>
    convert_checkpoint(args)
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 282, in convert_checkpoint
    pool.starmap(handle_layer, handle_layer_args)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
OSError: problem writing element 73388032 to file
mv: target '/mnt/pvc/triton-model-store/fastertransformer/1/' is not a directory
/bin/bash: line 245: echo: write error: Disk quota exceeded

While running the conversion, the disks look like this:

root@gpt-neox-download-kbs8c:/mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1# df -h
Filesystem                                                                                                                                                                                     Size  Used Avail Use% Mounted on
overlay                                                                                                                                                                                        954G  115G  839G  13% /
tmpfs                                                                                                                                                                                           64M     0   64M   0% /dev
tmpfs                                                                                                                                                                                          189G     0  189G   0% /sys/fs/cgroup
/dev/nvme0n1                                                                                                                                                                                   954G  115G  839G  13% /etc/hosts
10.134.60.106:6789,10.134.60.107:6789,10.134.60.111:6789,10.134.60.110:6789,10.134.60.104:6789:/volumes/csi/csi-vol-879f3c54-ad98-11ed-8316-ceb1770f3736/9c854224-0777-46b4-94a8-c8dbf166e22e  100G   39G   62G  39% /mnt/pvc
shm                                                                                                                                                                                             64M   24K   64M   1% /dev/shm
tmpfs                                                                                                                                                                                          189G   12K  189G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                                                                                                          189G     0  189G   0% /proc/acpi
tmpfs                                                                                                                                                                                          189G     0  189G   0% /proc/scsi
tmpfs  

It appears that during the conversion, disk usage approaches roughly 115GB, not under 100GB:

10.134.60.106:6789,10.134.60.107:6789,10.134.60.111:6789,10.134.60.110:6789,10.134.60.104:6789:/volumes/csi/csi-vol-879f3c54-ad98-11ed-8316-ceb1770f3736/9c854224-0777-46b4-94a8-c8dbf166e22e  200G  115G   86G  58% /mnt/pvc
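For reference, the PVC can be expanded in place rather than recreated; roughly like this, assuming the claim is named model-storage as in model-storage-pvc.yml (the name is an assumption):

kubectl patch pvc model-storage --type merge -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl get pvc model-storage    # confirm the new capacity is reflected before re-running the job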

After increasing the PVC to 200GB, I receive a subsequent error:

--2023-02-17 07:29:56--  https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/global_step150000/mp_rank_07_model_states.pt
Reusing existing connection to the-eye.eu:443.
HTTP request sent, awaiting response... 200 OK
Length: 16291 (16K) [application/octet-stream]
Saving to: 'EleutherAI/global_step150000/mp_rank_07_model_states.pt'

     0K .......... .....                                      100% 14.4M=0.001s

2023-02-17 07:29:56 (14.4 MB/s) - 'EleutherAI/global_step150000/mp_rank_07_model_states.pt' saved [16291/16291]

FINISHED --2023-02-17 07:29:56--
Total wall clock time: 24m 26s
Downloaded: 108 files, 38G in 23m 59s (27.2 MB/s)
Converting from 2 to 1 GPUs
Strategy: group 2 source gpu(s) into 1 out gpu(s).

[INFO] Spend 0:21:10.420464 (h:m:s) to convert the model
mv: target '/mnt/pvc/triton-model-store/fastertransformer/1/' is not a directory

This is despite the following line being present in the example:

https://github.com/coreweave/kubernetes-cloud/blob/master/online-inference/fastertransformer/download-weights-job-gpt-neox.yml#L31

mkdir /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1 -p

However, this line:

https://github.com/coreweave/kubernetes-cloud/blob/master/online-inference/fastertransformer/download-weights-job-gpt-neox.yml#L36 appears to have a typo in it:

/mnt/pvc/triton-model-store/fastertransformer/1/

Notice the absence of gpt-neox in the path.

To summarize, the NeoX job appears to require approximately 115GB of storage, not 100GB, and a typo in the command path prevents the script from completing successfully. The following is a diff of the changes I made to get past this step:

diff --git a/online-inference/fastertransformer/model-storage-pvc.yml b/online-inference/fastertransformer/model-storage-pvc.yml
index 1a5a837..4d5d669 100644
--- a/online-inference/fastertransformer/model-storage-pvc.yml
+++ b/online-inference/fastertransformer/model-storage-pvc.yml
@@ -9,5 +9,5 @@ spec:
     - ReadWriteMany
   resources:
     requests:
-      storage: 100Gi
+      storage: 150Gi
diff --git a/online-inference/fastertransformer/download-weights-job-gpt-neox.yml b/online-inference/fastertransformer/download-weights-job-gpt-neox.yml
index 6b9b926..af48cb2 100644
--- a/online-inference/fastertransformer/download-weights-job-gpt-neox.yml
+++ b/online-inference/fastertransformer/download-weights-job-gpt-neox.yml
@@ -7,7 +7,8 @@ spec:
     spec:
       containers:
       - name: gpt-neox-model-downloader
-        image: nvcr.io/nvidia/tritonserver:22.05-py3
+        # not required; update to the latest triton
+        image: nvcr.io/nvidia/tritonserver:23.01-py3
         imagePullPolicy: IfNotPresent
         command:
         - /bin/sh
@@ -27,13 +28,13 @@ spec:
           git clone https://github.com/NVIDIA/FasterTransformer.git; 
           wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models; 
           wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models;
-          wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P EleutherAI;
+          wget -c --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P EleutherAI;
           mkdir /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1 -p; 
           python3 /mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py \
           /mnt/pvc/gpt-neox/EleutherAI/ \
           /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1 \
           --tensor-parallelism 1
-          mv /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1/1-gpu/* /mnt/pvc/triton-model-store/fastertransformer/1/;
+          mv /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1/1-gpu/* /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1/;
           touch /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/config.pbtxt
           echo '# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
           # Redistribution and use in source and binary forms, with or without

Finally, the stage to start the inference server,

kubectl apply -f ft-inference-service-neox.yml

also does not complete successfully as-is; the output reports only:

$ kubectl logs -f -l serving.kubeflow.org/inferenceservice=fastertransformer-triton-neox -c kfserving-container
I0217 08:48:14.821303 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1b3c000000' with size 268435456
I0217 08:48:14.821670 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0217 08:48:15.605419 1 model_repository_manager.cc:1077] loading: fastertransformer:1
I0217 08:48:16.066695 1 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I0217 08:48:16.066718 1 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.9
I0217 08:48:16.066722 1 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.9
I0217 08:48:16.066748 1 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0217 08:48:16.067782 1 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I0217 08:48:16.067796 1 libfastertransformer.cc:248] Sequence Batching: disabled
I0217 08:48:16.261397 1 libfastertransformer.cc:420] Before Loading Weights:

$ kubectl get isvc
NAME                            URL   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
fastertransformer-triton-neox         False                                                                 23m

I am still debugging this stage, however; it may be more appropriate as a separate issue, since it could be due to upgrading Triton from 22 to 23.
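For now I am inspecting it with standard kubectl/KServe commands, roughly:

kubectl describe isvc fastertransformer-triton-neox
kubectl get pods -l serving.kubeflow.org/inferenceservice=fastertransformer-triton-neox
kubectl describe pod <predictor-pod-name>    # check for OOMKilled, image pull, or scheduling events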

I am happy to PR the above changes if you are accepting contributions.

rtalaricw commented 1 year ago

Hi, can you PR the fix with Triton 22? You can check the logs of the Triton Server to ensure that you are not running out of memory. It also takes some time to load the model on the Inference Server the first time.
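For example, something along these lines (the pod name is a placeholder; the container name kfserving-container is taken from the logs command above) will show GPU memory usage while the model loads:

kubectl exec -it <predictor-pod-name> -c kfserving-container -- nvidia-smi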

rtalaricw commented 1 year ago

@zaventh Please go ahead with the PR for Triton 22 (please do not update the image)