coreweave / kubernetes-cloud

Triton NeoX example is not working #98

Closed: chitalian closed this issue 1 year ago

chitalian commented 1 year ago

When going through this tutorial: https://docs.coreweave.com/compass/examples/triton-inference-server-fastertransformer

At this step

kubectl apply -f download-weights-job-gpt-neox.yml

I am running into this error:

--2022-11-08 03:52:57--  https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/
Resolving mystic.the-eye.eu (mystic.the-eye.eu)... 62.6.154.15
Connecting to mystic.the-eye.eu (mystic.the-eye.eu)|62.6.154.15|:443... failed: Connection refused.
Traceback (most recent call last):
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 305, in <module>
    convert_checkpoint(args)
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 172, in convert_checkpoint
    with open(base_dir / "latest") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/pvc/gpt-neox/EleutherAI/latest'
mv: cannot stat '/mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1/1-gpu/*': No such file or directory

salanki commented 1 year ago

Seems like this mirror is down, and it is currently the official mirror of the NeoX 20B weights. I guess we should host our own, given that we trained the model and all.

salanki commented 1 year ago

@chitalian you can change the address from https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ to https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ in the download job script and it should work.
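
For reference, the swap can be scripted rather than edited by hand. A minimal sketch, assuming you run it from the directory containing the manifest:

# Point the download job at the working mirror.
sed -i 's|mystic\.the-eye\.eu|the-eye.eu|g' download-weights-job-gpt-neox.yml

# Job pod templates are immutable, so remove the old Job before re-applying.
kubectl delete -f download-weights-job-gpt-neox.yml --ignore-not-found
kubectl apply -f download-weights-job-gpt-neox.yml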

chitalian commented 1 year ago

> @chitalian you can change the address from https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ to https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ in the download job script and it should work.

That seemed to do the trick! Thanks for the quick response.

Want me to make a PR to fix this?

salanki commented 1 year ago

Please!!

chitalian commented 1 year ago

@salanki Really strange actually...

I got a lot further, but I left the logs open while I was retrying and got this:

FINISHED --2022-11-08 19:40:25--
Total wall clock time: 19m 31s
Downloaded: 108 files, 38G in 19m 3s (34.3 MB/s)
Converting from 2 to 1 GPUs
Strategy: group 2 source gpu(s) into 1 out gpu(s).

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 163, in handle_layer
    tensor.tofile(save_dir / ("model." + output_name + ".bin"))
OSError: problem writing element 131747840 to file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 305, in <module>
    convert_checkpoint(args)
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 282, in convert_checkpoint
    pool.starmap(handle_layer, handle_layer_args)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
OSError: problem writing element 131747840 to file
mv: target '/mnt/pvc/triton-model-store/fastertransformer/1/' is not a directory
/bin/bash: line 245: echo: write error: Disk quota exceeded
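
The "Disk quota exceeded" at the end suggests the PVC filled up during the conversion step. A quick way to confirm actual usage from inside the cluster (the pod name here is a placeholder):

kubectl exec -it <download-job-pod> -- df -h /mnt/pvc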

Also, on my dashboard the storage instance looks empty 🤔

(screenshot: the storage volume shown in the dashboard at 0% usage)

salanki commented 1 year ago

Can you try increasing the volume? It shows as 0% because it is not currently mounted; the text below the 0% is trying to explain that.
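
For anyone following along, a minimal sketch of growing the claim in place, assuming the storage class supports volume expansion (the claim name and size below are placeholders):

# Request more capacity on the PersistentVolumeClaim backing /mnt/pvc.
kubectl patch pvc model-storage -p '{"spec":{"resources":{"requests":{"storage":"250Gi"}}}}'

# Watch until the new size is reflected on the claim.
kubectl get pvc model-storage -w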

chitalian commented 1 year ago

Fixed with #100.

zaventh commented 1 year ago

This is still an issue. Will open a new issue.

salanki commented 1 year ago

Please do; add as much detail as you can.