NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

Converting GPT-NeoX weights to tensor-parallelism=4 throws Error #316

Closed rtalaricw closed 1 year ago

rtalaricw commented 1 year ago

Description

branch: main

Reproduced Steps

To reproduce this issue:

          pip3 install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html;
          pip3 install --extra-index-url https://pypi.ngc.nvidia.com regex fire tritonclient[all];
          pip3 install --upgrade jax jaxlib pyyaml dataclasses pathlib tdqm typing;
          git clone https://github.com/NVIDIA/FasterTransformer.git; 
          wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models; 
          wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models;
          wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P EleutherAI;
          mkdir /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1 -p; 
          python3 /mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py \
          /mnt/pvc/EleutherAI/ \
          /mnt/pvc/gpt-neox/triton-model-store/fastertransformer/1 \
          --tensor-parallelism 4

Error:

Converting from 2 to 4 GPUs
Strategy: group 1 source gpu(s) into 2 out gpu(s).

/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py:147: RuntimeWarning: divide by zero encountered in remainder
  if gather_tensor.shape[axis] % out_range != 0:
[the RuntimeWarning above is repeated 37 more times]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 147, in handle_layer
    if gather_tensor.shape[axis] % out_range != 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 305, in <module>
    convert_checkpoint(args)
  File "/mnt/pvc/gpt-neox/FasterTransformer/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py", line 282, in convert_checkpoint
    pool.starmap(handle_layer, handle_layer_args)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
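
The two messages in the log fit together if `out_range` on line 147 is a multi-element NumPy array that contains a zero (for example, a range of output rank indices starting at 0). That is an inference from the warning and the exception text, not something confirmed by the converter source. Under that assumption, the sketch below reproduces both symptoms with stand-in values, then shows one possible way to write the divisibility check against the number of output shards. The tensor shape, `check_like_traceback`, and `split_evenly` are hypothetical names used only for illustration, and this is not necessarily the change made in the fix branch.

```python
import warnings

import numpy as np

# Stand-in for one source shard; the shape is made up for illustration.
gather_tensor = np.zeros((8, 6), dtype=np.float32)
axis = 0

# Assumption: out_range holds the output rank indices for this source shard,
# e.g. ranks 0 and 1 when one source GPU is split into two output GPUs.
out_range = np.arange(0, 2)


def check_like_traceback(tensor, axis, out_range):
    """Mirror the condition shown on line 147 of the traceback."""
    # int % ndarray is element-wise: 8 % [0, 1] warns about the zero divisor
    # and yields an array, which `if` cannot reduce to a single bool.
    if tensor.shape[axis] % out_range != 0:
        return False
    return True


failure = None
with warnings.catch_warnings():
    # Silence "divide by zero encountered in remainder" for the demo.
    warnings.simplefilter("ignore", RuntimeWarning)
    try:
        check_like_traceback(gather_tensor, axis, out_range)
    except ValueError as exc:
        failure = str(exc)

print("Reproduced:", failure)


def split_evenly(tensor, axis, out_range):
    """Hypothetical alternative: compare against the shard count, a plain int."""
    num_out = len(out_range)
    if tensor.shape[axis] % num_out != 0:
        raise ValueError(
            f"axis {axis} of size {tensor.shape[axis]} is not divisible by {num_out}"
        )
    return np.split(tensor, num_out, axis=axis)


shards = split_evenly(gather_tensor, axis, out_range)
print([s.shape for s in shards])
```

Using `len(out_range)` keeps the modulo on plain integers, so the condition evaluates to a single boolean and no zero divisor is involved.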

byshiue commented 1 year ago

Thank you for the feedback. We have tried to fix this issue in the branch https://github.com/NVIDIA/FasterTransformer/tree/fix/gptneox_convert. Please try again on that branch.

byshiue commented 1 year ago

Closing this issue because it has been inactive. Feel free to reopen it if you still have any problems.