KoboldAI / KoboldAI-Client

For GGUF support, see KoboldCPP: https://github.com/LostRuins/koboldcpp
https://koboldai.com
GNU Affero General Public License v3.0

Colab's Premium GPUs unsupported #211

Open minipasila opened 1 year ago

minipasila commented 1 year ago

You cannot run any of the models on Colab's Premium GPUs such as the NVIDIA A100; launching gives this error:

```
Launching KoboldAI with the following options : python3 aiserver.py --model KoboldAI/OPT-6B-nerys-v2 --localtunnel --colab
2023-01-12 21:46:23.469478: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Colab Check: True
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:146: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
```
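In other words, the preinstalled wheel was compiled without sm_80 kernels, so the A100 has nothing it can execute. A minimal sketch of how such a mismatch could be detected up front, using only public `torch.cuda` APIs (the exact arch list will vary by build):

```python
import torch

if torch.cuda.is_available():
    # Compute capability of the attached GPU, e.g. (8, 0) for an A100.
    major, minor = torch.cuda.get_device_capability(0)
    device_arch = f"sm_{major}{minor}"

    # Architectures this build ships binary kernels for, e.g.
    # ['sm_37', 'sm_50', 'sm_60', 'sm_70'] on the wheel above. A build
    # that also ships PTX (listed as 'compute_XX') could still JIT
    # kernels for newer GPUs, so this check is approximate.
    compiled = torch.cuda.get_arch_list()

    if device_arch not in compiled:
        print(f"{torch.cuda.get_device_name(0)} is {device_arch}, but this "
              f"build only supports {compiled}; CUDA ops will fail with "
              "'no kernel image is available for execution on the device'.")
```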

And when trying to generate something:

```
ERROR      | __main__:generate:6333 - Traceback (most recent call last):
  File "aiserver.py", line 6320, in generate
    genout, already_generated = tpool.execute(core_generate, txt, minimum, maximum, found_entries)
  File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "aiserver.py", line 5508, in core_generate
    result = raw_generate(
  File "aiserver.py", line 5736, in raw_generate
    batch_encoded = torch_raw_generate(
  File "aiserver.py", line 5832, in torch_raw_generate
    genout = generator(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 1324, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation_utils.py", line 556, in _prepare_attention_mask_for_generation
    is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 782, in __contains__
    return (element == self).any().item()  # type: ignore[union-attr]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
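This RuntimeError is the same sm_80 mismatch surfacing at the first CUDA kernel launch (setting `CUDA_LAUNCH_BLOCKING=1`, as the message suggests, would only make the traceback point at the failing op, not fix it). One possible workaround, assuming KoboldAI's pinned dependencies tolerate a newer torch, is to install a wheel built against CUDA 11.x, which does include sm_80 kernels, in a Colab cell before launching. The version pin below is illustrative, not what the notebook actually ships:

```python
# Run in a Colab cell before starting aiserver.py; the cu116 wheels on
# PyTorch's index include binary kernels for Ampere GPUs like the A100.
# torch==1.13.1+cu116 is an example pin, not a tested requirement.
!pip install --upgrade torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
```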

The reason I wanted to try the premium GPUs is that they have better availability and more VRAM, so you can actually use the full 2048 tokens of context (and they should be able to run 13B models as well).

henk717 commented 1 year ago

Don't do this; use the TPU instead. It only consumes 2 compute credits per hour.

minipasila commented 1 year ago

I would if they were actually available for once.

FunkEngine2023 commented 1 year ago

> I would if they were actually available for once.

Chuck $9.99 into a separate Google account that's only for TPU fiddling. $0.20 an hour isn't a terrible rate to pay; it's about what the power these machines consume when running full-ham actually costs Google in electricity bills.