exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

CUDA Error 2, out of memory #101

Open · fangxuezheng opened this issue 3 months ago

fangxuezheng commented 3 months ago

What is this out-of-memory error related to? My graphics card has 8 GB of VRAM, yet loading the model stops at around 8% and then terminates. Can you help analyze the cause? The graphics card information and error messages are in the screenshots below. Thank you.

(screenshots: graphics card information and error messages)
AlexCheema commented 3 months ago

Hey, thanks for the detailed issue.

Can you run with DEBUG=2 and send the logs here?
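
For illustration, one way to capture that from Python (a minimal sketch; `main.py` is an assumed entrypoint, substitute whatever command you normally use to start exo):

```python
import os
import subprocess

# Launch the exo node with tinygrad's DEBUG env var set to 2 so that
# device/allocation logs are printed, and redirect them to a file to attach here.
# NOTE: "main.py" is an assumed entrypoint -- adjust to your usual start command.
env = dict(os.environ, DEBUG="2")
with open("exo_debug.log", "w") as log:
    subprocess.run(["python", "main.py"], env=env, stdout=log, stderr=subprocess.STDOUT, check=True)
```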

AlexCheema commented 3 months ago

Can you make sure you have the latest NVIDIA drivers / CUDA toolkit installed too?

fangxuezheng commented 3 months ago

Now the model can be loaded again, but it seems to work only some of the time. Also, after the model loads I keep sending data and chatting in tinychat, but the responses are particularly slow and contain very few words. And why is the computed TFLOPS value always 0? I am running this under Windows WSL2 with Ubuntu 20.04.

(screenshots: tinychat output and TFLOPS readout)
fangxuezheng commented 3 months ago

> Can you make sure you have the latest NVIDIA drivers / CUDA toolkit installed too?

Aren't nvidia-smi and nvcc -V the commands used to check the NVIDIA driver / CUDA toolkit versions?
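
For reference, a small sketch that runs both checks (nvidia-smi reports the driver, which under WSL2 lives on the Windows host, and the highest CUDA version that driver supports; nvcc --version reports the CUDA toolkit installed inside the WSL distro, and the two can legitimately differ):

```python
import subprocess

# nvidia-smi: driver version + max supported CUDA version (driver comes from the Windows host under WSL2).
# nvcc --version: the CUDA toolkit version installed inside the WSL distro.
for cmd in (["nvidia-smi"], ["nvcc", "--version"]):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True)
        print(out.stdout or out.stderr)
    except FileNotFoundError:
        print(f"{cmd[0]} not found -- that component is missing from the PATH")
```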

AlexCheema commented 3 months ago

Good to know it at least works in WSL. OpenCL is always going to be quite slow. You'll want to configure your NVIDIA drivers so that your GPU is detected properly and the CUDA backend is used.
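
For anyone hitting the same thing, a quick way to see which backend tinygrad picked (a diagnostic sketch using tinygrad directly, the library exo's inference engine is built on; CUDA=1 is tinygrad's own backend-selection variable):

```python
import os
os.environ.setdefault("CUDA", "1")  # force tinygrad's CUDA backend instead of silently falling back

from tinygrad import Device, Tensor

# "CUDA" is the NVIDIA backend; "GPU" is tinygrad's OpenCL backend, which is what
# you typically end up with when the CUDA runtime isn't found inside WSL.
# With CUDA forced, a missing runtime fails loudly here rather than running slowly.
print("tinygrad default device:", Device.DEFAULT)
print((Tensor([1.0, 2.0, 3.0]) + 1).numpy())  # tiny op to confirm the backend actually works
```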

fangxuezheng commented 3 months ago

> Good to know it at least works in WSL. OpenCL is always going to be quite slow. You'll want to configure your NVIDIA drivers so that your GPU is detected properly and the CUDA backend is used.

In my earlier screenshots, both the NVIDIA driver and the CUDA toolkit in my WSL Ubuntu report version 12.5. Are these problems all related to the NVIDIA drivers / CUDA toolkit?

AlexCheema commented 3 months ago

I just bumped up the tinygrad version https://github.com/exo-explore/exo/commit/142682645f2c8b480e1105c1d8c2dc0a9b767815, since it was quite old. Can you try with the latest version?
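
In case it helps, a minimal update sketch (assuming exo was installed from a git checkout with `pip install -e .`; adjust if you installed it differently):

```python
import subprocess

# Pull the latest commit and reinstall so the bumped tinygrad pin is picked up.
for cmd in (["git", "pull"], ["pip", "install", "-e", "."]):
    subprocess.run(cmd, check=True)
```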

fangxuezheng commented 2 months ago

> I just bumped up the tinygrad version 1426826, since it was quite old. Can you try with the latest version?

These issues still persist. Thank you.

pickettd commented 3 weeks ago

I'm running into this issue on my WSL2 instance in Windows 10 also. I think it has to do with a limitation in WSL around using pinned system memory: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps

I'm assuming that tinygrad would need to implement a way to control whether pinned memory is used. It looks like the llama.cpp folks implemented something like that as a workaround: https://github.com/ggerganov/llama.cpp/issues/1230

I think this is the issue in WSL (which is marked as closed but doesn't seem to be fixed): https://github.com/microsoft/WSL/issues/8447
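
A quick diagnostic that may help narrow it down (a sketch only; pynvml is not an exo dependency, install it with `pip install nvidia-ml-py`): print free vs. total VRAM right before the load. If cuMemAlloc fails with error 2 while free VRAM still looks sufficient, that points at the WSL-specific limitation above rather than the card simply being full.

```python
import pynvml

# Report free/total VRAM on the first GPU so the failing allocation can be
# compared against what the card actually has available at that moment.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU0 VRAM free: {mem.free / 1e9:.2f} GB / total: {mem.total / 1e9:.2f} GB")
pynvml.nvmlShutdown()
```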

Shivp1413 commented 1 week ago

Is there a solution to this problem? Has anyone tried QEMU/KVM with GPU passthrough? It is a powerful way to run virtual machines with direct access to your GPU.

pickettd commented 1 week ago

> Is there a solution to this problem? Has anyone tried QEMU/KVM with GPU passthrough? It is a powerful way to run virtual machines with direct access to your GPU.

That is a good idea (trying a different VM/hypervisor than the WSL approach, since the issue seems to be in WSL). My plan for a workaround is to dual-boot into Ubuntu, but I haven't gotten around to it yet.

I think it would be reasonable to ask the tinygrad folks whether there is a config flag to not use pinned memory (since I think that is how llama.cpp got around the limitation), but I don't think they have a GitHub issue related to this yet.

pickettd commented 1 week ago

Wanted to post some updates here just in case other people are in the same situation.

FFAMax commented 5 hours ago

Hello, team. Has anybody found a solution to avoid CUDA Error 2, out of memory?

loaded weights in 4041.00 ms, 8.03 GB loaded at 1.99 GB/s
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=15, n_layers=32): CUDA Error 2, out of memory
Traceback (most recent call last):
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 152, in alloc
    try: return super().alloc(size, options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ffamax/exo/exo/orchestration/standard_node.py", line 239, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 76, in infer_tensor
    await self.ensure_shard(shard)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 97, in ensure_shard
    self.model = await asyncio.get_event_loop().run_in_executor(self.executor, build_transformer, model_path, shard, "8B" if "8b" in shard.model_id.lower() else "70B")
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 48, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in run
    bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in <listcomp>
    bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 77, in ensure_allocated
    def ensure_allocated(self) -> Buffer: return self.allocate() if not hasattr(self, '_buf') else self
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 86, in allocate
    self._buf = opaque if opaque is not None else self.allocator.alloc(self.nbytes, self.options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 155, in alloc
    return super().alloc(size, options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory
SendTensor tensor shard=Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=13, end_layer=21, n_layers=32) tensor=array([[[ 0.1719  ,  0.2925  , -0.5254  , ...,  0.508   ,  0.413   ,
         -0.2148  ],
        [ 0.1719  ,  0.2925  , -0.5254  , ...,  0.508   ,  0.413   ,
         -0.2147  ],
        [ 0.0528  ,  0.006165,  0.02719 , ...,  0.10626 ,  0.01511 ,
          0.00949 ],
        ...,
        [-0.004456,  0.09314 ,  0.00821 , ..., -0.04398 , -0.02438 ,
         -0.0692  ],
        [-0.02142 ,  0.0279  , -0.0904  , ..., -0.005966, -0.03247 ,
         -0.0575  ],
        [-0.0843  , -0.0978  , -0.00925 , ..., -0.01285 , -0.05417 ,
         -0.0532  ]]], dtype=float16) request_id='cda2e3d0-2409-4e39-938d-029d198e67de' result: None
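
As a rough sanity check on the numbers in that log (an estimate only, ignoring embeddings, norms, activations and the KV cache): half of an 8B-parameter model in fp16 is already about 8 GB of weights, so a 16-of-32-layer shard alone can exhaust an 8 GB card.

```python
# Back-of-the-envelope estimate for the shard in the log above (illustrative only).
params_total = 8.03e9           # Llama-3.1-8B parameter count (approx.)
bytes_per_param = 2             # fp16 weights
layers_in_shard, total_layers = 16, 32

shard_gb = params_total * bytes_per_param * (layers_in_shard / total_layers) / 1e9
print(f"~{shard_gb:.1f} GB of weights for layers 0-15")  # matches the "8.03 GB loaded" line
```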
FFAMax commented 4 hours ago

In my case the GPUs were not defined, so it was unable to proceed properly. Once the FLOPS were defined, it was able to split the model according to the available VRAM across all GPUs. Example: https://github.com/exo-explore/exo/pull/393/files
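
For anyone else whose GPU reports 0 TFLOPS: if I read it right, the linked PR adds missing FLOPS entries for GPUs exo didn't know about. A rough sketch of that kind of change is below (the GPU name and numbers are illustrative, not copied from the PR; check the PR diff for the real values):

```python
# Illustrative only -- see the linked PR for the actual change.
# exo keeps a CHIP_FLOPS mapping in exo/topology/device_capabilities.py; if a
# GPU model is missing from it, the node reports 0 TFLOPS and the model cannot
# be split across devices properly.
from exo.topology.device_capabilities import CHIP_FLOPS, DeviceFlops

CHIP_FLOPS["NVIDIA GEFORCE GTX 1070"] = DeviceFlops(  # hypothetical entry
    fp32=6.5e12,   # illustrative numbers, not measured
    fp16=6.5e12,
    int8=13.0e12,
)
```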