Open fangxuezheng opened 3 months ago
Hey, thanks for the detailed issue.
Can you run with DEBUG=2 and send the logs here?
Can you make sure you have the latest NVIDIA drivers / CUDA toolkit installed too?
Now the model can be loaded again, but it seems to work only some of the time. Also, after the model loads, when I keep sending data and chatting in tinychat, the responses are very slow and only a few words long. And why is the computed TFLOPS value always 0? I am using Ubuntu 20.04 under WSL2 on Windows.
Aren't these the commands used to check the NVIDIA driver/CUDA toolkit versions: nvidia-smi and nvcc -V?
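For anyone who wants to collect both versions in one go, here is a minimal sketch that shells out to the two tools mentioned above. The helper name `run_tool` is made up for illustration. Note that nvidia-smi reports the driver version (and the highest CUDA version that driver supports), while nvcc reports the installed toolkit version, so the two numbers can legitimately differ.

```python
import shutil
import subprocess


def run_tool(cmd):
    """Return the tool's stdout as text, or None if it is missing or fails.

    `run_tool` is a hypothetical helper, not part of exo or tinygrad.
    """
    if shutil.which(cmd[0]) is None:
        return None  # tool is not on PATH
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.stdout.strip() if done.returncode == 0 else None


if __name__ == "__main__":
    # Driver version and total memory per GPU, as seen by the driver.
    print("driver:", run_tool(
        ["nvidia-smi", "--query-gpu=driver_version,memory.total",
         "--format=csv,noheader"]))
    # Toolkit (compiler) version; None means nvcc is not installed or not on PATH.
    print("toolkit:", run_tool(["nvcc", "--version"]))
```

If the second line prints None inside WSL while the first works, the driver is visible but the CUDA toolkit is probably not installed in the Linux environment.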
Good to know it at least works in WSL. OPENCL is always going to be quite slow. You’ll want to configure your NVIDIA drivers so that your GPU is detected properly and uses the CUDA backend
In my previous screenshots, both the NVIDIA driver and the CUDA toolkit in my WSL Ubuntu are version 12.5. Are these problems all related to the NVIDIA drivers/CUDA toolkit?
I just bumped up the tinygrad version https://github.com/exo-explore/exo/commit/142682645f2c8b480e1105c1d8c2dc0a9b767815, since it was quite old. Can you try with the latest version?
These problems still occur with the latest version, thank you.
I'm running into this issue on my WSL2 instance in Windows 10 also. I think it has to do with a limitation in WSL around using pinned system memory: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps
I'm assuming that Tinygrad would need to implement a way to control if pinned memory is used or not. Looks like the llama.cpp folks implemented something like that as a workaround: https://github.com/ggerganov/llama.cpp/issues/1230
I think this is the issue in WSL (which is marked as closed but doesn't seem to be fixed): https://github.com/microsoft/WSL/issues/8447
Is there a solution to this problem? Has anyone tried QEMU/KVM with GPU passthrough? It is a powerful way to run virtual machines with direct access to your GPU.
That is a good idea (trying a different VM/Hypervisor than the WSL approach since the issue seems to be in WSL). My plan for a workaround is to dual-boot to Ubuntu but I haven't gotten around to it yet.
I think it could be a reasonable idea to ask the Tinygrad folks if there is a config flag to not use pinned memory (since I think that is the way llama.cpp got around the limitation) - but I don't think they have a GitHub issue related to this yet
Wanted to post some updates here just in case other people are in the same situation.
Hello, team. Has anybody found a solution to avoid CUDA Error 2, out of memory?
loaded weights in 4041.00 ms, 8.03 GB loaded at 1.99 GB/s
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=15, n_layers=32): CUDA Error 2, out of memory
Traceback (most recent call last):
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 152, in alloc
try: return super().alloc(size, options)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
return self._alloc(size, options if options is not None else BufferOptions())
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}") # noqa: E501
RuntimeError: CUDA Error 2, out of memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ffamax/exo/exo/orchestration/standard_node.py", line 239, in _process_tensor
result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 76, in infer_tensor
await self.ensure_shard(shard)
File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 97, in ensure_shard
self.model = await asyncio.get_event_loop().run_in_executor(self.executor, build_transformer, model_path, shard, "8B" if "8b" in shard.model_id.lower() else "70B")
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 48, in build_transformer
load_state_dict(model, weights, strict=False, consume=False) # consume=True
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
else: v.replace(state_dict[k].to(v.device)).realize()
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
ret = fn(*args, **kwargs)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 213, in realize
run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
ei.run(var_vals, do_update_stats=do_update_stats)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in run
bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in <listcomp>
bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 77, in ensure_allocated
def ensure_allocated(self) -> Buffer: return self.allocate() if not hasattr(self, '_buf') else self
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 86, in allocate
self._buf = opaque if opaque is not None else self.allocator.alloc(self.nbytes, self.options)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 155, in alloc
return super().alloc(size, options)
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
return self._alloc(size, options if options is not None else BufferOptions())
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}") # noqa: E501
RuntimeError: CUDA Error 2, out of memory
SendTensor tensor shard=Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=13, end_layer=21, n_layers=32) tensor=array([[[ 0.1719 , 0.2925 , -0.5254 , ..., 0.508 , 0.413 ,
-0.2148 ],
[ 0.1719 , 0.2925 , -0.5254 , ..., 0.508 , 0.413 ,
-0.2147 ],
[ 0.0528 , 0.006165, 0.02719 , ..., 0.10626 , 0.01511 ,
0.00949 ],
...,
[-0.004456, 0.09314 , 0.00821 , ..., -0.04398 , -0.02438 ,
-0.0692 ],
[-0.02142 , 0.0279 , -0.0904 , ..., -0.005966, -0.03247 ,
-0.0575 ],
[-0.0843 , -0.0978 , -0.00925 , ..., -0.01285 , -0.05417 ,
-0.0532 ]]], dtype=float16) request_id='cda2e3d0-2409-4e39-938d-029d198e67de' result: None
In my case the GPUs were not defined, so it was unable to proceed properly. Once the FLOPs were defined, it was able to split the model according to the available VRAM on all GPUs. Example: https://github.com/exo-explore/exo/pull/393/files
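To illustrate what "split according to available VRAM" means, here is a rough sketch of a proportional layer split. This is not exo's actual partitioning code; `split_layers` and its arguments are hypothetical, and the ranges are half-open `[start, end)`, unlike the inclusive `start_layer`/`end_layer` shown in the log above.

```python
def split_layers(n_layers, vram_gb):
    """Assign contiguous layer ranges in proportion to each device's VRAM.

    Hypothetical helper for illustration only; returns a list of
    half-open (start, end) layer ranges, one per device.
    """
    total = sum(vram_gb)
    if n_layers <= 0 or total <= 0:
        raise ValueError("need at least one layer and some VRAM")
    shards = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(vram_gb):
        cumulative += mem
        # The last device always ends at n_layers so no layer is dropped.
        if i == len(vram_gb) - 1:
            end = n_layers
        else:
            end = round(n_layers * cumulative / total)
        shards.append((start, end))
        start = end
    return shards


if __name__ == "__main__":
    # Two equal 8 GB cards share a 32-layer model evenly.
    print(split_layers(32, [8, 8]))
    # An 8 GB card next to a 24 GB card gets a quarter of the layers.
    print(split_layers(32, [8, 24]))
```

A split like this only balances layer counts; whether each shard actually fits still depends on the weight size per layer plus whatever the runtime needs for activations and caches.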
What is this out-of-memory error related to? My graphics card has 8 GB of video memory, so is it impossible to load the model? Loading reaches 8% and then terminates. Can you help analyze the reason? The graphics card information and error messages are below. Thank you.