exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

Having trouble running on Ubuntu Linux with 4090 (CUDA 12.2) #155

Open ctisme opened 4 weeks ago

ctisme commented 4 weeks ago

Tried the tensorflow and torch backends as well as tinygrad; still getting this error with both Llama 3.1 8B and Llama 8B. Apparently this is an OpenCL compile error on the bfloat16 data type. Sorry, I am not a kernel programmer and I am unable to debug this further. I'm also not sure why it is using OpenCL when it should be using CUDA...

Traceback (most recent call last):
  File "/home/medpixels/Downloads/exo/exo/api/chatgpt_api.py", line 306, in handle_post_chat_completions
    await self.node.process_prompt(shard, prompt, image_str, request_id=request_id)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 102, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 140, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 60, in infer_prompt
    await self.ensure_shard(shard)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 96, in ensure_shard
    self.model = build_transformer(model_path, shard, model_size="8B" if "8b" in shard.model_id.lower() else "70B")
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 51, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 3263, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 204, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 221, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 214, in lower_schedule
    raise e
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 208, in lower_schedule
    try: yield lower_schedule_item(si)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 192, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 161, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 83, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 182, in compile_cached
    lib = self.compile(src)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/runtime/ops_gpu.py", line 27, in compile
    raise CompileError(f"OpenCL Compile Error\n\n{mstr.value.decode()}")
tinygrad.device.CompileError: OpenCL Compile Error

<kernel>:2:66: error: unknown type name '__bf16'
__kernel void E_131072_32_4(__global half* data0, const __global __bf16* data1) {
                                                                 ^
<kernel>:6:3: error: use of undeclared identifier '__bf16'
  __bf16 val0 = data1[alu0+1];
  ^
<kernel>:7:3: error: use of undeclared identifier '__bf16'
  __bf16 val1 = data1[alu0+2];
  ^
<kernel>:8:3: error: use of undeclared identifier '__bf16'
  __bf16 val2 = data1[alu0+3];
  ^
<kernel>:9:3: error: use of undeclared identifier '__bf16'
  __bf16 val3 = data1[alu0];
  ^
<kernel>:10:53: error: use of undeclared identifier 'val3'; did you mean 'all'?
  *((__global half4*)(data0+alu0)) = (half4)((half)(val3),(half)(val0),(half)(val1),(half)(val2));
                                                    ^~~~
                                                    all
cl_kernel.h:7285:22: note: 'all' declared here
int __OVERLOADABLE__ all(long16 in);
                     ^
<kernel>:10:52: error: pointer cannot be cast to type 'half'
  *((__global half4*)(data0+alu0)) = (half4)((half)(val3),(half)(val0),(half)(val1),(half)(val2));
                                                   ^~~~~~

Collecting topology max_depth=4 visited=set()
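For reference, tinygrad picks its compute backend from environment variables (GPU=1 selects the OpenCL backend in ops_gpu.py, CUDA=1 the CUDA one), so a quick way to confirm which device it actually resolved is something like this (a minimal sketch, assuming a recent tinygrad where Device is exported at the top level):

    from tinygrad import Device

    # Prints the backend tinygrad resolved to, e.g. "GPU" (OpenCL), "CUDA", or "NV".
    print(Device.DEFAULT)

Running with CUDA=1 python main.py should force the CUDA backend if tinygrad is silently falling back to OpenCL.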
ctisme commented 4 weeks ago

Sorry, it appears I forgot to install the CUDA toolkit. Now I have PyTorch and CUDA installed, and I get a similar error about conversions, but this time it is in compiler_cuda.py.

I tested torch with CUDA separately in a Python script in the same conda environment; it reports CUDA 12.2 and runs well on the GPU, but tinygrad still seems unable to run.

loaded weights in 1205.38 ms, 0.03 GB loaded at 0.03 GB/s
Traceback (most recent call last):
  File "/home/medpixels/Downloads/exo/exo/api/chatgpt_api.py", line 308, in handle_post_chat_completions
    await self.node.process_prompt(shard, prompt, image_str, request_id=request_id)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 102, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 140, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 60, in infer_prompt
    await self.ensure_shard(shard)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 96, in ensure_shard
    self.model = build_transformer(model_path, shard, model_size="8B" if "8b" in shard.model_id.lower() else "70B")
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 51, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 3263, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 204, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 221, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 214, in lower_schedule
    raise e
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 208, in lower_schedule
    try: yield lower_schedule_item(si)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 192, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 161, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 83, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 182, in compile_cached
    lib = self.compile(src)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/runtime/support/compiler_cuda.py", line 64, in compile
    def compile(self, src:str) -> bytes: return self._compile_program(src, nvrtc.nvrtcGetCUBIN, nvrtc.nvrtcGetCUBINSize)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/runtime/support/compiler_cuda.py", line 56, in _compile_program
    nvrtc_check(nvrtc.nvrtcCompileProgram(prog, len(self.compile_options), to_char_p_p([o.encode() for o in self.compile_options])), prog)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/runtime/support/compiler_cuda.py", line 16, in nvrtc_check
    raise CompileError(f"Nvrtc Error {status}, {ctypes.string_at(nvrtc.nvrtcGetErrorString(status)).decode()}\n{err_log}")
tinygrad.device.CompileError: Nvrtc Error 6, NVRTC_ERROR_COMPILATION

(17): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)" (declared at line 214 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(short)" (declared at line 227 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned short)" (declared at line 228 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(int)" (declared at line 229 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned int)" (declared at line 230 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(long long)" (declared at line 231 of /usr/include/cuda_fp16.hpp)
            function "__half::__half(unsigned long long)" (declared at line 232 of /usr/include/cuda_fp16.hpp)
  *((half4*)(data0+alu0)) = make_half4((half)(val3),(half)(val0),(half)(val1),(half)(val2));
                            ^

(the same error is reported three more times, once for each of the four half conversions on that line)

4 errors detected in the compilation of "".
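For comparison, a minimal version of the separate torch sanity check mentioned above might look like this (illustrative only):

    import torch

    # Confirms PyTorch sees the GPU and can actually run a kernel on it.
    print(torch.cuda.is_available(), torch.version.cuda, torch.cuda.get_device_name(0))
    print((torch.ones(4, device="cuda") * 2).sum().item())  # expect 8.0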
ctisme commented 4 weeks ago

UPDATE:

I set the environment variables as follows: DEBUG=9 SUPPORT_BF16=0 python main.py, and I get a different error:

Excluded model param keys for shard=Shard(model_id='TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R', start_layer=0, end_layer=31, n_layers=32): []
  0%|          | 0/292 [00:00<?, ?it/s]
loaded weights in 10431.11 ms, 0.07 GB loaded at 0.01 GB/s
Traceback (most recent call last):
  File "/home/medpixels/Downloads/exo/exo/api/chatgpt_api.py", line 308, in handle_post_chat_completions
    await self.node.process_prompt(shard, prompt, image_str, request_id=request_id)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 102, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
  File "/home/medpixels/Downloads/exo/exo/orchestration/standard_node.py", line 140, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 60, in infer_prompt
    await self.ensure_shard(shard)
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 96, in ensure_shard
    self.model = build_transformer(model_path, shard, model_size="8B" if "8b" in shard.model_id.lower() else "70B")
  File "/home/medpixels/Downloads/exo/exo/inference/tinygrad/inference.py", line 51, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 3263, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/tensor.py", line 204, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 223, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 173, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 139, in __call__
    self.copy(dest, src)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 134, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 113, in as_buffer
    return self.copyout(memoryview(bytearray(self.nbytes)))
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 124, in copyout
    self.allocator.copyout(mv, self._buf)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 646, in copyout
    self.device.synchronize()
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/device.py", line 508, in synchronize
    self.timeline_signal.wait(self.timeline_value - 1)
  File "/home/medpixels/anaconda3/envs/exo/lib/python3.12/site-packages/tinygrad/runtime/ops_nv.py", line 83, in wait
    raise RuntimeError(f"wait_result: {timeout} ms TIMEOUT!")
RuntimeError: wait_result: 10000 ms TIMEOUT! Collecting topology max_depth=4 visited=set()
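Note that this last trace goes through ops_nv.py, tinygrad's direct NVIDIA driver backend, rather than the NVRTC-based CUDA backend, so one thing worth trying is forcing the CUDA backend explicitly (a hypothetical workaround, not verified on this setup):

    import os

    # Must be set before tinygrad is imported; its env flags are read at import time.
    os.environ["CUDA"] = "1"

    from tinygrad import Device
    print(Device.DEFAULT)  # expect "CUDA" instead of "NV"

Equivalently, run CUDA=1 DEBUG=9 SUPPORT_BF16=0 python main.py on the command line.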

clover1980 commented 3 weeks ago

It's for Mac. On my Linux machine it kind of works: it shows the "Linux node" params in the terminal (presumably with a Mac as master), but unfortunately it is not actually working on Linux. No way I would waste money on Macs.

SJang1 commented 2 weeks ago

So, actually, same issue here.

debug error:

    raise CompileError(f"OpenCL Compile Error\n\n{mstr.value.decode()}")
tinygrad.device.CompileError: OpenCL Compile Error

<kernel>:2:66: error: unknown type name '__bf16'
__kernel void E_131072_32_4(__global half* data0, const __global __bf16* data1) {
                                                                 ^
<kernel>:6:3: error: use of undeclared identifier '__bf16'
  __bf16 val0 = data1[alu0+1];
  ^
<kernel>:7:3: error: use of undeclared identifier '__bf16'
  __bf16 val1 = data1[alu0+2];
  ^
<kernel>:8:3: error: use of undeclared identifier '__bf16'
  __bf16 val2 = data1[alu0+3];
  ^
<kernel>:9:3: error: use of undeclared identifier '__bf16'
  __bf16 val3 = data1[alu0];
  ^
<kernel>:10:53: error: use of undeclared identifier 'val3'; did you mean 'all'?
  *((__global half4*)(data0+alu0)) = (half4)((half)(val3),(half)(val0),(half)(val1),(half)(val2));
                                                    ^~~~
                                                    all
cl_kernel.h:7285:22: note: 'all' declared here
int __OVERLOADABLE__ all(long16 in);
                     ^
<kernel>:10:52: error: pointer cannot be cast to type 'half'
  *((__global half4*)(data0+alu0)) = (half4)((half)(val3),(half)(val0),(half)(val1),(half)(val2));
                                                   ^~~~~~
HysenX-LI commented 2 weeks ago

You can use the environment variable SUPPORT_BF16=0. Or, change the guard in the fix_bf16 function in llama.py to "if getenv("SUPPORT_BF16", 0):".
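For context, a rough sketch of what that guarded fix_bf16 might look like (assuming it resembles the weight-loading helper in tinygrad's llama example; the bit-level upcast below is illustrative, not exo's exact code):

    from tinygrad import Tensor, dtypes
    from tinygrad.helpers import getenv

    def fix_bf16(weights: dict[str, Tensor]) -> dict[str, Tensor]:
      # With the suggested change, bf16 kernels are only emitted when the user
      # opts in via SUPPORT_BF16=1; the default of 0 takes the fallback path.
      if getenv("SUPPORT_BF16", 0):
        return weights
      # Fallback: reinterpret each bf16 tensor's bits, widen them into the top
      # half of an fp32 (bf16 is the high 16 bits of fp32), then cast to fp16,
      # so the backend never has to compile a bf16 kernel.
      return {
        k: v.bitcast(dtypes.uint16).cast(dtypes.uint32).mul(1 << 16)
            .bitcast(dtypes.float32).cast(dtypes.float16)
           if v.dtype == dtypes.bfloat16 else v
        for k, v in weights.items()
      }

This avoids both the OpenCL __bf16 compile error and the nv_bfloat16-to-half ambiguity above, since the generated kernels only ever see uint and float types.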