exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

Segmentation fault (core dumped) in tinygrad #180

Open HysenX-LI opened 2 weeks ago

HysenX-LI commented 2 weeks ago

I ran the project on Ubuntu 18.04 and got a Segmentation fault (core dumped) error. Running "DEBUG=9 python -X faulthandler main.py" produced the error message below. Is the llvmlite module required? It is not listed in setup.py, but tinygrad needs it. Is there any way to solve this problem? Thank you.

Fatal Python error: Segmentation fault

Current thread 0x00007f422dfeb740 (most recent call first):
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/runtime/ops_llvm.py", line 54 in
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/helpers.py", line 285 in cpu_time_execution
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/runtime/ops_llvm.py", line 54 in __call__
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 104 in __call__
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 173 in run
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 223 in run_schedule
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/tensor.py", line 204 in realize
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/tensor.py", line 3256 in _wrapper
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129 in load_state_dict
  File "/usr1/Project/exo/exo/inference/tinygrad/inference.py", line 51 in build_transformer
  File "/usr1/Project/exo/exo/inference/tinygrad/inference.py", line 96 in ensure_shard
  File "/usr1/Project/exo/exo/inference/tinygrad/inference.py", line 60 in infer_prompt
  File "/usr1/Project/exo/exo/orchestration/standard_node.py", line 140 in _process_prompt
  File "/usr1/Project/exo/exo/orchestration/standard_node.py", line 102 in process_prompt
  File "/usr1/Project/exo/exo/api/chatgpt_api.py", line 308 in handle_post_chat_completions
  File "/usr1/Project/exo/exo/api/chatgpt_api.py", line 253 in middleware
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/aiohttp/web_middlewares.py", line 114 in impl
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/aiohttp/web_app.py", line 537 in _handle
  File "/root/anaconda3/envs/EXO/lib/python3.12/site-packages/aiohttp/web_protocol.py", line 459 in _handle_request
  File "/root/anaconda3/envs/EXO/lib/python3.12/asyncio/events.py", line 88 in _run
  File "/root/anaconda3/envs/EXO/lib/python3.12/asyncio/base_events.py", line 1987 in _run_once
  File "/root/anaconda3/envs/EXO/lib/python3.12/asyncio/base_events.py", line 641 in run_forever
  File "/root/anaconda3/envs/EXO/lib/python3.12/asyncio/base_events.py", line 674 in run_until_complete
  File "/usr1/Project/exo/main.py", line 132 in

Here is part of the GDB backtrace:

#0 0x0000000000000000 in ?? ()
#1 0x00007fffea8ec043 in E_4194304_4 ()
#2 0x00007ffff6b90052 in ffi_call_unix64 () from /root/anaconda3/envs/exo/lib/python3.12/lib-dynload/../../libffi.so.8
#3 0x00007ffff6b8e925 in ffi_call_int () from /root/anaconda3/envs/exo/lib/python3.12/lib-dynload/../../libffi.so.8
#4 0x00007ffff6b8f06e in ffi_call () from /root/anaconda3/envs/exo/lib/python3.12/lib-dynload/../../libffi.so.8
#5 0x00007fffeb2db7b7 in _call_function_pointer (argtypecount=, argcount=2, resmem=0x7fffffffcfc0, restype=, atypes=, avalues=,
    pProc=0x7fffea8ec000 <E_4194304_4>, flags=<optimized out>) at /croot/python-split_1715024085344/work/build-static/stgdict.c:931
#6 _ctypes_callproc (pProc=0x7fffea8ec000 , argtuple=0x7fffe463bb80, flags=, argtypes=0x7fffe45f55c0, restype=, checker=0x0)
    at /croot/python-split_1715024085344/work/build-static/stgdict.c:1273
#7 0x00007fffeb2e595a in PyCFuncPtr_call () at :4167
#8 0x000000000055afc5 in _PyObject_Call.localalias () at /croot/python-split_1715024085344/_build_env/x86_64-conda-linux-gnu/sysroot/usr/include/bits/pycore_pyerrors.h:367
#9 0x0000000000529d7a in PyCFunction_Call (kwargs=,
    args=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, callable=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /croot/python-split_1715024085344/_build_env/x86_64-conda-linux-gnu/sysroot/usr/include/bits/pycore_pyerrors.h:387
#10 _PyEval_EvalFrameDefault () at /usr/local/src/conda/python-3.12.3/Programs/opcode_targets.h:3254
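
For anyone reproducing this: the `-X faulthandler` flag used above can also be switched on from inside the program with the standard-library `faulthandler` module, which is handy when the interpreter is started by a launcher script. A minimal sketch, where the helper name `enable_crash_tracebacks` is hypothetical and not part of exo or tinygrad:

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Enable faulthandler so fatal signals (e.g. SIGSEGV) dump Python tracebacks.

    `stream` must be a file object with a working fileno(); defaults to sys.stderr.
    Returns True if faulthandler is active afterwards.
    """
    if stream is None:
        stream = sys.stderr
    if not faulthandler.is_enabled():
        # all_threads=True dumps every thread, not only the one that crashed.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Typical use: call once near the top of the entry script, before importing
# libraries that run native code.
# enable_crash_tracebacks()
```

This only reports where the crash happened on the Python side; the native frames still need a debugger such as GDB, as in the backtrace above.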

AlexCheema commented 2 weeks ago

Does it work when you pip install llvmlite?

HysenX-LI commented 2 weeks ago

This error occurred after installing llvmlite. Without llvmlite, the project fails earlier with an error, because tinygrad needs to call the library.
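
To rule out an environment mix-up, it may help to check which versions of the two packages the interpreter in the active conda environment actually sees. A small sketch using only the standard library; the helper name `describe_package` is hypothetical:

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return the installed version of `name`, "unknown" if it is importable
    but has no distribution metadata, or None if it cannot be found."""
    if importlib.util.find_spec(name) is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"


for pkg in ("llvmlite", "tinygrad"):
    print(pkg, "->", describe_package(pkg) or "not installed")
```

Running it with the same `python` that launches main.py shows whether llvmlite is visible to that interpreter and which version is in use.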

HysenX-LI commented 2 weeks ago

I'm not sure whether tinygrad and llvmlite run correctly on Linux, because the problem seems to be related to unresolved issues in both projects: numba/llvmlite#1075 and tinygrad/tinygrad#1367.

HysenX-LI commented 2 weeks ago

The numba/llvmlite#1075 issue I mentioned above is the error I got when running the project after deleting the part of tinygrad's helpers.py that triggered the crash (the "cb()" call in the "cpu_time_execution" function). I think both problems are caused by tinygrad's lack of support for certain Linux instructions. Have you ever run the project on Linux, and if so, in what environment?
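
For context, the traceback suggests that `cpu_time_execution` is a timing wrapper around the compiled kernel call. A minimal sketch of such a wrapper, assuming it takes a callback and an enable flag; the name `timed_call` and its exact shape are hypothetical and not copied from tinygrad:

```python
import time


def timed_call(cb, enable):
    """Run cb() exactly once; return elapsed seconds if `enable`, else None."""
    start = time.perf_counter() if enable else None
    cb()  # the real work happens here; deleting this line skips it entirely
    if enable:
        return time.perf_counter() - start
    return None


# Stand-in for the compiled kernel: record that it ran.
calls = []
elapsed = timed_call(lambda: calls.append("kernel"), enable=True)
print(calls, elapsed is not None)
```

If the real helper has this shape, deleting `cb()` would skip the kernel rather than fix it, which would be consistent with a different error appearing later in the run.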