elixir-nx / nx_iree

Elixir and Nx bindings for the IREE runtime and compiler

Fail to install nx_iree to M3 Max #3

Open zacky1972 opened 1 month ago

zacky1972 commented 1 month ago

Hi,

I'm very interested in nx_iree.

I'm trying to run it on an M3 Max, but the build fails with an error.

Steps to reproduce:

  1. brew install ninja python@3.12 cmake
  2. Install and set up Erlang 27.0 with asdf
  3. Install and set up Elixir 1.17.2 with asdf
  4. Run iex with Mix.install([{:nx_iree, "~> 0.1", git: "https://github.com/elixir-nx/nx_iree"}])
  5. Then I got the following error:
Mix.install([{:nx_iree, "~> 0.1", git: "https://github.com/elixir-nx/nx_iree"}])
==> nx_iree
cmake -G Ninja -B /Users/zacky/Library/Caches/mix/installs/elixir-1.17.2-erts-15.0/136ad778f756bdbc609247d027fb6d5d/deps/nx_iree/iree-runtime/iree-build \
        -DCMAKE_BUILD_TYPE=Release\
        -DIREE_BUILD_COMPILER=OFF\
        -DIREE_RUNTIME_BUILD_DIR=/Users/zacky/Library/Caches/mix/installs/elixir-1.17.2-erts-15.0/136ad778f756bdbc609247d027fb6d5d/deps/nx_iree/iree-runtime/build\
        -DIREE_RUNTIME_INCLUDE_PATH=/Users/zacky/.cache/nx_iree/iree-candidate-20240604.914/runtime/src/iree\
        -DIREE_DIR=/Users/zacky/.cache/nx_iree/iree-candidate-20240604.914 \
        -S cmake
CMake Deprecation Warning at /Users/zacky/.cache/nx_iree/iree-candidate-20240604.914/CMakeLists.txt:14 (cmake_policy):
  The OLD behavior for policy CMP0116 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- Could not find nvcc, please set CUDAToolkit_ROOT.
-- IREE HAL drivers:
--   - hip
--   - local-sync
--   - local-task
--   - metal
-- IREE HAL local executable library loaders:
--   - embedded-elf
-- IREE HAL local executable plugin mechanisms:
--   - embedded-elf
--   - system-library
The git submodule 'third_party/spirv_cross' is not initialized. Please run `git submodule update --init`
CMake Error at /Users/zacky/.cache/nx_iree/iree-candidate-20240604.914/CMakeLists.txt:714 (message):
  check_submodule_init.py failed, see the logs above

-- Configuring incomplete, errors occurred!
make: *** [/Users/zacky/Library/Caches/mix/installs/elixir-1.17.2-erts-15.0/136ad778f756bdbc609247d027fb6d5d/deps/nx_iree/iree-runtime/host/install] Error 1
could not compile dependency :nx_iree, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile nx_iree --force", update it with "mix deps.update nx_iree" or clean it with "mix deps.clean nx_iree"
** (Mix.Error) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. Try running the
commands "gcc --version" and / or "make --version". If these programs
are not installed, you will be prompted to install them.

    (mix 1.17.2) lib/mix.ex:588: Mix.raise/2
    (elixir_make 0.8.4) lib/elixir_make/compiler.ex:53: ElixirMake.Compiler.compile/1
    (mix 1.17.2) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5
    (mix 1.17.2) lib/mix/tasks/compile.all.ex:108: Mix.Tasks.Compile.All.run_compiler/2
    (mix 1.17.2) lib/mix/tasks/compile.all.ex:88: Mix.Tasks.Compile.All.compile/4
    (mix 1.17.2) lib/mix/tasks/compile.all.ex:62: Mix.Tasks.Compile.All.run/1
    (mix 1.17.2) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5

polvalente commented 1 month ago

Do you by any chance have an nvcc alias in your environment?

zacky1972 commented 1 month ago

$ which nvcc
nvcc not found

polvalente commented 1 month ago

This is weird. You could try removing the /Users/zacky/.cache/nx_iree/ directory to force a fresh clone of IREE. Maybe something went wrong during the clone, and subsequent runs are picking up an incomplete checkout of the repo.
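
A minimal sketch of that reset, assuming the cache lives at the path shown in the build log above (`~/.cache/nx_iree`). The function name `clear_nx_iree_cache` is hypothetical and not part of nx_iree; it only removes the directory so that the next build has to fetch IREE again.

```shell
# Hypothetical helper (not part of nx_iree): remove a cached IREE checkout
# so that the next build has to fetch it again.
clear_nx_iree_cache() {
  # The default path is taken from the build log in this thread.
  cache_dir="${1:-$HOME/.cache/nx_iree}"
  if [ -d "$cache_dir" ]; then
    rm -rf -- "$cache_dir"
    echo "removed $cache_dir"
  else
    echo "nothing to remove at $cache_dir"
  fi
}
```

Calling `clear_nx_iree_cache` with no argument targets the default path. Afterwards, re-run Mix.install, or `mix deps.compile nx_iree --force` as the error message suggests.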

polvalente commented 1 month ago

The main branch will now download precompiled artifacts from the 0.0.1-pre.2 release. You might be able to try things out using it :)

zacky1972 commented 1 month ago

I tried it as follows, but I got a segmentation fault:

  1. brew install ninja python@3.12 cmake
  2. Install and set up Erlang 27.0 with asdf
  3. Install and set up Elixir 1.17.2 with asdf
  4. Run iex with Mix.install([{:nx_iree, "~> 0.0.1-pre.2", git: "https://github.com/elixir-nx/nx_iree"}])
  5. Then I got the following log and a segmentation fault:
iex(1)> Mix.install([{:nx_iree, "~> 0.0.1-pre.2", git: "https://github.com/elixir-nx/nx_iree"}])
* Getting nx_iree (https://github.com/elixir-nx/nx_iree)
remote: Enumerating objects: 358, done.        
remote: Counting objects: 100% (126/126), done.        
remote: Compressing objects: 100% (79/79), done.        
remote: Total 358 (delta 68), reused 75 (delta 38), pack-reused 232        
origin/HEAD set to main
Resolving Hex dependencies...
Resolution completed in 0.026s
New:
  complex 0.5.0
  elixir_make 0.8.4
  nx 0.7.3
  telemetry 1.2.1
* Getting elixir_make (Hex package)
* Getting nx (Hex package)
* Getting complex (Hex package)
* Getting telemetry (Hex package)
===> Analyzing applications...
===> Compiling telemetry
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 35 files (.ex)
Generated nx app
==> elixir_make
Compiling 8 files (.ex)
Generated elixir_make app
Downloading NxIREE NIFs from https://github.com/elixir-nx/nx_iree/releases/download/v0.0.1-pre.2/libnx_iree-macos-aarch64-nif-2.17.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                                                                                                Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  711k  100  711k    0     0  1003k      0 --:--:-- --:--:-- --:--:-- 1003k
                                                                              ==> nx_iree
Compiling 7 files (.ex)
Generated nx_iree app
zsh: segmentation fault  iex

polvalente commented 1 month ago

Could you run it with lldb?

iex(1)> System.pid()

This gives you the host OS PID, which you can pass to lldb from a different shell: lldb --attach-pid <pid>. Then you can continue the process in lldb and run Mix.install normally.

When you get the segfault, lldb should report the error; bt will then print the backtrace, which will tell us where the segfault is happening.
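
The attach step can be wrapped in a small shell function. This is a sketch under assumptions: `attach_lldb` is a hypothetical name, and it relies on lldb's batch mode (`--batch`, `-o`, `-k`) to resume the process after attaching and print a backtrace if the target crashes. It has not been checked against this particular segfault.

```shell
# Hypothetical helper: attach lldb to a running BEAM by OS PID
# (the value printed by System.pid() in iex).
attach_lldb() {
  pid="$1"
  # Reject an empty or non-numeric argument.
  case "$pid" in
    ''|*[!0-9]*) echo "usage: attach_lldb <pid>" >&2; return 2 ;;
  esac
  # -o runs after attaching (resume the VM); -k runs only if the target crashes.
  lldb --attach-pid "$pid" --batch -o "continue" -k "bt"
}
```

Run `attach_lldb <pid>` in a second terminal, then call Mix.install in the original iex session.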

zacky1972 commented 1 month ago

I got the following backtrace:

Target 0: (beam.smp) stopped.
(lldb) bt
* thread #17, name = 'erts_sched_13', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000120605ae8 libnx_iree_runtime.so`iree_event_pool_free + 24
    frame #1: 0x0000000120601e10 libnx_iree_runtime.so`iree_task_executor_destroy + 204
    frame #2: 0x0000000120601d10 libnx_iree_runtime.so`iree_task_executor_create + 908
    frame #3: 0x0000000120601450 libnx_iree_runtime.so`iree_task_executors_create_from_flags + 528
    frame #4: 0x00000001205fd504 libnx_iree_runtime.so`iree_hal_local_task_driver_factory_try_create + 172
    frame #5: 0x0000000120621e40 libnx_iree_runtime.so`iree_hal_driver_registry_try_create + 240
    frame #6: 0x0000000120621efc libnx_iree_runtime.so`iree_hal_create_device + 104
    frame #7: 0x00000001205ecbfc libnx_iree_runtime.so`list_devices(iree_hal_driver_registry_t*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::vector<iree::runtime::Device*, std::__1::allocator<iree::runtime::Device*>>&) + 1196
    frame #8: 0x0000000104b3cf68 libnx_iree.so`list_devices(enif_environment_t*, int, unsigned long const*) + 304
    frame #9: 0x0000000104380688 beam.smp`beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*) + 100
    frame #10: 0x0000000106ca4a7c

pklonowski commented 1 month ago

Same segfault here (also on an M3 Max). I tried Elixir 1.17.2-otp-27 + Erlang 27.0.1, Elixir 1.17.2-otp-26 + Erlang 26.2.5, and Elixir 1.16.3-otp-26 + Erlang 26.2.5.

jaman commented 1 month ago

I get the same segmentation fault in iex as well. One thing I'd note is that it works in Livebook:

Interactive Elixir (1.17.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Mix.install(
...(1)>   [
...(1)>     {:exla, "~> 0.7.3"},
...(1)>     {:nx_iree, github: "elixir-nx/nx_iree"},
...(1)>   ]
...(1)> )
zsh: segmentation fault  iex

livebook:

drivers: {:ok,
 %{
   "local-sync" => "Local execution using a lightweight inline synchronous queue",
   "local-task" => "Local execution using the IREE multithreading task system",
   "metal" => "Apple Metal"
 }}
{:ok, ["local-sync://default", "local-sync://"]}
{:ok,
 [
   #Nx.Tensor<
     f32[4]
     NxIREE.Tensor(local-sync://default)
     [1.3817732334136963, -1.257617712020874, -0.1485215425491333, -1.4951145648956299]
   >
 ]}
#Nx.Tensor<
  f32[4]
  NxIREE.Tensor(local-sync://default)
  [1.3817732334136963, -1.257617712020874, -0.1485215425491333, -1.4951145648956299]
>

polvalente commented 3 weeks ago

If anyone who can reproduce this could ping me on the EEF Slack, we can try to debug it together. This is most likely related to an initialization race condition or something similar.

I unfortunately can't reproduce this locally.

jaman commented 3 weeks ago

Stack trace for me:

* thread #5, name = 'erts_sched_1', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000128e55ae8 libnx_iree_runtime.so`iree_event_pool_free + 24
    frame #1: 0x0000000128e51e10 libnx_iree_runtime.so`iree_task_executor_destroy + 204
    frame #2: 0x0000000128e51d10 libnx_iree_runtime.so`iree_task_executor_create + 908
    frame #3: 0x0000000128e51450 libnx_iree_runtime.so`iree_task_executors_create_from_flags + 528
    frame #4: 0x0000000128e4d504 libnx_iree_runtime.so`iree_hal_local_task_driver_factory_try_create + 172
    frame #5: 0x0000000128e71e40 libnx_iree_runtime.so`iree_hal_driver_registry_try_create + 240
    frame #6: 0x0000000128e71efc libnx_iree_runtime.so`iree_hal_create_device + 104
    frame #7: 0x0000000128e3cbfc libnx_iree_runtime.so`list_devices(iree_hal_driver_registry_t*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::vector<iree::runtime::Device*, std::__1::allocator<iree::runtime::Device*>>&) + 1196
    frame #8: 0x000000010b9acf68 libnx_iree.so`list_devices(enif_environment_t*, int, unsigned long const*) + 304
    frame #9: 0x000000010501fc30 beam.smp`beam_jit_call_nif(c_p=0x000000010bf09940, I=<unavailable>, reg=0x000000016b06ee40, fp=(libnx_iree.so`list_devices(enif_environment_t*, int, unsigned long const*)), NifMod=<unavailable>) at beam_jit_common.cpp:643:26 [opt]
    frame #10: 0x000000010795ca7c

zacky1972 commented 3 weeks ago

@polvalente I see. I hypothesize that this issue may occur only on M3 Max. I'll test on another Apple Silicon machine.

zacky1972 commented 3 weeks ago

I tested the script Mix.install([{:nx_iree, "~> 0.0.1-pre.2", git: "http://github.com/elixir-nx/nx_iree"}], system_env: [NX_IREE_PREFER_PRECOMPILED: "false"]) on both M2 and M3 Max.

My hypothesis was wrong. On M2 Max, a segmentation fault occurred, but on M3 Max the script succeeded.

zacky1972 commented 3 weeks ago

On M2, the script caused no segmentation fault, even on the first run.

zacky1972 commented 2 weeks ago

I've tested the new version, but I got the following error:

01:30:29.317 [error] Bad input fd in erts_poll()! fd=0, resource={prim_tty,tty}

:ok

01:30:29.318 [notice] Application nx_iree exited: exited in: NxIREE.Application.start(:normal, [])
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, ~c"Failed to execute IREE runtime due to error: iree-candidate-20240818.989/runtime/src/iree/base/internal/wait_handle_posix.c:54: RESOURCE_EXHAUSTED; failed to create pipe (24); creating driver for device 'local-task://'"}
            (nx_iree 0.0.1-pre.4) lib/nx_iree/device.ex:15: anonymous fn/2 in NxIREE.Device.init/0
            (elixir 1.17.2) lib/enum.ex:4353: Enum.flat_map_list/2
            (elixir 1.17.2) lib/enum.ex:4354: Enum.flat_map_list/2
            (nx_iree 0.0.1-pre.4) lib/nx_iree/device.ex:14: NxIREE.Device.init/0
            (nx_iree 0.0.1-pre.4) lib/nx_iree/application.ex:9: NxIREE.Application.start/2
            (kernel 10.0.1) application_master.erl:295: :application_master.start_it_old/4
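
The `(24)` in `failed to create pipe (24)` looks like an errno value; 24 is `EMFILE` (too many open files) on both macOS and Linux. If that reading is right, the process hit its limit on open file descriptors while creating the `local-task://` driver. Below is a sketch for inspecting and raising that limit in the shell that launches iex. `raise_fd_limit` is a hypothetical helper, and 4096 is an arbitrary example value rather than a recommendation from the maintainers; whether a higher limit resolves the error is untested here.

```shell
# Hypothetical helper: raise the soft limit on open file descriptors
# for the current shell, if it is below the requested target.
raise_fd_limit() {
  target="$1"
  current=$(ulimit -Sn)
  # "unlimited" or an already-high limit needs no change.
  if [ "$current" = "unlimited" ] || [ "$current" -ge "$target" ]; then
    return 0
  fi
  # Returns non-zero if the target exceeds the hard limit.
  ulimit -Sn "$target"
}

# Show the limits that a process started from this shell would inherit.
echo "soft limit: $(ulimit -Sn), hard limit: $(ulimit -Hn)"
```

For example, `raise_fd_limit 4096 && iex` starts iex with the raised limit, because the function changes the limit of the current shell and child processes inherit it.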