elixir-nx / nx

Multi-dimensional arrays (tensors) and numerical definitions for Elixir

RuntimeError: Symbol main.27 not found. #1532

thiagopromano closed this issue 1 month ago

thiagopromano commented 2 months ago

Hey,

We occasionally get this error when calling a Defn in production. Retrying immediately always solves the problem.

We are currently running exla 0.7.3 on the host platform. We have not yet updated to 0.8.0 as we depend on Bumblebee which doesn't support it yet.

I could not reproduce it locally because it happens randomly. We saw this error 37 times out of an estimated ~100k executions (roughly 0.04%).

The same error also presents itself with the symbols main.29 and main.5.

Here is the stack trace:

RuntimeError: Symbol main.27 not found.
  File "lib/exla/mlir/module.ex", line 127, in EXLA.MLIR.Module.unwrap!/1
  File "lib/exla/mlir/module.ex", line 113, in EXLA.MLIR.Module.compile/5
  File "timer.erl", line 590, in :timer.tc/2
  File "lib/exla/defn.ex", line 599, in anonymous fn/12 in EXLA.Defn.compile/8
  File "lib/exla/mlir/context_pool.ex", line 10, in anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
  File "lib/nimble_pool.ex", line 462, in NimblePool.checkout!/4
  File "lib/exla/defn/locked_cache.ex", line 36, in EXLA.Defn.LockedCache.run/2
  File "timer.erl", line 590, in :timer.tc/2
  File "lib/exla/defn.ex", line 555, in EXLA.Defn.compile/8
  File "lib/exla/defn.ex", line 400, in EXLA.Defn.__compile__/4
  File "lib/exla/defn.ex", line 385, in EXLA.Defn.__jit__/5
  File "lib/nx/defn.ex", line 452, in Nx.Defn.do_jit_apply/3
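Since an immediate retry reportedly always succeeds, a stopgap could be to wrap the call in a small retry helper. The following is only a sketch under that assumption: `RetryOnMissingSymbol` and its `run/2` function are hypothetical names, not part of Nx or EXLA, and the regex simply matches the messages quoted above (`main.27`, `main.29`, `main.5`). It works around the symptom and does not address the root cause.

```elixir
defmodule RetryOnMissingSymbol do
  @moduledoc false

  # Hypothetical helper, not part of Nx or EXLA.
  # Calls `fun` and retries up to `retries` more times when it raises a
  # RuntimeError whose message looks like "Symbol main.27 not found.".
  # Any other exception is re-raised with its original stacktrace.
  def run(fun, retries \\ 2) when is_function(fun, 0) and retries >= 0 do
    fun.()
  rescue
    e in RuntimeError ->
      if retries > 0 and Regex.match?(~r/Symbol main\.\d+ not found/, e.message) do
        run(fun, retries - 1)
      else
        reraise e, __STACKTRACE__
      end
  end
end
```

Usage would look like `RetryOnMissingSymbol.run(fn -> MyModel.predict(input) end)`, where `MyModel.predict/1` stands in for whatever Defn-backed function is failing.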

josevalim commented 1 month ago

Please let us know if you have better results either in 0.8 or 0.9.

thiagopromano commented 1 month ago

We were able to stop receiving this error by decreasing our application's peak memory usage. It's likely that the error was caused by something important being evicted during periods of high memory usage.

In our situation, we achieved this by rewriting a costly algorithm into a NIF written in Rust with several optimizations, such as using sparse matrices.

If anyone encounters this issue, I suggest checking the memory usage of your application.
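As a starting point for checking memory usage, the sketch below samples the BEAM's own memory counter around a call. `MemoryProbe` is a hypothetical module name. One caveat: `:erlang.memory/1` only reports memory the Erlang VM itself accounts for, so allocations made natively by EXLA or by NIFs will most likely not appear here and need OS-level tools (for example the process RSS) instead.

```elixir
defmodule MemoryProbe do
  @moduledoc false

  # Hypothetical helper for illustration only.
  # Total memory currently allocated by the Erlang VM, in MiB.
  def beam_total_mib do
    :erlang.memory(:total) / (1024 * 1024)
  end

  # Runs `fun` and returns {result, mib_before, mib_after} so the two
  # samples can be logged or compared. Native allocations are not included.
  def around(fun) when is_function(fun, 0) do
    mib_before = beam_total_mib()
    result = fun.()
    {result, mib_before, beam_total_mib()}
  end
end
```

Logging these samples next to the failing calls could at least show whether the errors cluster around memory peaks on the VM side.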