elixir-nx / nx

Multi-dimensional arrays (tensors) and numerical definitions for Elixir
"module EXLA.NIF is not available" What does this mean? #1546

Closed severian1778 closed 3 weeks ago

severian1778 commented 1 month ago

image

Hi. I am not a genius like Jose Valim. I do not have 10 years of Docker experience. I am just a statistician and cannot spend days of my life learning how to compile things.

Can someone give a clear and concise, plain English description of what is happening here? It has destroyed my app and I can no longer run it after upgrading.

seanmor5 commented 1 month ago

@severian1778 is it possible for you to share the Dockerfile? Are you running on CPU or GPU? What versions? The IPC stuff is a relatively recent addition

polvalente commented 1 month ago

Are you using Docker? Can you share the Dockerfile by any chance?

This is most likely related to a CUDA flag we have in the build process.

Also, which version of Nx and EXLA are you running?

severian1778 commented 1 month ago

@seanmor5 I am not using Docker; I have a computer at my house with Ubuntu on it. I am a simple mathematician, not an advanced computer scientist, but I am looking to learn.

EXLA/NX version:
image

GPU/CUDA version image

I just need to get my app running again and I am permanently stuck with this error, which is rough!

polvalente commented 1 month ago

Could you rm -rf the cache and _build directories and rebuild?

severian1778 commented 1 month ago

@polvalente As far as I understand, the cache is in the umbrella root's _build directory? I removed it as suggested and performed mix deps.compile, and it simply returned the same error again.

polvalente commented 1 month ago

I think the cache might be under the deps/exla directory instead. The hypothesis here is that on upgrading you kept outdated .o files that didn't recompile

seanmor5 commented 1 month ago

So there are a few things that could be happening here. The symbol should be present in a shared object library called libcudart.so. One thing to try is to first find the directory this shared object is in (typically something like /usr/lib/cuda) and then add that directory to the LD_LIBRARY_PATH environment variable.
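A minimal shell sketch of that suggestion, for readers following along. The helper names prepend_lib_dir and find_cudart_dir are hypothetical, and the /usr/local search root is only an assumption; point it at wherever CUDA is installed on your machine.

```shell
# Hypothetical helper: prepend directory $1 to LD_LIBRARY_PATH
# unless it is already listed there.
prepend_lib_dir() {
  local dir="$1"
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Hypothetical helper: print the directory of the first libcudart.so*
# found under search root $1 (the /usr/local default is an assumption).
find_cudart_dir() {
  local root="${1:-/usr/local}" lib
  lib="$(find "$root" -name 'libcudart.so*' 2>/dev/null | head -n 1)"
  [ -n "$lib" ] && dirname "$lib"
}

# Usage (commented out so sourcing this file has no side effects):
#   prepend_lib_dir "$(find_cudart_dir /usr/local)"
#   echo "$LD_LIBRARY_PATH"
```

The export only affects the current shell and the processes started from it, so mix has to be run from that same shell (or from a new shell that has re-read the file where the export lives).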

It's also possible this is a build issue from recent changes, but I haven't looked into this enough yet to know if that's the case

Another thing that will probably work as a temporary fix is that you can downgrade to EXLA 0.8 and it should work.

Also worth noting that you will probably want to use the same version of both Nx and EXLA to avoid any issues.

severian1778 commented 1 month ago

@polvalente @seanmor5 ok thank you that is helpful guidance. I will try to use this to fix things up and report back. Thank you very much for the guidance.

severian1778 commented 1 month ago

@seanmor5 sadly I think the build is broken. Note:

image

and the path is clearly added in .bashrc

image

Downgrading to version 0.6.0 worked. From my research it seems that the libexla.so in the priv folder is somehow asking for that IPC symbol, but for some reason it can't resolve it when I point it to the shared lib via LD_LIBRARY_PATH.

I know that the file contains the symbol via the nm command, but the init function in EXLA.NIF is puking because the symbol can't be resolved. Pardon if I seem basic, I am not a real computer scientist, so it's all new to me.
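For anyone who wants to repeat this kind of check, a rough sketch using GNU nm. The helper names are hypothetical, and because the thread never spells out the symbol name, it is passed in as an argument rather than hard-coded.

```shell
# Hypothetical helper: succeed if shared library $1 defines
# dynamic symbol $2 (i.e. it can provide that symbol to others).
has_dynamic_symbol() {
  nm -D --defined-only "$1" 2>/dev/null | grep -qw -- "$2"
}

# Hypothetical helper: list the dynamic symbols that library $1
# expects some other library to provide at load time.
list_undefined_symbols() {
  nm -D --undefined-only "$1" 2>/dev/null | awk '{ print $NF }'
}

# Usage (paths and SYMBOL_NAME are placeholders):
#   has_dynamic_symbol /path/to/libcudart.so SYMBOL_NAME && echo "provided"
#   list_undefined_symbols /path/to/libexla.so | grep SYMBOL_NAME
```

A symbol being defined in libcudart.so and undefined in libexla.so is the expected arrangement. A load failure then means the loader did not find a library providing it, or found a different copy than the one inspected.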

polvalente commented 4 weeks ago

What does nvcc --version output for you?

polvalente commented 4 weeks ago

@severian1778

If nvcc shows 12.1 or later, that's fine. Otherwise, you should upgrade CUDA to that and libcudnn to version 9. I suspect you're somehow using CUDA 11, which isn't supported by newer versions anyway.
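A small sketch of that version check in shell. The helper names are hypothetical, and it assumes nvcc prints a line containing "release MAJOR.MINOR", which is its usual output format.

```shell
# Hypothetical helper: read `nvcc --version` output on stdin and
# print the release number, e.g. "12.2".
parse_nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed when release $1 is at least $2
# (both given as "MAJOR.MINOR").
release_at_least() {
  local have_major="${1%%.*}" have_minor="${1#*.}"
  local want_major="${2%%.*}" want_minor="${2#*.}"
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Usage (commented out; requires nvcc on PATH):
#   release="$(nvcc --version | parse_nvcc_release)"
#   release_at_least "$release" 12.1 && echo "CUDA $release is new enough"
```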

Also, look for a message like "Using libexla.so from #{cached_so}" in the mix compile logs. You should delete that file, the deps and _build directories, and then upgrade the EXLA version. This should ensure that you're starting from a clean slate.
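The clean-slate steps can be sketched as a small shell helper. The function name is hypothetical, and the cached file path is deliberately an argument: use whatever path your own compile log prints after "Using libexla.so from". By default it only prints what it would delete.

```shell
# Hypothetical helper: remove project build artifacts and one cached libexla.so.
#   $1 = project (or umbrella) root
#   $2 = path printed after "Using libexla.so from" in the compile log
# Set DRY_RUN=0 to actually delete; otherwise it only reports.
clean_exla_build() {
  local root="$1" cached_so="$2" target
  for target in "$root/_build" "$root/deps" "$cached_so"; do
    [ -e "$target" ] || continue  # skip anything that is not there
    if [ "${DRY_RUN:-1}" = "0" ]; then
      rm -rf -- "$target"
      echo "removed $target"
    else
      echo "would remove $target"
    fi
  done
}

# Usage (paths are placeholders):
#   clean_exla_build /path/to/project /path/to/cached/libexla.so
#   DRY_RUN=0 clean_exla_build /path/to/project /path/to/cached/libexla.so
```

After a real run, mix deps.get followed by mix compile fetches the dependencies again and triggers a fresh build.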

If the problem persists, we can investigate linking/path issues.

severian1778 commented 4 weeks ago

@polvalente that makes sense, I will try this guidance. I suspected there was a disconnect between the CUDA files and libexla.so, and this seems like a clear solution. In the future, would it be appropriate to have better error messaging that reflects the minimum CUDA release?

polvalente commented 4 weeks ago

The minimum CUDA version is already pretty clear in XLA, which is pointed to from EXLA.

Checking at compile time would be possible, but given that the supported CUDA version is given indirectly by openxla/xla via elixir-nx/xla, there isn't much we can do. However, if you have a suggestion on how to use openxla/xla's checks at EXLA's compile time, we can converge on a plan of attack and PRs are always welcome.

severian1778 commented 4 weeks ago

@polvalente I just updated to the latest CUDA version and still have the same problem. I guess the problem is that it's hard to understand what libexla.so is looking for. It just refuses to grab the symbol from libcudart.so. I am totally at a loss. I was on version 12.2, as the image I posted shows, so I think the CUDA version is not the problem, although I did upgrade to 12.6.

polvalente commented 4 weeks ago

Have you deleted the build and cache directories as suggested? Please do that and share the resulting logs.

What I'm looking to see is which XLA version is being pulled and whether libexla.so is getting recompiled.

Also, please share the output for nvcc --version

severian1778 commented 4 weeks ago

@polvalente

current version. image

I did delete _build and deps folder in umbrella root. image

result of log file image

perhaps 0.8.0 is wrong?

polvalente commented 4 weeks ago

The versions are correct. I mean I want to see the logs for mix compile or whichever command triggers the first compilation after build.

I suspect the global cache for EXLA got built with some improper lingering .o file and your build is still using that.

severian1778 commented 4 weeks ago

@polvalente oh gotcha, sorry. I think I got it. The issue was indeed a file in the .cache folder for EXLA in the Ubuntu home folder: there was a precompiled cached version of EXLA, and once I eliminated it and recompiled, all was well. What I did not understand was that there was a .cache folder that is not in the umbrella set up by hex! I did not install Elixir on this computer but now I understand.

Thank you @polvalente for sticking with me on this one, you are a scholar and a gentleman.

for any other newbs ->

image

This is the cache folder.

Can close this one up and chalk it up to the foolery of the OP :D

polvalente commented 4 weeks ago

Awesome! It was actually @seanmor5 that reminded me that this global cache actually exists. I wonder if it really is necessary, given that we sped up the compilation a bit.

@josevalim WDYT? Maybe we could make the cache opt-in instead so people with slower computers can use it? I consider this because more than once this cache bit me like it bit @severian1778, especially during development.

josevalim commented 4 weeks ago

The global cache is still important, for example, to not compile twice between dev/test. Are we using the current EXLA version in the global cache? Because if not, that's what I would do and that should solve it.

josevalim commented 4 weeks ago

Or do we need to include the nvcc version in the cache or something?

polvalente commented 4 weeks ago

I believe what happened is that while building the global cache, which I think does take the EXLA version into account, a stale .o got included. This probably has something to do with how Make recognizes changes in dependency files. That could be fixed by adding the EXLA version to the .o output path.

I think it doesn't compile twice between dev and test due to it being present in exla/cache already.

severian1778 commented 4 weeks ago

@josevalim I think just a nice error message there, with the nvcc version, the location of the offending file, and plain English wording such as "the compiled .so file is not compatible with nvcc ver.xx, please consider deleting the cached version and re-compiling with suggested nvcc v.xx and up", would have led me to solve the problem without the thread.

polvalente commented 4 weeks ago

The issue here wasn't the CUDA version to begin with. It was a lingering .o file from EXLA that didn't recompile properly.

We can certainly add some message that points to a troubleshooting section present either in the README or in the docs, and write one. This should appear both in compilation failures as well as NIF loading failures.

polvalente commented 4 weeks ago

Suggested points for us to add in a troubleshooting section: