elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Pioneering with Fedora 40 beta... #80

Closed. hickscorp closed this issue 3 months ago.

hickscorp commented 3 months ago

Hi!

I am trying to pioneer a bit - I got a computer whose hardware is only supported on Fedora 40 beta. Installing CUDA / cuDNN is an absolute pain, so I'll explain how I've done it in case it's helpful for some folks, and in case some of what I've done is wrong (see the end of the post).

I'm trying to compile an Elixir application, and to build the EXLA lib from source (otherwise I get a cuDNN version mismatch: 8.0 on my system, while the precompiled binaries were linked against / want 8.9).

I'm getting this output:

==> xla
Compiling 2 files (.ex)
Generated xla app
rm -f /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        ln -s "/home/doodloo/Documents/Professional/IJI/essify-ai-api/deps/xla/extension" /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
        cd /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
        bazel build --define "framework_shared_object=false" -c opt   --config=cuda --action_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_52,sm_60,sm_70,compute_80" //xla/extension:xla_extension && \
        mkdir -p /home/doodloo/.cache/xla/0.6.0/cache/build/ && \
        cp -f /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/bazel-bin/xla/extension/xla_extension.tar.gz /home/doodloo/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cuda.tar.gz
INFO: Reading 'startup' options from /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Found applicable config definition build:short_logs in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:cuda in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
INFO: Found applicable config definition build:linux in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
INFO: Repository local_config_cuda instantiated at:
  /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/WORKSPACE:19:15: in <toplevel>
  /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/workspace2.bzl:90:19: in workspace
  /home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/workspace2.bzl:626:19: in workspace
  /home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/workspace2.bzl:74:19: in _tf_toolchains
Repository rule cuda_configure defined at:
  /home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl:1205:33: in <toplevel>
DEBUG: /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'llvm-raw' because it already exists.
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 1170, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 772, column 35, in _create_local_cuda_repository
                cuda_config = _get_cuda_config(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 513, column 30, in _get_cuda_config
                config = find_cuda_config(repository_ctx, ["cuda", "cudnn"])
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 491, column 26, in find_cuda_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_cuda_config] + cuda_libraries)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.3/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'
ERROR: /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/WORKSPACE:19:15: fetching cuda_configure rule //external:local_config_cuda: Traceback (most recent call last):
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 1170, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 772, column 35, in _create_local_cuda_repository
                cuda_config = _get_cuda_config(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 513, column 30, in _get_cuda_config
                config = find_cuda_config(repository_ctx, ["cuda", "cudnn"])
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 491, column 26, in find_cuda_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_cuda_config] + cuda_libraries)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.3/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'
INFO: Found applicable config definition build:cuda in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.3/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'

make: *** [Makefile:26: /home/doodloo/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cuda.tar.gz] Error 2

Any ideas, please?


This is how I got CUDA / cuDNN installed on Fedora 40 beta:

Some of this is taken from this page on RPM Fusion.

# Add CUDA repo.
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo
sudo dnf clean all
# Disable the official nvidia-driver package, so it doesn't get installed by mistake.
sudo dnf module disable nvidia-driver
# Install CUDA!
sudo dnf -y install cuda cuda-toolkit-12

# Also add the cuDNN repo...
sudo dnf install https://developer.download.nvidia.com/compute/machine-learning/repos/rhel8/x86_64/nvidia-machine-learning-repo-rhel8-1.0.0-1.x86_64.rpm
# Install cuDNN and friends.
sudo dnf install libcudnn8 libcudnn8-devel libnccl libnccl-devel

# Install Bazel...
dnf copr enable vbatts/bazel
dnf install bazel5 # <---- Important, won't work with Bazel 4...

# Later when compiling the EXLA dependency, don't forget to set `XLA_TARGET=cuda`
# and `XLA_BUILD=true` to force a build from source. Something like:
rm -rf deps/xla
XLA_TARGET="cuda" XLA_BUILD=true mix do deps.get, deps.compile
polvalente commented 3 months ago

I believe you should be using XLA_TARGET=cuda120 instead of just cuda, which seems to be the root cause of the empty version string here: "Could not find any cuda.h matching version '' in any subdirectory:"

polvalente commented 3 months ago

You might need to downgrade CUDA, seeing that you have 12.3, although I don't expect this to be a problem. Just something to keep in mind!

jonatanklosko commented 3 months ago

sudo dnf install https://developer.download.nvidia.com/compute/machine-learning/repos/rhel8/x86_64/nvidia-machine-learning-repo-rhel8-1.0.0-1.x86_64.rpm

Is there a reason you couldn't use newer cuDNN? I'm pretty sure 8.1 is not going to fly, especially with CUDA 12, which is why we precompile with higher minimal versions.

polvalente commented 3 months ago

Look for a download here: https://developer.nvidia.com/rdp/cudnn-archive

jonatanklosko commented 3 months ago

Also, to be sure, can you check exactly what version of libcudnn8 gets installed by the package manager?

I believe you should be using XLA_TARGET=cuda120, instead of just cuda

When building from source cuda should be fine.

hickscorp commented 3 months ago

Thanks both!

Will answer your questions in one go:

1) Using cuda120 instead of cuda...

@polvalente using cuda120 as target yields the same problem.

2) Downgrading...

@polvalente downgrade to which version?

It's an absolute mess with Fedora... NVIDIA is not pushing many packages for Fedora 39 (see packages for F37 vs packages for F39)... And then I'd need to find a compatible / suitable nvidia-machine-learning-repo for cuDNN - have a peek here, it's even more of a mess :) Currently using this one.

Can you advise on what you think I could downgrade to?

3) Archive RPM...

Will have a look - thanks @polvalente .

4) Version of libcudnn8...

Currently it seems that I have 8.0.4.30-1.cuda11.1. Damn.

Next Steps

OK, so all in all, I could use some advice on which pairs of versions to use. I will sanitize the system, remove all versions, and start over. I'd appreciate some advice here, and I'll document everything, as it seems it could be useful for the community.

Thanks!

polvalente commented 3 months ago

@polvalente downgrade to which version?

12.0 or 12.1, but you might be able to get 12.3 to work.

Your cudnn8 is definitely incompatible with your CUDA version, though. I think the link I shared should help.

hickscorp commented 3 months ago

@polvalente should I go for https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo or https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/ ?

EDIT: Asking this because, as you can see, the F37 repo gives more CUDA version options... the F39 repo only has 12.4. Down the road, I also need to be able to match cuDNN etc.

polvalente commented 3 months ago

I have zero experience with RHEL and Fedora, so take this with a grain of salt, but there seems to be a generic Linux x86_64 installer under "Local Installers for Windows and Linux" in each cuDNN section, so I'd expect something similar to be available for CUDA here: https://developer.nvidia.com/cuda-toolkit-archive

hickscorp commented 3 months ago

All right. Here's an update.

I've tried getting better-matching versions. Still a very similar, or even identical, error.

ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 1170, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 772, column 35, in _create_local_cuda_repository
                cuda_config = _get_cuda_config(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 513, column 30, in _get_cuda_config
                config = find_cuda_config(repository_ctx, ["cuda", "cudnn"])
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 491, column 26, in find_cuda_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_cuda_config] + cuda_libraries)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.1/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'
ERROR: /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/WORKSPACE:19:15: fetching cuda_configure rule //external:local_config_cuda: Traceback (most recent call last):
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 1170, column 38, in _cuda_autoconf_impl
                _create_local_cuda_repository(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 772, column 35, in _create_local_cuda_repository
                cuda_config = _get_cuda_config(repository_ctx)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 513, column 30, in _get_cuda_config
                config = find_cuda_config(repository_ctx, ["cuda", "cudnn"])
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/gpus/cuda_configure.bzl", line 491, column 26, in find_cuda_config
                exec_result = execute(repository_ctx, [python_bin, repository_ctx.attr._find_cuda_config] + cuda_libraries)
        File "/home/doodloo/.cache/bazel/_bazel_doodloo/0c1c6082f009bb3ae50dd6cba240b936/external/tsl/third_party/remote_config/common.bzl", line 230, column 13, in execute
                fail(
Error in fail: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.1/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'
INFO: Found applicable config definition build:cuda in file /home/doodloo/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
Could not find any cuda.h matching version '' in any subdirectory:
        ''
        'include'
        'include/cuda'
        'include/*-linux-gnu'
        'extras/CUPTI/include'
        'include/cuda/CUPTI'
        'local/cuda/extras/CUPTI/include'
        'targets/x86_64-linux/include'
of:
        '/lib64'
        '/usr'
        '/usr/lib64/iscsi'
        '/usr/lib64/llvm17/lib'
        '/usr/lib64/pipewire-0.3/jack'
        '/usr/local/cuda'
        '/usr/local/cuda-12.1/targets/x86_64-linux/lib'
        '/usr/local/cuda/targets/x86_64-linux/lib'

make: *** [Makefile:26: /home/doodloo/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cuda120.tar.gz] Error 2

Here are the repositories I use now:

Hit me :)

hickscorp commented 3 months ago

Oh and FYI:

> file /usr/local/cuda-12.1/include/cuda.h
/usr/local/cuda-12.1/include/cuda.h: C source, Unicode text, UTF-8 text

> file /usr/local/cuda-12.1/targets/x86_64-linux/include/cuda.h 
/usr/local/cuda-12.1/targets/x86_64-linux/include/cuda.h: C source, Unicode text, UTF-8 text

Some versions:

> dnf list installed "cuda*"
Installed Packages
cuda.x86_64                                              12.1.1-1                      @cuda-fedora37-x86_64
... some more packages...
cuda-toolkit-12-1.x86_64                                 12.1.1-1                      @cuda-fedora37-x86_64
... some more packages...

> dnf list installed "libcudnn*"
Installed Packages
libcudnn8.x86_64                          8.9.7.29-1.cuda12.2                    @cudnn-local-rhel9-8.9.7.29
libcudnn8-devel.x86_64                    8.9.7.29-1.cuda12.2                    @cudnn-local-rhel9-8.9.7.29

> dnf list installed "libnccl*"
Installed Packages
libnccl.x86_64                                          2.8.3-1+cuda11.2                                    @nvidia-machine-learning
libnccl-devel.x86_64                                    2.8.3-1+cuda11.2                                    @nvidia-machine-learning

Yes, for the life of me I can't seem to match all the versions. Either it's cuda, or cudnn, or libnccl. Now it's libnccl. But to compile XLA, libnccl shouldn't be that relevant - should it?

jonatanklosko commented 3 months ago

@hickscorp with the new versions, did you try using one of the precompiled binaries? That is, running without XLA_BUILD, just XLA_TARGET=cuda120.

Btw. if everything fails, Docker may be a way to get things running.

hickscorp commented 3 months ago

@jonatanklosko thanks for sticking with me through this madness :D

I haven't tried the precompiled binaries yet. I'll give it a shot right now... I don't know how to make this work with Docker; I'll have a look. Should I use a Distrobox, an image provided by NVIDIA, or something like that?

hickscorp commented 3 months ago

OK, so not building from source seems to pass at least the compilation stage of XLA. Initially, what got me to try to compile XLA from source was that version mismatch at runtime. I wouldn't have tried it without a good reason in my sane mind :smile:.

Now, it seems to have progressed:

...
[debug] Elixir.EAIML.Actors.LLMModel[flan_t5]: Initializing with model flan_t5...
[debug] Elixir.EAIML.Actors.VectorizerModel[paraphrase]: Initializing with model paraphrase...
[info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[info] XLA service 0x7fb7bc0e3b10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[info]   StreamExecutor device (0): NVIDIA GeForce RTX 4050 Laptop GPU, Compute Capability 8.9
[info] Using BFC allocator.
[info] XLA backend allocating 5566405017 bytes on device 0 for BFCAllocator.
[debug] Elixir.EAIML.Actors.LLMModel[flan_t5]: Done with task :tokenizer.
[info] Loaded cuDNN version 8907
[info] Start cannot spawn child process: No such file or directory
[info] Start cannot spawn child process: No such file or directory
[info] Using nvlink for parallel linking
[debug] Elixir.EAIML.Actors.LLMModel[flan_t5]: Done with task :gen_config.
[debug] QUERY OK db=0.5ms idle=14.2ms
...

I'm seeing messages that indicate progress, and that I remember very much seeing when I had everything working on a different distro. So I'll report back after I try one of our models.

Out of curiosity, what kind of docker workflow do you recommend for "easy" local development, so that all vendor drivers / libs (NVIDIA etc) are separate from the host?

jonatanklosko commented 3 months ago

There are official CUDA images from NVIDIA, such as nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04.

There is also cuda-based Livebook, which you can try:

docker run -p 8080:8080 -p 8081:8081 --pull always ghcr.io/livebook-dev/livebook:latest-cuda12.1

To test XLA with the current setup, you can do this:

$ XLA_TARGET=cuda120 iex
iex> Mix.install([:nx, :exla])
iex> Nx.global_default_backend(EXLA.Backend)
iex> Nx.tensor(1)

The resulting tensor should say "cuda" when printed.
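
For reference, a minimal sketch of what a successful result looks like when inspected (the ref numbers are placeholders and will differ on your machine):

#Nx.Tensor<
  s64
  EXLA.Backend<cuda:0, 0.1234567890.1234567890.12345>
  1
>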

hickscorp commented 3 months ago

Very cool. Thanks @jonatanklosko! I'll give it a try if I can't get this to run natively on the host.

Ok so now I'm trying to run something on CUDA, and I'm pretty positive that it's running on CPU.

The application config looks like this:

# We only want to run on NVIDIA GPUs. That's what Pierre has. Feel free to change this.
config :exla, :clients,
  cuda: [platform: :cuda],
  rocm: [platform: :rocm],
  tpu: [platform: :tpu],
  cpu: [platform: :cpu],
  host: [platform: :host]

# NX should use the EXLA-provided backend.
config :nx,
  default_backend: EXLA.Backend

I tried to change the first part to:

config :exla, :clients,
  cuda: [platform: :cuda],
  host: [platform: :cuda]

But I'm guessing that's not how it works? Any idea why things wouldn't be running on CUDA now that it seems configured?

jonatanklosko commented 3 months ago

Are you using XLA_TARGET=cuda120? The best way is to export that env var in .bashrc or similar. To make sure, you can remove the deps and compile once again with the env var set.
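
For completeness, a minimal config sketch that keeps the stock client definitions and pins the default backend to the :cuda client; pinning is my own suggestion here, since with EXLA's defaults the :cuda client is already preferred whenever it loads:

config :exla, :clients,
  cuda: [platform: :cuda],
  host: [platform: :host]

# Pinning the client makes a silent fallback to :host less likely.
config :nx, default_backend: {EXLA.Backend, client: :cuda}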

hickscorp commented 3 months ago

Actually - my bad. It might be running on GPU now.


Awesome! Thanks for the help @jonatanklosko and @polvalente.

So for those interested in getting things running on F40 beta, this is more or less what I've done.

NVIDIA Stuff

# Add third-party repo.
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Update system, packages list etc.
sudo dnf update -y
# Install kernel headers, **probably** can be removed later.
sudo dnf install -y kernel-devel
# Install NVIDIA stuff.
sudo dnf install -y akmod-nvidia xorg-x11-drv-nvidia-cuda
# Enable NVIDIA power services.
sudo systemctl enable nvidia-hibernate.service nvidia-suspend.service nvidia-resume.service nvidia-powerd.service

CUDA and CuDNN Stuff

Most of this is taken from this page on RPM Fusion.

# Add CUDA repo.
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo
sudo dnf clean all
# Disable the official nvidia-driver package, so it doesn't get installed by mistake.
sudo dnf module disable nvidia-driver
# Install cuda 12.1.x, as recommended here: https://github.com/elixir-nx/xla/issues/80.
sudo dnf install cuda-12.1.1 cuda-toolkit-12.1.1

# Download CuDNN by going to this page https://developer.nvidia.com/rdp/cudnn-archive and
# choose the highest "Local Installer for RedHat". In my case:
# https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-local-repo-rhel9-8.9.7.29-1.0-1.x86_64.rpm/
# Then install it:
sudo dnf install cudnn-local-repo-rhel9-8.9.7.29-1.0-1.x86_64.rpm
sudo dnf update --refresh
sudo dnf install libcudnn8-8.9.7.29-1.cuda12.2.x86_64 libcudnn8-devel-8.9.7.29-1.cuda12.2.x86_64

# Now install the NVIDIA Machine Learning repo that has libnccl for us to use. 
sudo dnf install https://developer.download.nvidia.com/compute/machine-learning/repos/rhel8/x86_64/nvidia-machine-learning-repo-rhel8-1.0.0-1.x86_64.rpm
sudo dnf install libnccl-2.8.3-1+cuda11.2 libnccl-devel-2.8.3-1+cuda11.2

# We now need to pin some packages so they don't get upgraded automatically:
...

Later, if you want to compile XLA / EXLA for Nx and Axon from scratch, you'll also need to install Bazel:

# Install Bazel (enable the copr repo first)...
dnf copr enable vbatts/bazel
dnf install bazel5

# Later when compiling the EXLA dependency, don't forget to `export XLA_TARGET=cuda120`
# Or something like:
rm -rf deps/xla
XLA_TARGET="cuda120" mix do deps.get, deps.compile
hickscorp commented 3 months ago

@jonatanklosko / @polvalente out of curiosity...

I'm using things like Jan on my computer and it performs fairly well. With XLA / Nx, even the smallest models seem to blow up very quickly and run OOM... I'm trying to understand if I'm doing something wrong, or if it's expected when working with EXLA / Nx.

Do you have experience running models on fairly "low RAM" GPUs (e.g. 5 GB)? Are there ways to offload to RAM instead?

jonatanklosko commented 3 months ago

With XLA / Nx, even the smallest models seem to blow up very quickly and run OOM...

XLA allocates most of the GPU memory upfront by default, which is expected. Whether you hit OOM depends on the model. What model are you trying to run? 5 GB is not going to be enough for an LLM like Llama.
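
As a side note, a minimal sketch of taming the upfront allocation, assuming your EXLA version supports the :preallocate client option (check the EXLA.Client docs for your version):

config :exla, :clients,
  # Allocate on demand instead of grabbing most of the VRAM at startup.
  cuda: [platform: :cuda, preallocate: false]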

You can reduce memory usage on the GPU by adding type: :bf16 or type: :f16 in Bumblebee.load_model.
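
For example, a hedged sketch - the repo id below is just a placeholder for whatever model you actually load:

{:ok, model_info} =
  Bumblebee.load_model({:hf, "google/flan-t5-small"}, type: :bf16)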

I'm using things like Jan on my computer and it performs fairly well.

Jan uses llama.cpp underneath. One relevant feature we are currently missing is quantization, which for LLMs heavily reduces memory usage and inference time.