containers / ramalama

The goal of RamaLama is to make working with AI boring.

Local cuda container build fails with "unsupported instruction `vpdpbusd'" #471

Closed nzwulfin closed 1 day ago

nzwulfin commented 2 days ago

Trying to build on my home system, ./container_build.sh cuda fails with the following error:

/tmp/ccnKypuJ.s: Assembler messages:
/tmp/ccnKypuJ.s:31871: Error: unsupported instruction `vpdpbusd'
/tmp/ccnKypuJ.s:31926: Error: unsupported instruction `vpdpbusd'
/tmp/ccnKypuJ.s:31995: Error: unsupported instruction `vpdpbusd'
/tmp/ccnKypuJ.s:32060: Error: unsupported instruction `vpdpbusd'
/tmp/ccnKypuJ.s:32113: Error: unsupported instruction `vpdpbusd'
gmake[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:132: ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:1591: ggml/src/CMakeFiles/ggml.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2

From what I can tell online, this is due to the binutils in RHEL 9 not being new enough to support the instruction.

I made some progress by adding GCC Toolset 12 to the cuda portion of the dnf_install switch statement, but I'm not familiar enough with the toolset to know everything that needs to be set to use it correctly. I expect scl enable does a lot more with the environment than the exports below:

    dnf install -y gcc-toolset-12 
    export CC=/opt/rh/gcc-toolset-12/root/usr/bin/gcc
    export CXX=/opt/rh/gcc-toolset-12/root/usr/bin/g++

I've hit my limit for testing but thought I'd report the issue anyhow.
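For anyone else poking at this: rather than guessing at individual exports, you can dump exactly which environment variables scl enable would change. A sketch, assuming gcc-toolset-12 is installed (e.g. inside a UBI 9 container); it degrades gracefully elsewhere:

```shell
# List the environment variables that `scl enable gcc-toolset-12` adds or
# changes, by comparing the current environment against the SCL one.
# Assumes the toolset is installed; prints a note if scl itself is absent.
if command -v scl >/dev/null 2>&1; then
  scl_env_diff=$(comm -13 <(env | sort) <(scl enable gcc-toolset-12 -- env | sort))
else
  scl_env_diff="scl not installed on this host"
fi
printf '%s\n' "$scl_env_diff"
```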

nzwulfin commented 2 days ago

I examined a UBI 9 container with the toolset and the CUDA dev container and brute forced a few more exports for the build to complete. I don't think this is the right solution, but might serve as a pointer to one.

  elif [ "$containerfile" = "cuda" ]; then
    dnf install -y "${rpm_list[@]}"
    dnf install -y gcc-toolset-12 
    export CC=/opt/rh/gcc-toolset-12/root/usr/bin/gcc
    export CXX=/opt/rh/gcc-toolset-12/root/usr/bin/g++
    export PKG_CONFIG_PATH=/opt/rh/gcc-toolset-12/root/usr/lib64/pkgconfig
    export INFOPATH=/opt/rh/gcc-toolset-12/root/usr/share/info
    export LD_LIBRARY_PATH=/opt/rh/gcc-toolset-12/root/usr/lib64:/opt/rh/gcc-toolset-12/root/usr/lib:$LD_LIBRARY_PATH
    export PATH=/usr/share/Modules/bin:/opt/rh/gcc-toolset-12/root/usr/bin:/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH

bmahabirbu commented 2 days ago

@nzwulfin good analysis. Did you also try running scl enable gcc-toolset-12 bash before doing the exports? It starts a new shell with GCC Toolset 12 enabled and should avoid the error.

In general, I have personally tested the build process on Ubuntu 24.04, Ubuntu 22.04 in WSL2, and Fedora 40, but I'm new to RHEL 9!

ericcurtin commented 2 days ago

Let's open a PR and get this change in, related issue:

https://github.com/ggerganov/llama.cpp/issues/5316

nzwulfin commented 2 days ago

@bmahabirbu I did try the scl enable bash step both in the switch and after the dnf_install in the main body. I didn't see any changes to which GCC got picked up by cmake, but it also didn't throw any errors.

I didn't have any problems in a local version of the cuda:12.6.2-devel-ubi9 container:

[root@13fd7588eacd /]# scl enable gcc-toolset-12 bash

[root@30d20f630919 /]# gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It might be that invoking bash from inside a running script doesn't affect the script's own environment.
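That would explain it: scl enable gcc-toolset-12 bash starts a child shell, and when that child exits the parent script's environment is unchanged. The same effect can be demonstrated with plain bash, no scl needed:

```shell
# A child shell can only mutate its own copy of the environment; the parent
# script never sees the change. This is why `scl enable ... bash` is a no-op
# from inside a build script.
export MARKER=outer
bash -c 'export MARKER=inner'   # the child changes its own copy only
echo "$MARKER"                  # still prints: outer
```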

I found the enable file and it's mainly just a bunch of env exports. I'm going to test replacing all my exports with:

    source /opt/rh/gcc-toolset-12/enable
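For anyone curious, the enable file boils down to path prepends roughly like the following (paraphrased from a UBI 9 container, not the verbatim file; exact contents may differ between releases):

```shell
# Paraphrase of what /opt/rh/gcc-toolset-12/enable does: prepend the
# toolset's directories to the usual search paths, which is why sourcing
# it works where a couple of manual CC/CXX exports did not.
TOOLSET_ROOT=/opt/rh/gcc-toolset-12/root
export PATH="$TOOLSET_ROOT/usr/bin${PATH:+:$PATH}"
export MANPATH="$TOOLSET_ROOT/usr/share/man:${MANPATH:-}"
export INFOPATH="$TOOLSET_ROOT/usr/share/info${INFOPATH:+:$INFOPATH}"
export PKG_CONFIG_PATH="$TOOLSET_ROOT/usr/lib64/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
export LD_LIBRARY_PATH="$TOOLSET_ROOT/usr/lib64:$TOOLSET_ROOT/usr/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```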

I'll report back once I have a local build.

nzwulfin commented 2 days ago

> Let's open a PR and get this change in, related issue:
>
> ggerganov/llama.cpp#5316

Well if I had read the issue Eric linked, I could have saved all my testing this morning ;)

Based on https://github.com/ggerganov/llama.cpp/issues/5316#issuecomment-2059566094 and https://github.com/ggerganov/llama.cpp/issues/5316#issuecomment-2175919262, I should have the right combo in this attempt:

  elif [ "$containerfile" = "cuda" ]; then
    dnf install -y "${rpm_list[@]}"
    dnf install -y gcc-toolset-12 
    source /opt/rh/gcc-toolset-12/enable 

nzwulfin commented 2 days ago

The llama.cpp compile was a little noisy because of an enabled warning, but it worked, and I was able to get llama3.2 working via the notes in the discussion. I'll clean up my local repo and submit a PR so folks can look at it in context.

Here's the warning I was seeing in case someone wants to think about silencing it.

/opt/rh/gcc-toolset-12/root/usr/lib/gcc/x86_64-redhat-linux/12/include/avx512fintrin.h:5946:10: warning: '__Y' may be used uninitialized [-Wmaybe-uninitialized]
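If someone does want to silence it: the warning comes from GCC's own avx512fintrin.h header, so one option is passing the standard GCC flag -Wno-maybe-uninitialized through CMake, e.g. cmake -B build -DCMAKE_C_FLAGS="-Wno-maybe-uninitialized". Whether llama.cpp's build should actually suppress it is a judgment call; a quick sanity check that the local gcc accepts the flag:

```shell
# Verify that gcc (if present) accepts -Wno-maybe-uninitialized by compiling
# a trivial translation unit with the flag; skip cleanly when gcc is absent.
if command -v gcc >/dev/null 2>&1; then
  echo 'int main(void){return 0;}' > /tmp/flag_check.c
  gcc -Wno-maybe-uninitialized -c /tmp/flag_check.c -o /tmp/flag_check.o \
    && flag_ok=yes || flag_ok=no
else
  flag_ok=skipped
fi
echo "$flag_ok"
```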

bmahabirbu commented 2 days ago

@ericcurtin good find for that issue! I'm surprised I didn't come upon it during my search.

@nzwulfin my apologies, but thank you for testing my suggestion anyway! I guess scl enable doesn't properly give access to GCC Toolset 12 in this context. It's good to know that sourcing the enable script works.

nzwulfin commented 2 days ago

@bmahabirbu no worries, I wanted to make sure I didn't miss anything the first time I tried it!

nzwulfin commented 2 days ago

PR #473 submitted, thanks y'all!

nzwulfin commented 1 day ago

PR #473 was merged; a local test confirmed the fix.