TabbyML / tabby

Self-hosted AI coding assistant
https://tabbyml.com

ROCm docker GitHub Action build failed #2408

Open KagurazakaNyaa opened 5 months ago

KagurazakaNyaa commented 5 months ago

Describe the bug: The "Create and publish docker image" action run failed.

https://github.com/TabbyML/tabby/actions/runs/9506018585, job release-docker (rocm):

> The hosted runner: GitHub Actions 15 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Additional context: In PR #2043, I attempted to update the action versions. In my fork the image builds normally; however, after merging, the ROCm docker image still fails to build. I recommend checking whether a self-hosted Actions runner has been configured incorrectly.

wsxiaoys commented 5 months ago

Adding a revert PR to help isolate the problem:

https://github.com/TabbyML/tabby/pull/2409

KagurazakaNyaa commented 5 months ago

> Adding a revert PR to help isolate the problem:
>
> #2409

Reverting PR #2043 does not isolate this issue, because the issue existed before PR #2043. PR #2043 confirms that the issue is in the configuration outside the code repository.

wsxiaoys commented 5 months ago

Right - I just created a branch without #2403 to check the latest successful ROCm image build version and compare.

KagurazakaNyaa commented 5 months ago

The same action workflow, but without pushing the image, runs normally at https://github.com/KagurazakaNyaa/tabby/actions/runs/9496954836. That fork uses GitHub's default runner instead of a self-hosted runner. Judging from the error message in this issue, the problem seems to be with the action runner rather than the workflow. Is this repository using a GitHub-hosted runner or a self-hosted runner?

rudiservo commented 5 months ago

Also, the ROCm version is somewhat outdated: 5.7.1 is compatible with older cards, but 6.1.2 is out and brings massive improvements on newer cards. I don't know how much this affects model performance.

I tried compiling the 0.12.0 tag and I get the error below, both with my registry and locally, using this command: `serve --model /data/models/rudiservo/StarCoder2-15b-Instruct-v0.1-Q8 --device rocm --no-webserver`

```
tabby_1  | The application panicked (crashed).
tabby_1  | Message:  Invalid model_id <TabbyML/Nomic-Embed-Text>
tabby_1  | Location: crates/tabby-common/src/registry.rs:108
tabby_1  | 
tabby_1  |   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tabby_1  |                                 ⋮ 7 frames hidden ⋮                               
tabby_1  |    8: tabby_common::registry::ModelRegistry::get_model_info::h4cf4522936634953
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |    9: tabby_download::download_model::{{closure}}::h8da4574c84d31459
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   10: tabby::services::model::download_model_if_needed::{{closure}}::h88e90df5ccbc9220
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   11: tabby::serve::main::{{closure}}::h895907983720205f
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   12: tokio::runtime::park::CachedParkThread::block_on::h69f0496402a974e5
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   13: tabby::main::h244e2d137a039971
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   14: std::sys_common::backtrace::__rust_begin_short_backtrace::h37fe2660d85af9e6
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   15: std::rt::lang_start::{{closure}}::hfc465164803e6038
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   16: std::rt::lang_start_internal::h3ed4fe7b2f419135
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   17: main<unknown>
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   18: __libc_start_call_main<unknown>
tabby_1  |       at ./csu/../sysdeps/nptl/libc_start_call_main.h:58
tabby_1  |   19: __libc_start_main_impl<unknown>
tabby_1  |       at ./csu/../csu/libc-start.c:392
tabby_1  |   20: _start<unknown>
tabby_1  |       at <unknown source file>:<unknown line>
```
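An editorial aside on the `Invalid model_id` panic above: the id is resolved against tabby's model registry, so a quick sanity check is whether the embedding model exists in the registry this build points at. A sketch, assuming the public TabbyML/registry-tabby repo serves a models.json at its root (the exact URL layout is an assumption and may differ per version):

```bash
# Assumed registry URL - verify against the TabbyML/registry-tabby repo.
curl -s https://raw.githubusercontent.com/TabbyML/registry-tabby/main/models.json \
  | grep -i "Nomic-Embed-Text"
```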
rudiservo commented 4 months ago

The ROCm docker image is still not building; the latest available version is 0.11.

wsxiaoys commented 4 months ago

Hi - we turned off the ROCm build, as our GitHub Actions runner is not able to complete it. As an alternative, I recommend using the Vulkan backend for AMD GPU deployments.
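For reference, a minimal sketch of that fallback, assuming a tabby build with Vulkan support is on the PATH (the model id here is illustrative, not a recommendation from this thread):

```bash
# Serve completions on an AMD GPU through Vulkan instead of ROCm.
tabby serve --model StarCoder-1B --device vulkan
```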

rudiservo commented 4 months ago

@wsxiaoys well, llama.cpp's ROCm docker builds are also failing, but the Metal ones are OK. I am going to try to fix the llama.cpp build, then check whether you have a similar issue or something I can quickly fix.

rudiservo commented 4 months ago

@wsxiaoys so I figured out one part, but I am kind of hitting a wall; maybe some config is missing?

In build.rs you need to change `config.define("LLAMA_HIPBLAS", "ON");` to `config.define("GGML_HIPBLAS", "ON");`

and add the following for compatibility with ROCm 6.1.2 and to future-proof the build:

```rust
config.define(
    "CMAKE_HIP_COMPILER",
    format!("{}/llvm/bin/clang++", rocm_root),
);
config.define("HIPCXX", format!("{}/llvm/bin/clang", rocm_root));
config.define("HIP_PATH", format!("{}", rocm_root));
```
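To sanity-check those defines outside of cargo, the same configuration can be exercised against a llama.cpp checkout directly. A sketch, assuming ROCm is installed under /opt/rocm (GGML_HIPBLAS applies to llama.cpp versions after the LLAMA_* to GGML_* flag rename):

```bash
# Configure and build llama.cpp's llama-server with HIP/ROCm, mirroring the
# defines above. /opt/rocm is an assumption - substitute your install root.
ROCM_ROOT=/opt/rocm
HIPCXX="$ROCM_ROOT/llvm/bin/clang" cmake -S . -B build \
  -DGGML_HIPBLAS=ON \
  -DCMAKE_HIP_COMPILER="$ROCM_ROOT/llvm/bin/clang++" \
  -DHIP_PATH="$ROCM_ROOT"
cmake --build build --target llama-server -j
```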

but now I get this error:

```
WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory
```

and I can't figure out why llama-server can't load libomp.so.
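One way to narrow that down (an editorial debugging sketch; the binary path is taken from the error above) is to ask the dynamic linker what is missing and where it is looking:

```bash
# Flag any shared-library dependencies the loader cannot resolve.
ldd /opt/tabby/bin/llama-server | grep "not found"

# Does the binary carry an rpath/runpath pointing at the ROCm libraries?
readelf -d /opt/tabby/bin/llama-server | grep -E "RPATH|RUNPATH"

# Is libomp registered anywhere in the system linker cache?
ldconfig -p | grep libomp
```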

I managed to build llama.cpp's llama-server with the ROCm 5.7.1 and 6.1.2 docker images, and it runs great.

Everything was tested today with tabby's master branch.

Any pointers why this happens?

JayPi4c commented 4 months ago

I tried to get v0.13.1 working with an AMD GPU and hit the very same warning as @rudiservo:

```
WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory
```

I had a look inside the container and, like llama_cpp_server, I could not find libomp.so or anything related such as libomp.so.5. So I tried adding libomp-dev to the packages installed in the runtime image, which installs /usr/lib/x86_64-linux-gnu/libomp.so.5 (among other things). Creating a symlink then does solve the problem, and I was able to run tabby v0.13.1 built from Dockerfile.rocm. So what it comes down to is this part of the runtime image:

```dockerfile
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    git \
    curl \
    openssh-client \
    ca-certificates \
    libssl3 \
    rocblas \
    hipblas \
    libgomp1 \
    # add the package that provides libomp.so
    libomp-dev \
    && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    # create the symlink
    ln -s /usr/lib/x86_64-linux-gnu/libomp.so.5 /usr/lib/x86_64-linux-gnu/libomp.so
```

I want to stress that I have no experience with any of this; it's just tinkering around to get tabby working with AMD on my machine. I don't know whether libomp-dev is the appropriate package for this problem, or whether some other package already installs libomp.so and it just isn't in a location the loader searches. It also feels wrong to have to create the symlink manually.
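To check whether libomp-dev is really the right provider, one can ask dpkg which package owns each file (an editorial sketch; package and path names are as reported in the comment above and may differ per base image):

```bash
# Which installed package owns the versioned runtime library?
dpkg -S /usr/lib/x86_64-linux-gnu/libomp.so.5

# What did libomp-dev install, including any unversioned .so symlink?
dpkg -L libomp-dev | grep "libomp.so"
```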

rudiservo commented 4 months ago

@JayPi4c That is strange; libomp exists under /opt/rocm. My llama.cpp docker image (not the tabby one) works fine, but it was built with make, not cmake... maybe there's a cmake option that isn't picking up /opt/rocm.

```
root@3a7c21116e01:/app# find -L /opt -name "libomp.so"
/opt/rocm/lib/llvm/lib/libomp.so
/opt/rocm/lib/llvm/lib-debug/libomp.so
/opt/rocm/llvm/lib/libomp.so
/opt/rocm/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/llvm/lib/libomp.so
/opt/rocm-6.1.2/llvm/lib-debug/libomp.so
```
JayPi4c commented 4 months ago

Thanks! I did not know about /opt/rocm.

Currently PATH looks like this:

```
root@d17bd20c90d1:/# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/tabby/bin
```

So there is no reference to /opt/rocm. But there is no reference to /usr/lib/x86_64-linux-gnu either, and libomp.so is still picked up from there. I also quickly checked: simply adding /opt/rocm/lib/llvm/lib to PATH does not solve the problem, so I guess some other configuration is needed to point llama-server at the correct location of libomp.so. Sadly, I don't know C++, Rust, or their build tools, so I don't know where that reference belongs. I did find documentation on using ROCm with CMake, which mentions CMAKE_PREFIX_PATH, but again I don't know what to do with that information.
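An editorial note on why the PATH experiments above change nothing: PATH only affects executable lookup; shared libraries are resolved by the dynamic linker through the binary's rpath/runpath, LD_LIBRARY_PATH, and the ldconfig cache. Two runtime-side workarounds, using the library location from the find output above:

```bash
# Option 1: point the dynamic linker at ROCm's libomp for a single run.
LD_LIBRARY_PATH=/opt/rocm/llvm/lib /opt/tabby/bin/llama-server --help

# Option 2: register the directory in the system-wide linker cache.
echo /opt/rocm/llvm/lib > /etc/ld.so.conf.d/rocm-llvm.conf && ldconfig
```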

rudiservo commented 4 months ago

@JayPi4c well, there is this in llama.cpp's Makefile:

```make
MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64
MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas
```

I can understand Makefiles; with CMake I'll admit my ignorance.

I do not know if these flags are even passed along when building with CMake.

In the Makefile they are merged into LDFLAGS:

```make
override LDFLAGS := $(MK_LDFLAGS) $(LDFLAGS)
```

and for llama-server they are passed via:

```make
$(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2)
```

Well, time to learn CMake and figure out what is going on.
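For comparison, a rough CMake-side equivalent of those Makefile link flags (an editorial sketch, not taken from llama.cpp's build files; it shows how the same -L/-rpath behavior could be supplied at configure time):

```bash
# Reproduce the Makefile's ROCm -L/-rpath flags in a CMake configure step.
ROCM_PATH=/opt/rocm
cmake -S . -B build -DGGML_HIPBLAS=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-L$ROCM_PATH/lib -Wl,-rpath,$ROCM_PATH/lib"
```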

rudiservo commented 4 months ago

I think I might have found the issue.

I'm going to try compiling llama.cpp with CMake to test.

I think that in llama.cpp's cmake/llama-config.cmake.in, the GGML_HIPBLAS branch has a find_package call but does not add the ROCm path as an add_library.

I will refer to the issue I opened against llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8213