KagurazakaNyaa opened 5 months ago
Adding a revert PR to help isolate the problem: #2409
Reverting PR #2043 does not isolate this issue, because the issue existed before PR #2043; PR #2043 only confirms that the problem lies in configuration outside the code repository.
Right - just created a branch without #2403 to check the latest successful ROCm image build version and to compare.
The same action workflow, minus the image push, runs normally at https://github.com/KagurazakaNyaa/tabby/actions/runs/9496954836. That fork uses GitHub's default runner instead of a self-hosted runner. Judging from the error message in this issue, the problem seems to lie with the action runner rather than with the workflow. Is this repository using a GitHub-hosted runner or a self-hosted runner?
Also, the ROCm version used (5.7.1) is somewhat outdated. It is still compatible with older cards, but 6.1.2 is out and brings major improvements on newer cards; I don't know how much this affects model performance.
I tried compiling the 0.12.0 tag and I get this error with my registry; I also tried it locally with this command:
command: serve --model /data/models/rudiservo/StarCoder2-15b-Instruct-v0.1-Q8 --device rocm --no-webserver
tabby_1 | The application panicked (crashed).
tabby_1 | Message: Invalid model_id <TabbyML/Nomic-Embed-Text>
tabby_1 | Location: crates/tabby-common/src/registry.rs:108
tabby_1 |
tabby_1 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tabby_1 | ⋮ 7 frames hidden ⋮
tabby_1 | 8: tabby_common::registry::ModelRegistry::get_model_info::h4cf4522936634953
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 9: tabby_download::download_model::{{closure}}::h8da4574c84d31459
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 10: tabby::services::model::download_model_if_needed::{{closure}}::h88e90df5ccbc9220
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 11: tabby::serve::main::{{closure}}::h895907983720205f
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 12: tokio::runtime::park::CachedParkThread::block_on::h69f0496402a974e5
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 13: tabby::main::h244e2d137a039971
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 14: std::sys_common::backtrace::__rust_begin_short_backtrace::h37fe2660d85af9e6
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 15: std::rt::lang_start::{{closure}}::hfc465164803e6038
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 16: std::rt::lang_start_internal::h3ed4fe7b2f419135
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 17: main<unknown>
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 | 18: __libc_start_call_main<unknown>
tabby_1 | at ./csu/../sysdeps/nptl/libc_start_call_main.h:58
tabby_1 | 19: __libc_start_main_impl<unknown>
tabby_1 | at ./csu/../csu/libc-start.c:392
tabby_1 | 20: _start<unknown>
tabby_1 | at <unknown source file>:<unknown line>
tabby_1 |
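For what that panic amounts to, here is a purely hypothetical sketch (not Tabby's actual registry code; every name and field below is illustrative): a model id is looked up in a model registry, and a registry that has no entry for the requested id fails with exactly this kind of message.

// Hypothetical sketch only -- not the real tabby_common::registry implementation.
use std::collections::HashMap;

struct ModelInfo {
    urls: Vec<String>, // download locations; illustrative field
}

struct ModelRegistry {
    // keyed by model id, e.g. "TabbyML/Nomic-Embed-Text"
    models: HashMap<String, ModelInfo>,
}

impl ModelRegistry {
    fn get_model_info(&self, model_id: &str) -> &ModelInfo {
        self.models
            .get(model_id)
            .unwrap_or_else(|| panic!("Invalid model_id <{}>", model_id))
    }
}

fn main() {
    // A registry with no entry for the embedding model reproduces the message above.
    let registry = ModelRegistry { models: HashMap::new() };
    registry.get_model_info("TabbyML/Nomic-Embed-Text");
}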
The ROCm Docker image is still not building; the latest published version is 0.11.
Hi - we turned off the ROCm build because our GitHub Actions runner is not able to complete it. As an alternative, I recommend using the Vulkan backend for AMD GPU deployments.
@wsxiaoys well, llama.cpp's ROCm docker builds are also failing, while the Metal ones are OK. I am going to try to fix the llama.cpp build, then check whether you have a similar issue or something I can quickly fix.
@wsxiaoys so I figured out one part, but I am kind of hitting a wall; maybe some config is missing?
In build.rs you need to change this
config.define("LLAMA_HIPBLAS", "ON");
to
config.define("GGML_HIPBLAS", "ON");
and add this for compatibility with ROCm and to future-proof for 6.1.2:
config.define( "CMAKE_HIP_COMPILER", format!("{}/llvm/bin/clang++", rocm_root), ); config.define( "CMAKE_HIP_COMPILER", format!("{}/llvm/bin/clang++", rocm_root), ); config.define( "HIPCXX", format!("{}/llvm/bin/clang", rocm_root), ); config.define( "HIP_PATH", format!("{}", rocm_root), );
but now I get this error
WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory
I can't figure out why llama-server can't access libomp.so.
I managed to build llama.cpp's llama-server with both the ROCm 5.7.1 and 6.1.2 Docker images and it runs great. Everything was tested today against Tabby's master branch.
Any pointers on why this happens?
I tried to get v0.13.1 working with an AMD GPU and came across the very same warning as @rudiservo (WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory).
I had a look into the container and, just like llama_cpp_server, I could not find libomp.so or anything related like libomp.so.5. So I tried adding libomp-dev to the packages installed for the runtime image. This installs /usr/lib/x86_64-linux-gnu/libomp.so.5 (among other things). With that in place, creating a symlink does in fact solve the problem, and I was able to run tabby v0.13.1 built from Dockerfile.rocm.
So again, what it comes down to is this part in the runtime image:
RUN apt-get update && \
apt-get install -y --no-install-recommends \
git \
curl \
openssh-client \
ca-certificates \
libssl3 \
rocblas \
hipblas \
libgomp1 \
# add the package that provides libomp.so
libomp-dev \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# create the symlink
ln -s /usr/lib/x86_64-linux-gnu/libomp.so.5 /usr/lib/x86_64-linux-gnu/libomp.so
I want to stress that I have no experience with any of this. It's just tinkering around to get Tabby with AMD working on my machine, so I don't know whether libomp-dev is the appropriate package for this problem, or whether other packages already install libomp.so and it simply isn't on the PATH or wherever it needs to be found. It also feels wrong to have to create the symlink manually.
@JayPi4c That is strange; libomp exists under /opt/rocm. With my llama.cpp Docker image (not the Tabby one) it works fine, but it was built with make, not cmake... maybe there is a cmake option that isn't picking up /opt/rocm.
root@3a7c21116e01:/app# find -L /opt -name "libomp.so"
/opt/rocm/lib/llvm/lib/libomp.so
/opt/rocm/lib/llvm/lib-debug/libomp.so
/opt/rocm/llvm/lib/libomp.so
/opt/rocm/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/llvm/lib/libomp.so
/opt/rocm-6.1.2/llvm/lib-debug/libomp.so
Thanks! I did not know about /opt/rocm. Currently PATH looks like this:
root@d17bd20c90d1:/# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/tabby/bin
So there is no reference to /opt/rocm. But there is no reference to /usr/lib/x86_64-linux-gnu either, and yet libomp.so is picked up from there. I also quickly checked, and simply adding /opt/rocm/lib/llvm/lib to PATH does not solve the problem. So I guess there needs to be some other configuration to point llama-server to the correct location of libomp.so.
Sadly I don't know C++ or Rust and their build tools, so I don't know where to put that reference. But I did find documentation on how to use ROCm with cmake, which says something about CMAKE_PREFIX_PATH. Again, though, I don't know what to do with this information.
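In case it helps: from the Rust side, CMAKE_PREFIX_PATH can be handed to CMake just like the other defines earlier in this thread. A minimal sketch, assuming the cmake crate and a ROCM_PATH environment variable (assumptions of mine, not verified against Tabby's actual build.rs):

// Sketch: point CMake's find_package() search at the ROCm install prefix.
fn main() {
    // Assumption: ROCM_PATH is set in the build environment, e.g. /opt/rocm.
    let rocm_root = std::env::var("ROCM_PATH").unwrap_or_else(|_| "/opt/rocm".to_string());
    cmake::Config::new("llama.cpp")
        // CMAKE_PREFIX_PATH is a list of install prefixes that find_package()
        // searches before the default system locations.
        .define("CMAKE_PREFIX_PATH", &rocm_root)
        .build();
}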
@JayPi4c well, there is such a reference in llama.cpp's Makefile:
MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64
MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas
I can understand Makefiles; with cmake I'm going to admit my ignorance.
I do not know whether these flags are even passed along when building with CMake.
In the Makefile they are folded into LDFLAGS:
override LDFLAGS := $(MK_LDFLAGS) $(LDFLAGS)
and in the llama-server target they are passed via:
$(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2)
well time to learn cmake and figure out what is going on.
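Since the llama-server binary here is built by CMake (driven from build.rs) rather than by make, the closest equivalent to those Makefile flags would be passing the same -L/-rpath linker flags through the CMake invocation. A sketch under the assumption that the cmake crate and a ROCM_PATH environment variable are available (illustrative only, not verified against Tabby's actual build.rs):

// Sketch: reproduce the Makefile's -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
// behaviour for the CMake-built executables, so they can locate ROCm shared
// libraries at runtime.
use std::env;

fn main() {
    let rocm_root = env::var("ROCM_PATH").unwrap_or_else(|_| "/opt/rocm".to_string());

    let mut config = cmake::Config::new("llama.cpp");
    config.define("GGML_HIPBLAS", "ON");
    config.define(
        "CMAKE_EXE_LINKER_FLAGS",
        // {rocm}/llvm/lib is where the earlier `find` output located libomp.so.
        format!(
            "-L{rocm}/lib -Wl,-rpath,{rocm}/lib:{rocm}/llvm/lib",
            rocm = rocm_root
        ),
    );
    config.build();
}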
I think I might have found the issue.
Going to try and compile llamacpp with cmake to test.
I think that in llama.cpp's cmake/llama-config.cmake.in, the GGML_HIPBLAS branch calls find_package but does not add the ROCm path for the library.
I will refer to the issue I opened in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/8213
Describe the bug
The "Create and publish docker image" action run failed: https://github.com/TabbyML/tabby/actions/runs/9506018585, job release-docker (rocm). "The hosted runner: GitHub Actions 15 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."
Additional context
In PR #2043, I attempted to update the action version. In my fork it builds normally; however, after merging, the ROCm docker images still cannot be built. It is recommended to check whether the self-hosted Actions runner has been configured incorrectly.