I also get Illegal instruction (core dumped) when using the Docker image, while compiling from source seems to solve the issue.
This is on Pop!_OS 22.04 with kernel 6.2.0-76060200 on a Ryzen 5 5600X (x86_64 with AVX2), with GCC 11.
Some versions are fine though: we found that light-19726169b379bebc96189673a19b89ab1d307659 doesn't seem to have this problem, but light-34c1072e497eb92d81ee7c0e12aa6741496a41c6 does.
(we've been tracking this here too: https://github.com/nsarrazin/serge/pull/66)
Illegal instruction sounds like using an instruction which your processor does not support. I've touched on the issue in this discussion:
It's clear that as long as CPU features are determined at compile time, distributing binaries is going to cause problems like this.
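For illustration, a minimal sketch (not from llama.cpp itself) of how this happens: if the file below is compiled with -mavx2, the binary contains 256-bit AVX2 instructions unconditionally, and running it on a CPU without AVX2 raises SIGILL, which the shell reports as "Illegal instruction (core dumped)".

/* sketch: build with `gcc -O2 -mavx2 avx2_probe.c` (hypothetical file name) and the
   compiler emits 256-bit vpaddd; a pre-AVX2 CPU faults with SIGILL at runtime */
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    __m256i a = _mm256_set1_epi32(1);
    __m256i b = _mm256_add_epi32(a, a);         /* AVX2-only instruction */
    printf("%d\n", _mm256_extract_epi32(b, 0)); /* prints 2 on AVX2 hardware */
    return 0;
}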
@slaren That explains why the binaries are so inconsistent. We don't know what CPUs the GitHub runners are using, which makes the binaries unusable.
Machine spec: Motherboard: Supermicro X9DRi-LN4+ (REV 1.20A), CPU: dual Xeon E5-2670 v2, RAM: 128 GB DDR3 LRDIMM.
@gaby Yes, they can vary, but for compilation it doesn't matter which CPU the runner has, only for tests. As you can see in the discussion, the Windows builds always build AVX-512 but only test it when possible. If the Docker builder looks at its own features when compiling the binaries, then it's misconfigured: if I compile something for the Game Boy Advance on my x86 PC, it's not the features of my PC that I should target. I'm not too familiar with Docker, but I suppose there has to be an option not to precompile binaries and instead ship the sources inside the container, to be compiled as the first step of installation.
But I don't know; the whole raison d'être for Docker containers is to deal with the huge mess of interconnected dependencies in the Linux world, which are hard to handle. This project doesn't have any dependencies or libraries and can simply be built on any machine, so I don't understand the value proposition of Docker for this project at all, except the negative value of constantly having to deal with issues related to it.
If you are an absolute fan of Docker and you just absolutely, positively have to have it, the container could literally contain a single .sh script which would do:
git clone https://github.com/ggerganov/llama.cpp.git
make
And that's it, lol. The beauty of having no libraries and dependencies.
For precompiled binaries, currently the only option is to build packages for different option sets, like the Windows releases. In the future a better option would be to detect the features at runtime, unless that can't be done without a performance penalty, though it probably can. It needs to be researched a bit, because it would affect inlining, which cannot be done when the code paths aren't static. If inlining yields a performance benefit then we have to stick with the multiple builds, as speed > everything else.
With Unraid I have two ways to run it: with a VM or with Docker. With Docker I can share the resources with other processes, while with a VM I 'lock' the resources. That, for me, is the added value of Docker.
@anzz1 Thanks for the insight. After several tries it seems that compiling llama.cpp as a first step during runtime is the solution.
On a project with a million dependencies and libraries this might be a problem, but since there are no dependencies and it builds on anything, compilation shouldn't pose a problem nor take more than a few seconds. However, in the post above there's an ongoing discussion about adding the ability to check processor features at runtime. There is some work involved in accurately testing and verifying that it can be done without hurting performance, so it's currently on the backlog behind more important issues.
Once it's properly established that it can be done without degrading performance, actually adding it isn't hard at all; we just have to be sure not to introduce a regression while doing it.
@Netsuno Did you succeed? I have Unraid too but haven't managed to run it in Docker.
@Taillan I made my own Docker image to run it, but my Unraid server is not powerful (2x Xeon E5-2670 v2), so I have put the idea of making it work on Unraid on hold for now (it takes 100% of my CPU for 2 minutes to generate one answer).
A case for runtime detection:
In any reasonable, modern cloud deployment, llama.cpp would end up inside a container. In fact, being CPU-only, llama.cpp enables deploying your ML inference to something like AWS Lambda or GCP Cloud Run, providing very simple, huge scalability for inference. All these systems use containerization and expect you to have pre-built binaries ready to go. Compiling at container launch is not really an option, as that significantly increases cold-start/scale-up latencies (a few seconds is too long).
However, the higher up the serverless stack you go, the less control you have over the CPU platform underneath. GCP, for example, has machines from the Haswell era onward all intermingled, and they don't even document what to expect for Cloud Functions or Cloud Run.
I'm not a C expert by any means, so it's not my wheelhouse to offer up a PR, but the case for this is pretty strong IMO.
Got the same error:
ERROR: /app/.devops/tools.sh: line 40: 6 Illegal instruction ./main $arg2
when I executed the command:
docker run -v /models/llama7b:/home ghcr.io/ggerganov/llama.cpp:full --run -m /home/ggml-model-q4_1.bin -p "hello" -n 512
My environment is:
Docker Toolbox 1.13.1, Docker client 1.13.1 (OS/Arch: Windows 7/amd64), Docker server 19.03.12 (OS/Arch: Ubuntu 22.04/amd64)
Can anyone help?
I'm afraid that you'll have to rebuild the image locally and use that instead. But that isn't very complicated.
Yeah, the modern cloud environment, where in many cases you have less control over and knowledge of the underlying hardware than you used to, is unfortunate, but it is reality.
You definitely do not want to go all-out runtime detection in a performance-driven application like this and lose the compiler optimizations allowed by compile-time detection with simple #ifdefs, hurting everyone else in the process for the sake of cloud and containerization, but there is a case for having it both ways.
Something like this:
inline unsigned int ggml_cpu_has_avx512(void) {
#if defined(CPUDETECT_RUNTIME)
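/* raw x86-64 machine code, roughly: xor ecx,ecx; mov eax,7; cpuid; shr ebx,16; and ebx,1;
   i.e. CPUID leaf 7, sub-leaf 0, EBX bit 16 (AVX-512 Foundation) */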
static unsigned const char a[] = {0x53,0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x0F,0xA2,0xC1,0xEB,0x10,0x83,0xE3,0x01,0x89,0xD8,0x5B,0xC3};
return ((unsigned int (__cdecl *)(void)) (void*)((void*)a))();
#elif defined(__AVX512F__)
return 1;
#else
return 0;
#endif
}
Then replacing any #ifdef __AVX512F__ with ggml_cpu_has_avx512() allows for runtime detection when configured as such; when not configured, the compiler should optimize it away and produce the same end result as the #ifdef, without messing up its optimization logic. However, compilers can be finicky sometimes, so it's definitely prudent to check with a disassembler that the end result really is the same.
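For illustration, here's a minimal sketch of that pattern. dot_scalar and dot_avx512 are hypothetical stand-ins for the real kernels, and note that once a file is no longer compiled with -mavx512f, the AVX-512 kernel needs a per-function target attribute such as __attribute__((target("avx512f"))) on GCC/Clang:

#include <stddef.h>

/* hypothetical kernels standing in for the real SIMD and scalar code paths;
   dot_avx512 would be built with __attribute__((target("avx512f"))) */
float dot_scalar(const float *x, const float *y, size_t n);
float dot_avx512(const float *x, const float *y, size_t n);

unsigned int ggml_cpu_has_avx512(void); /* as defined above */

/* before: the path is chosen at compile time and only one survives in the binary */
float dot_static(const float *x, const float *y, size_t n) {
#ifdef __AVX512F__
    return dot_avx512(x, y, n);
#else
    return dot_scalar(x, y, n);
#endif
}

/* after: both paths are compiled in and the branch is taken at runtime;
   when ggml_cpu_has_avx512() folds to a constant, the dead branch is removed */
float dot_dynamic(const float *x, const float *y, size_t n) {
    if (ggml_cpu_has_avx512()) {
        return dot_avx512(x, y, n);
    }
    return dot_scalar(x, y, n);
}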
edit: To be clear, using bytecode in the example above isn't being obtuse for the sake of being obtuse in some misguided attempt to look smart. Optimally you'd use __asm { }, but the reason you can't is that, contrary to every other compiler out there, MSVC decided to drop inline assembly support for the 64-bit era, a decision that has been the bane of low-level coders' existence ever since. Bytecode is the only thing that works with every compiler. If you want to see what's going on, you can copy-paste the bytecode above into https://disasm.czbix.com/ for example.
Here's a list of some of the (x86) processor feature checks in bytecode: cpuid.h
The rest can be found in the x86 documentation:
Intel ® Architecture Instruction Set Extensions and Future Features "Chapter 1.5 CPUID Instruction"
AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions "Appendix D: Instruction Subsets and CPUID Feature Flags"
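As a side note, and not something the project itself relies on, GCC and Clang can perform the same runtime checks without hand-written bytecode via compiler builtins; a minimal sketch:

/* runtime feature detection using GCC/Clang builtins; MSVC would instead need
   __cpuidex() from <intrin.h>, which is where a compiler-agnostic approach
   like the bytecode above comes in */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init(); /* populate the CPU feature data (required on older GCC) */
    printf("avx:     %d\n", __builtin_cpu_supports("avx")     ? 1 : 0);
    printf("avx2:    %d\n", __builtin_cpu_supports("avx2")    ? 1 : 0);
    printf("avx512f: %d\n", __builtin_cpu_supports("avx512f") ? 1 : 0);
    return 0;
}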
The easiest solution, imho, is to provide multiple versions of the container image. It doesn't have to cover all architectures, and it doesn't have to be every release; a setup which covers >=85% of consumers at any given time is enough.
The rest can rebuild.
Sure, the easiest solution would be to create a container image for every configuration set, and it could be easily automated with GitHub Actions. It's a solution, but not a good one, as it's not future-proof and carries the risk of getting stuck with a bad practice. You know what they say: nothing is more permanent than a temporary solution :smile:
I didn't say every configuration set. Unless the runtime detection has only negligible impact on performance, I think it's better for consumers to just get an image optimized for their architecture. Obviously there is a point of diminishing returns, but even Intel provides optimized images for AVX-512 [0].
[0] https://hub.docker.com/r/intel/intel-optimized-tensorflow-avx512
Sure, you could do automatic separate images for AVX / AVX2 / AVX512 like the Windows releases by just editing the action, no code change necessary.
Or, as the binaries are rather small, you could pack all of them into one image and add a simple script for "launch-time" detection, if you will, something like this:
#!/bin/sh
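# launch-time detection: inspect the /proc/cpuinfo flags and run the most capable prebuilt binary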
cpuinfo="$(cat /proc/cpuinfo)"
if [ $(echo "$cpuinfo" | grep -c avx512) -gt 0 ]; then
./llama_avx512 "$@"
elif [ $(echo "$cpuinfo" | grep -c avx2) -gt 0 ]; then
./llama_avx2 "$@"
else
./llama_avx "$@"
fi
If your model is already quantized, this did the trick for me, using the light image:
docker run -v /E/Projects/llama.cpp/models:/models ghcr.io/ggerganov/llama.cpp:light -m models/7B/llama-2-7b-chat.ggmlv3.q4_0.bin -p "hello" -n 512
I have a similar issue in Docker on some machines. I'm using local/llama.cpp:full-cuda.
After an strace, it turned out /server couldn't find libcublas.so.11.
However, I have it in /usr/local/cuda-11.7/targets/x86_64-linux/lib/libcublas.so.11.
Perhaps something is wrong with the way I built it; still investigating.
gdb
Starting program: /app/server
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
0x0000559816d04ef6 in gpt_params::gpt_params() ()
strace
ldd /app/server
linux-vdso.so.1 (0x00007fffbff45000)
libcublas.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.11 (0x00007fab17200000)
libcudart.so.11.0 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007fab16e00000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fab16bd4000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007fab17119000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fab20484000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007fab169ac000)
/lib64/ld-linux-x86-64.so.2 (0x00007fab21001000)
libcublasLt.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007fab02a00000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007fab2047d000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fab20478000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007fab20473000)
The CUDA-enabled Docker image works like a charm and is fairly quick, but then the 1 out of 10 machines you deploy to crashes with this 'illegal instruction' error.
The issue is also often reported in downstream projects:
https://github.com/ollama/ollama/issues/2187 https://github.com/search?q=repo%3Aoobabooga%2Ftext-generation-webui+illegal+instruction&type=issues
I'm not too familiar with these instructions, but is it not feasible to have one workflow that builds one Docker image that can be deployed reliably? It just works so well, and not having a single Docker image is a bit of a shame.
Try:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.7/targets/x86_64-linux/lib
https://stackoverflow.com/questions/54249577/importerror-libcuda-so-1-cannot-open-shared-object-file
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'm trying to run the Docker version on Unraid.
I run this as Post Arguments:
--run -m /models/7B/ggml-model-q4_0.bin -p "This is a test" -n 512
I got this error:
/app/.devops/tools.sh: line 40: 7 Illegal instruction ./main $arg2
Log:
I have run this without any issues:
--all-in-one "/models/" 7B