Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

ape mmap fails on embedded devices whose linux kernels are configured to have <48bit address space #74

Closed lukaszsobala closed 9 months ago

lukaszsobala commented 9 months ago

Hello,

I am trying to execute the llama file on arm64 linux: Linux rock-5b 5.10.110-rockchip-rk3588 #23.02.2 SMP Fri Feb 17 23:59:20 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

I added ape to the binfmt service. The error is as in the description:

ape error: llava-v1.5-7b-q4-server.llamafile: prog mmap failed w/ errno 12

Any idea what this might be?

The architecture of the CPU (Rockchip RK3588) is ARMv8-A - it should be enough, right?

lscpu lists these flags for the cpu: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp.

I will try this on a mainline kernel too and report.

jart commented 9 months ago

12 means ENOMEM so I'd assume you're out of memory. Could you post the output of strace ape ./foo.llamafile please?
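For anyone who wants to decode the number themselves, here is a minimal sketch using Python's standard `errno` and `os` modules. The helper name `describe_errno` is made up for illustration. The message text comes from the C library, so it differs between systems (the two strace logs in this thread show both "Cannot allocate memory" and "Out of memory" for the same code).

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno value."""
    # errno.errorcode maps numbers to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror asks the C library for its human-readable message.
    return f"{name}: {os.strerror(code)}"


print(describe_errno(12))
```

Note that ENOMEM from mmap() does not always mean RAM is exhausted; it is also what the kernel can return when the requested address range is not usable.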

lukaszsobala commented 9 months ago

Not sure what foo.llamafile is but it can't find it:

strace ape ./foo.llamafile
execve("/usr/bin/ape", ["ape", "./foo.llamafile"], 0x7fe26c6408 /* 34 vars */) = 0
faccessat(AT_FDCWD, "./foo.llamafile", X_OK) = -1 ENOENT (No such file or directory)
write(2, "ape error: ./foo.llamafile: not "..., 68ape error: ./foo.llamafile: not found (maybe chmod +x or ./ needed)
) = 68
exit_group(127)                         = ?
+++ exited with 127 +++

But if I do it on the actual file, the output is:

strace ape ./llava-v1.5-7b-q4-server.llamafile
execve("/usr/bin/ape", ["ape", "./llava-v1.5-7b-q4-server.llamaf"...], 0x7fcc81bf18 /* 34 vars */) = 0
faccessat(AT_FDCWD, "./llava-v1.5-7b-q4-server.llamafile", X_OK) = 0
openat(AT_FDCWD, "./llava-v1.5-7b-q4-server.llamafile", O_RDONLY) = 3
pread64(3, "MZqFpD='\n\n\0\20\0\370\0\0\0\0\0\0\0\1\0\10@\0\0\0\0\0\0\0"..., 8192, 0) = 8192
pread64(3, "\1\0\0\0\5\0\0\0\0\200$\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0"..., 280, 3888) = 280
mmap(0x10000000000, 2005760, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x248000) = -1 ENOMEM (Cannot allocate memory)
write(2, "ape error: ./llava-v1.5-7b-q4-se"..., 77ape error: ./llava-v1.5-7b-q4-server.llamafile: prog mmap failed w/ errno 12
) = 77
exit_group(127)                         = ?
+++ exited with 127 +++
jart commented 9 months ago

Thanks. Could you give me ulimit -a output?

lukaszsobala commented 9 months ago
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 61756
max locked memory           (kbytes, -l) 2012040
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 61756
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
jart commented 9 months ago

I honestly have no idea why your device is reporting that it's out of memory when trying to map a measly 2 MB from a 4 GB file. If anyone has any ideas, feel free to chime in. Wish I could have been more helpful! Thanks for trying llamafile.

lukaszsobala commented 9 months ago

Thanks! I will post about results from mainline here, maybe it's an ancient BSP kernel issue.

jart commented 9 months ago

Do keep us posted. Someone on Discord reported a similar issue yesterday working on a similar device (that wasn't you was it?)

lukaszsobala commented 9 months ago

> Do keep us posted. Someone on Discord reported a similar issue yesterday working on a similar device (that wasn't you was it?)

This wasn't me!

On mainline it loads fine, excerpt from last lines:

llama server listening at http://127.0.0.1:8080

failed to open http://127.0.0.1:8080/ in a browser tab using xdg-open: No such file or directory
loading weights...
{"timestamp":1702073176,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache

And no need to do any ape binfmt tricks. I am running it headless so obviously it can't open a browser.

Here is the kernel used: Linux rock-5b 6.5.0-rc2-mesa-rockchip-rk3588 #1 SMP PREEMPT Thu Aug 10 18:23:50 CEST 2023 aarch64 aarch64 aarch64 GNU/Linux - it's an out of date mainline so it should work fine on the current one too.

Edit: hmm, now I'm getting a "connection refused" error, something with my network? Trying via X forwarding now... Edit2: OK, it works.

jart commented 9 months ago

By BSP I'm assuming you mean this https://github.com/rockchip-linux/kernel ? That's very helpful to know that it's a custom linux kernel that the embedded hardware maker maintains on their own. Is it satisfactory for you to use the normal linux kernel? Or do we need a better workaround? Could you file an issue with rockchip reporting that mmap() doesn't appear to work and link this?

lukaszsobala commented 9 months ago

I am using an Armbian fork of their fork 😅 because this is the only one that gives me display output on a no name monitor.

I could prepare something like this but I'd first need to test if it's fixed in 5.10.160 (and 6.1 which will be released soon).

jart commented 9 months ago

We'd all really appreciate that effort if you can. Notice how it's the first mmap() call that fails? It could be the case that rockchip has a custom memory manager, which is something that's exceedingly difficult to write, and I could very easily imagine it being the case that the implementation just schleps the whole thing into memory and has a 32-bit size limit. I'm going to close out this issue since it doesn't appear to be actionable on our part. Please be sure to reference this issue though in the upstream report. Thanks!

t-chab commented 9 months ago

Don't know if it could help, but I have the exact same problem when I try to run llamafile in an up to date Termux on a Fairphone 4 (Qualcomm Snapdragon 750G SoC) running Murena /e/OS (Android v12).

uname -a output :

Linux localhost 4.19.157-perf-g005d9fbe9437 #1 SMP PREEMPT Thu Nov 9 18:09:18 UTC 2023 aarch64 Android

lscpu output:

Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: Qualcomm
Model name: Kryo-4XX-Silver
Model: 14
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Stepping: 0xd
CPU(s) scaling MHz: 100%
CPU max MHz: 1804.8000
CPU min MHz: 300.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Cortex-A77
Model: 0
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r1p0
CPU(s) scaling MHz: 100%
CPU max MHz: 2208.0000
CPU min MHz: 300.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Not affected
  Srbds: Not affected
  Tsx async abort: Not affected

ulimit -a :

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 40
file size                   (blocks, -f) unlimited
pending signals                     (-i) 20733
max locked memory           (kbytes, -l) 65536
max memory size             (kbytes, -m) unlimited
open files                          (-n) 32768
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 20733
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

strace output :

execve("/data/data/com.termux/files/usr/tmp/.ape-1.9", ["/data/data/com.termux/files/usr/"..., "./llava-v1.5-7b-q4-server.llamaf"...], 0x7fcc375c68 /* 30 vars */) = 0
faccessat(AT_FDCWD, "./llava-v1.5-7b-q4-server.llamafile", X_OK) = 0
openat(AT_FDCWD, "./llava-v1.5-7b-q4-server.llamafile", O_RDONLY) = 3
pread64(3, "MZqFpD='\n\n\0\20\0\370\0\0\0\0\0\0\0\1\0\10@\0\0\0\0\0\0\0"..., 8192, 0) = 8192
pread64(3, "\1\0\0\0\5\0\0\0\0\200$\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0"..., 280, 3888) = 280
mmap(0x10000000000, 2005760, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x248000) = -1 ENOMEM (Out of memory)
write(2, "ape error: ./llava-v1.5-7b-q4-se"..., 77ape error: ./llava-v1.5-7b-q4-server.llamafile: prog mmap failed w/ errno 12
) = 77
exit_group(127)                         = ?
+++ exited with 127 +++

lukaszsobala commented 9 months ago

It seems to indeed be fixed in Rockchip kernel 5.10.160 (also Armbian, I tried it on an rk3566 board which is a bit slow for this, to say the least - but works).

codeisnotcode commented 8 months ago

Same error as OP on a Rockchip 3588 running 5.10.160, so that kernel with that chip is still a problem. I haven't had an opportunity to try it with a Rockchip 3566.

jart commented 8 months ago

For what it's worth, we ended up discovering the issue here is you need a 48 bit address space.

On Raspberry Pi, if you get "mmap error 12" then it means your kernel is configured with fewer than 48 bits of address space. You need to upgrade to RPI 5. You can still use RPI 4 if you either (1) rebuild your kernel, or (2) get your SDcard OS image directly from Ubuntu (don't use RPI OS).

Added to https://github.com/mozilla-Ocho/llamafile?tab=readme-ov-file#gotchas
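As a worked check of that diagnosis, the addresses in the strace output earlier in the thread can be compared by bit length. This is only a sketch under one assumption that the thread does not state: that the envp pointer passed to execve() (0x7fe26c6408 on the RK3588) sits near the top of the user address space, as the stack normally does on Linux. If so, the failing kernel would appear to offer 39 bits of user address space, while the loader asks for a fixed mapping at 0x10000000000. The helper names are made up for illustration.

```python
# Addresses copied from the RK3588 strace output earlier in the thread.
ENVP_POINTER = 0x7FE26C6408    # envp argument shown in the execve() line
MMAP_ADDRESS = 0x10000000000   # fixed address requested by the ape loader
MMAP_LENGTH = 2005760          # bytes requested by the same mmap() call


def bits_needed(address: int) -> int:
    """Number of bits required to represent an address."""
    return address.bit_length()


def fits(address: int, length: int, va_bits: int) -> bool:
    """True if [address, address + length) lies below 2**va_bits."""
    return address + length <= (1 << va_bits)


print(bits_needed(ENVP_POINTER))                    # 39
print(bits_needed(MMAP_ADDRESS))                    # 41
print(fits(MMAP_ADDRESS, MMAP_LENGTH, va_bits=39))  # False
print(fits(MMAP_ADDRESS, MMAP_LENGTH, va_bits=48))  # True
```

Under that assumption, the MAP_FIXED request at 2^40 cannot be satisfied inside a 39-bit address space no matter how much RAM is free, which would be consistent with ENOMEM on the very first mmap() call.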

lukaszsobala commented 8 months ago

> Same error as OP on a Rockchip 3588 running 5.10.160, so that kernel with that chip is still a problem. I haven't had an opportunity to try it with a Rockchip 3566.

There are (unfortunately) various forks of Rockchip 5.10.160. Which operating system does yours come from?

BroJac5246 commented 8 months ago

> Don't know if it could help, but I have the exact same problem when I try to run llamafile in an up to date Termux on a Fairphone 4 (Qualcomm Snapdragon 750G SoC) running Murena /e/OS (Android v12).

Same issue here on a Pixel 7a w/ 8 GB of RAM (stock Android 14). It's probably not supported but I would still love to see this fixed somehow.

jart commented 8 months ago

It's not currently supported because it's a use case I never thought of. I'm an open source developer who builds tools that let people independently distribute binaries. Phones are a closed platform that requires all binaries be compiled and distributed by a central source. @BroJac5246 I mean no impertinence but why do you want it? Does Google actually authorize people to bundle software like llamafile in their apps? Is this a hacked phone you're using as a desktop computer? Keep in mind I know next to nothing about app development, since I haven't done it in more than 15 years.

BroJac5246 commented 8 months ago

> It's not currently supported because it's a use case I never thought of. I'm an open source developer who builds tools that let people independently distribute binaries. Phones are a closed platform that requires all binaries be compiled and distributed by a central source.

Termux is a program for Android that provides a Linux environment in which you can install packages and execute commands and programs. That's what we're using.

> @BroJac5246 I mean no impertinence but why do you want it?

Mostly because it's cool, but also because my phone is probably more powerful than my ancient computer 😂 Others might have more practical use cases, though.

> Does Google actually authorize people to bundle software like llamafile in their apps?

So long as it fits within APK size limitations (which small models could), then yes, I'm pretty sure it's allowed (though this isn't really something I work with).

> Is this a hacked phone you're using as a desktop computer?

Nope. It's stock, un-rooted Android.

> Keep in mind I know next to nothing about app development, since I haven't done it in more than 15 years.

No worries! I figured it would probably be unsupported, but it was definitely worth checking out.

jart commented 8 months ago

Thank you for the information @BroJac5246 that's very helpful.

Cosmopolitan used to support Windows 7 where we used a hack to shrink the address space to have fewer bits. That support got moved into a separate branch called vista here: https://github.com/jart/cosmopolitan/tree/vista It might be possible to resurrect some of the differences in modern Cosmo, so that llamafile could use it.

Before considering resurrecting that, I'd want to know Google isn't intending to move to a 48-bit address space soon, similar to how RPI just did.

Here's something I think is really cool. I bought the following off Amazon:

It arrived this morning and I can't believe how fast it is. It has a 48 bit address space with a 16kb page size. If I run TinyLlama Q5_K_M on it then it generates 13 tokens per second, whereas my RPI 4 could only do 5 tokens/s. The set of supported ISAs has also grown from fp asimd evtstrm crc32 cpuid on RPI 4 to fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp on RPI 5 so I can't wait to see if there's opportunities for us to utilize that.
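To make the comparison concrete, here is a small sketch that diffs the two flag lists quoted above and works out the speedup from the two throughput figures. The variable and function names are made up for illustration; the flag strings and the 13 and 5 tokens/s numbers are taken from the comment as written.

```python
# CPU feature flags as quoted in the comment above.
RPI4_FLAGS = "fp asimd evtstrm crc32 cpuid".split()
RPI5_FLAGS = (
    "fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp "
    "asimdhp cpuid asimdrdm lrcpc dcpop asimddp"
).split()


def added_flags(old, new):
    """Flags present in `new` but not in `old`, sorted alphabetically."""
    return sorted(set(new) - set(old))


print(added_flags(RPI4_FLAGS, RPI5_FLAGS))
# ['aes', 'asimddp', 'asimdhp', 'asimdrdm', 'atomics', 'dcpop',
#  'fphp', 'lrcpc', 'pmull', 'sha1', 'sha2']

# Reported TinyLlama Q5_K_M throughput: 13 tokens/s vs 5 tokens/s.
print(13 / 5)  # 2.6
```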

BroJac5246 commented 8 months ago

> Thank you for the information @BroJac5246 that's very helpful.

> Cosmopolitan used to support Windows 7 where we used a hack to shrink the address space to have fewer bits. That support got moved into a separate branch called vista here: https://github.com/jart/cosmopolitan/tree/vista It might be possible to resurrect some of the differences in modern Cosmo, so that llamafile could use it.

Very interesting.

> Before considering resurrecting that, I'd want to know Google isn't intending to move to a 48-bit address space soon, similar to how RPI just did.

What is it currently?

> Here's something I think is really cool. I bought the following off Amazon:

Raspberry Pis are really cool! I have a 3 B+ and it fascinates me to see what such a small computer can do, not that it's anywhere near as powerful as the 5.

> It arrived this morning and I can't believe how fast it is. It has a 48 bit address space with a 16kb page size. If I run TinyLlama Q5_K_M on it then it generates 13 tokens per second, whereas my RPI 4 could only do 5 tokens/s.

Wow

> The set of supported ISAs has also grown from fp asimd evtstrm crc32 cpuid on RPI 4 to fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp on RPI 5 so I can't wait to see if there's opportunities for us to utilize that.

I would be way in over my head trying to understand that but I hope you can do a lot with it 😂

davidmacmillan commented 8 months ago

Orange Pi OS (Arch) on a Rockchip 3588, which is based on GNU/Linux. uname -a gives me: Linux orangepi5plus 5.10.160-rockchip-rk3588 #1.0.8 SMP Mon Nov 13 18:27:15 CST 2023 aarch64 aarch64 aarch64 GNU/Linux

codeisnotcode commented 8 months ago

Respectfully, I don't see how the Rockchip 3588 error is due to a <48 bit address space.

The code runs on a Raspberry Pi 5 and that has four Cortex A76 cores. The Rockchip 3588 has four Cortex A76's and four Cortex A55's. ARM A76 cores and ARM A55 cores both have 40 bit architectural physical address spaces. So the Raspberry Pi doesn't meet the 48 bit criteria, yet it runs.

https://www.rock-chips.com/uploads/pdf/2022.8.26/192/RK3588%20Brief%20Datasheet.pdf
https://developer.arm.com/Processors/Cortex-A76
https://developer.arm.com/Processors/Cortex-A55
https://www.raspberrypi.com/products/raspberry-pi-5/
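One way to look at this: per the issue title and the maintainer's comment, the limit being discussed is the virtual address space the kernel is configured with, which is a separate number from the physical address width of the cores. A rough way to estimate the former on a running system is to look at the highest user mapping in /proc/self/maps. The sketch below is a heuristic only; it assumes the highest mapping (normally the stack) sits just below the top of the user address space, and the function name and sample lines are made up for illustration.

```python
import os


def estimate_user_va_bits(maps_text: str) -> int:
    """Estimate user virtual address bits from /proc/<pid>/maps text."""
    highest = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if not fields or fields[-1] == "[vsyscall]":
            continue  # x86-64 vsyscall page lives in the kernel's range
        # First field looks like "start-end", both in hexadecimal.
        end = int(fields[0].split("-")[1], 16)
        highest = max(highest, end)
    if highest == 0:
        raise ValueError("no mappings found")
    # The end address is exclusive, so measure the last mapped byte.
    return (highest - 1).bit_length()


# Hypothetical sample resembling a 39-bit arm64 process.
SAMPLE = (
    "5555550000-5555560000 r-xp 00000000 b3:02 1234 /usr/bin/python3\n"
    "7fe26a5000-7fe26c7000 rw-p 00000000 00:00 0 [stack]\n"
)
print(estimate_user_va_bits(SAMPLE))  # 39

# On Linux, also report the estimate for the current process.
if os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        print(estimate_user_va_bits(f.read()))
```

If two boards with the same cores report different values here, that would point at the kernel configuration, not the silicon.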

codeisnotcode commented 8 months ago

Digging a bit further, the Raspberry Pi 5 is stepping r4p1 of Cortex A76 and the Rockchip 3588 is stepping r4p0. See lscpu outputs below.

For the change from r4p0 to r4p1, ARM reports "No functional changes to core for this revision" - page A1-32 of: https://developer.arm.com/documentation/100798/latest/

raspberrypi5:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A76
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r4p1
CPU(s) scaling MHz: 58%
CPU max MHz: 2400.0000
CPU min MHz: 1000.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Not affected
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Mitigation; CSV2, BHB
  Srbds: Not affected
  Tsx async abort: Not affected

orangepi5plus:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r2p0
CPU max MHz: 1800.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name: Cortex-A76
Model: 0
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
Stepping: r4p0
CPU max MHz: 2400.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d: 384 KiB (8 instances)
  L1i: 384 KiB (8 instances)
  L2: 2.5 MiB (8 instances)
  L3: 3 MiB (1 instance)
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Vulnerable: Unprivileged eBPF enabled
  Srbds: Not affected
  Tsx async abort: Not affected

lukaszsobala commented 8 months ago

> Orange Pi OS (Arch) on a Rockchip 3588, which is based on GNU/Linux. uname -a gives me: Linux orangepi5plus 5.10.160-rockchip-rk3588 #1.0.8 SMP Mon Nov 13 18:27:15 CST 2023 aarch64 aarch64 aarch64 GNU/Linux

There is your answer. You're using the orangepi fork of the kernel, which has this problem, while the Armbian fork does not. It's a mess.

codeisnotcode commented 7 months ago

Thanks!

flatsiedatsie commented 7 months ago

Just saw this same issue on a Raspberry Pi 4 with 4 GB of RAM, trying to load a 700 MB model.

Linux dev 5.15.84-v8+ #1613 SMP PREEMPT Thu Jan 5 12:03:08 GMT 2023 aarch64 GNU/Linux

lukaszsobala commented 7 months ago

@flatsiedatsie which distribution is this kernel from? The version alone doesn't say much.

flatsiedatsie commented 7 months ago

Raspberry Pi OS 64 Lite

lukaszsobala commented 7 months ago

@flatsiedatsie try it with Ubuntu, Debian Bookworm or Armbian.

flatsiedatsie commented 6 months ago

Thanks for the suggestion. I already solved it by using llama.cpp directly.

kinchahoy commented 4 months ago

I get this error on an 8GB Raspberry Pi 5

uname -a
Linux pi58gb 6.6.28+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.28-1+rpt1 (2024-04-22) aarch64 GNU/Linux

Is there an obvious thing I can do to upgrade / modify my OS to enable this? (It's an updated Raspberry Pi OS 64 Lite)

lukaszsobala commented 4 months ago

@kinchahoy it seems that the venerable Raspberry Pi Foundation chooses to gimp the kernel. As you can see above, another person using the distro had the same problem. You need to either install a different kernel (if you don't want to change the distribution) or recompile it yourself. The option to change is somewhere in this thread...
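The thread never actually names the option. As far as I know, the mainline arm64 Kconfig exposes the choice as CONFIG_ARM64_VA_BITS_39 / CONFIG_ARM64_VA_BITS_48 (and others), with the resulting number stored in CONFIG_ARM64_VA_BITS; treat that as an assumption to verify against your own kernel tree. The sketch below reads that value out of a kernel config. The function names are made up, and /proc/config.gz only exists when the kernel was built with CONFIG_IKCONFIG_PROC, so the reader returns None when it is missing.

```python
import gzip
import os
import re
from typing import Optional


def va_bits_from_config(config_text: str) -> Optional[int]:
    """Return the CONFIG_ARM64_VA_BITS value from kernel config text."""
    match = re.search(r"^CONFIG_ARM64_VA_BITS=(\d+)$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_running_kernel_config() -> Optional[str]:
    """Return the running kernel's config text, or None if unavailable."""
    path = "/proc/config.gz"  # needs CONFIG_IKCONFIG_PROC in the kernel
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt") as f:
        return f.read()


# Hypothetical config fragment for a 39-bit kernel.
SAMPLE_CONFIG = "CONFIG_ARM64_VA_BITS_39=y\nCONFIG_ARM64_VA_BITS=39\n"
print(va_bits_from_config(SAMPLE_CONFIG))  # 39

running = read_running_kernel_config()
if running is not None:
    print(va_bits_from_config(running))
```

A value below 48 on an affected board would match the diagnosis earlier in the thread; rebuilding with the 48-bit choice, or switching to a distribution kernel that already uses it, are the two routes mentioned above.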