Pelochus / ezrknpu

Easy usage of Rockchip's NPUs found in RK3588 and similar chips
GNU General Public License v3.0

Fix for memory issues #3

Open av1d opened 2 months ago

av1d commented 2 months ago

Here are some different scenarios and their results if there is an issue loading models, as well as a solution at the end:

as regular user (after setting ulimit -n 4096):

av1d@ubuntu:~/ez/qwen-1_8B-rk3588$ rkllm qwen-chat-1_8B.rkllm 
loaded template from prompt.txt
rkllm init start
E RKNN: [07:54:33.648] failed to allocate handle, ret: -1, errno: 23, errstr: Too many open files in system
E RKNN: [07:54:33.648] failed to malloc npu memory, size: 6528, flags: 0x2
E RKNN: [07:54:33.648] failed to allocate handle, ret: -1, errno: 23, errstr: Too many open files in system
E RKNN: [07:54:33.648] failed to malloc npu memory, size: 480, flags: 0xa
E RKNN: [07:54:33.648] failed to allocate handle, ret: -1, errno: 23, errstr: Too many open files in system
E RKNN: [07:54:33.648] failed to malloc npu memory, size: 8832, flags: 0x2
E RKNN: [07:54:33.648] failed to allocate handle, ret: -1, errno: 23, errstr: Too many open files in system
E RKNN: [07:54:33.648] failed to malloc npu memory, size: 960, flags: 0xa
[truncated for brevity]

as sudo with -E to preserve environment (and ulimit -n 4096):

av1d@ubuntu:~/ez/qwen-1_8B-rk3588$ sudo -E rkllm qwen-chat-1_8B.rkllm 
loaded template from prompt.txt
rkllm init start
E RKNN: [07:54:49.293] failed to convert handle(1020) to fd, ret: -1, errno: 24, errstr: Too many open files
Segmentation fault

as root (default ulimit -n 1024):

root@ubuntu:/home/av1d/ez/qwen-1_8B-rk3588# rkllm qwen-chat-1_8B.rkllm 
loaded template from prompt.txt
rkllm init start
E RKNN: [07:56:27.818] failed to convert handle(1020) to fd, ret: -1, errno: 24, errstr: Too many open files
Segmentation fault
root@ubuntu:/home/av1d/ez/qwen-1_8B-rk3588# 

it only works as root, and only after setting ulimit to 4096:

root@ubuntu:/home/av1d/ez/qwen-1_8B-rk3588# ulimit -n 4096
root@ubuntu:/home/av1d/ez/qwen-1_8B-rk3588# rkllm qwen-chat-1_8B.rkllm 
loaded template from prompt.txt
rkllm init start
RKLLM init success!

also doesn't work: editing /etc/security/limits.conf (* soft nofile 16384 and * hard nofile 1048576).

The solution for me was to edit /etc/sysctl.conf, add fs.file-max = 1048576, then run sudo sysctl -p afterwards. Tested on Ubuntu 22.04.3 LTS. I can now load models as a regular user without issue.
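
The logs above show two different errno values, which may explain why the sysctl change helped the regular user: 23 is ENFILE (a system-wide limit) and 24 is EMFILE (a per-process limit). Here is a minimal Python sketch that decodes them and reads the current fs.file-max. The function names and hint strings are illustrative assumptions and are not part of rkllm or RKNN:

```python
import errno
import os

# Hints are an interpretation of the log lines in this thread,
# not official RKNN documentation.
HINTS = {
    errno.ENFILE: "system-wide open-file table is full (see fs.file-max)",
    errno.EMFILE: "this process hit its own descriptor limit (see ulimit -n)",
}


def explain_errno(code: int) -> str:
    """Return a one-line description of an errno value seen in an RKNN log."""
    name = errno.errorcode.get(code, "UNKNOWN")
    hint = HINTS.get(code, "no hint available")
    return f"errno {code} ({name}): {os.strerror(code)} -> {hint}"


def read_file_max(path: str = "/proc/sys/fs/file-max"):
    """Return the system-wide open-file limit, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for code in (23, 24):
        print(explain_errno(code))
    print("fs.file-max:", read_file_max())
```

If errno 23 shows up, raising ulimit -n alone is unlikely to help, since that only changes the per-process limit.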

Pelochus commented 2 months ago

Have you tried adding this to /etc/security/limits.conf?

root soft nofile 16384

or

your_user_name_here soft nofile 16384

In my case I tested with root user and it worked. Surprisingly didn't work using * instead of root... Gonna check your solution once you verify that second example for me.
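
On the * entry: limits.conf(5) notes that wildcard and group limits are not applied to the root user, so root needs its own explicit line. Entries in limits.conf are applied by pam_limits when a session is opened, so they only show up after logging in again. To check what a session actually got, and to raise the soft limit from inside a launcher (the equivalent of ulimit -n 4096), here is a rough Python sketch. The function name and the default of 4096 are assumptions for illustration:

```python
import resource


def raise_nofile_soft_limit(target: int = 4096):
    """Raise the soft RLIMIT_NOFILE toward target, never above the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    inf = resource.RLIM_INFINITY
    if soft == inf or soft >= target:
        return soft, hard  # already high enough, leave it alone
    # An unprivileged process may raise its soft limit only up to the hard limit.
    new_soft = target if hard == inf else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit(4096))
```

Resource limits are inherited by child processes, so a wrapper could call this and then start rkllm. It only addresses the per-process limit (errno 24), not the system-wide one (errno 23).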

av1d commented 2 months ago

Have you tried adding this to /etc/security/limits.conf?

Yes, I did to no avail (mentioned that above). Thankfully my solution works well. One more related thing I would like to document publicly in general is this error:

av1d@ubuntu:~/ez/qwen-1_8B-rk3588$ rkllm qwen-chat-1_8B.rkllm 
loaded template from prompt.txt
rkllm init start
E RKNN: [16:12:46.194] failed to allocate handle, ret: -1, errno: 14, errstr: Bad address
Segmentation fault

This is basically a "resource exhaustion" message, or a sign that the memory locations are already in use. I think that's worth knowing because the error text itself is not helpful.
You can trigger it, for example, by starting one instance of rkllm and then a second one while the first is still running.
I'm noting it here because it might help us debug more memory issues.
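
One way to avoid starting a second instance by accident is an advisory lock file taken by a wrapper before it launches rkllm. This is only a sketch: the lock path is a hypothetical choice, and nothing here is part of rkllm or the RKNN runtime:

```python
import fcntl
import os


def acquire_single_instance_lock(path: str):
    """Try to take an exclusive, non-blocking advisory lock on path.

    Returns the open file descriptor holding the lock, or None if another
    holder already has it. Keep the descriptor open while the NPU is in use.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock location; any path writable by all intended users works.
    lock_fd = acquire_single_instance_lock("/tmp/rkllm.lock")
    if lock_fd is None:
        print("another rkllm wrapper seems to be running, not starting a second one")
    else:
        print("lock acquired, safe to launch rkllm from here")
        os.close(lock_fd)  # closing the descriptor releases the lock
```

The lock is advisory, so it only protects against other launches that go through the same wrapper.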

Pelochus commented 2 months ago

Interesting finding there. For now I think it's better to leave that specific thing to the Rockchip guys. After all, without an open source library we can only work around those problems.

As for my proposed solution, I meant to put the exact user you want soft/hard file limits modified not using the *. Are you sure you tried this? It works for me on Armbian Jammy (based on Ubuntu 22 LTS, same as you).

I'm going to try your solution anyway, perhaps this is something specifically tied to Ubuntu, AFAIK changing limits.conf should work for pretty much any Linux distro.

av1d commented 2 months ago

My apologies, you are correct, there is a distinction there. I do not believe I tried that method, and I am not sure how to test it: I removed the settings from /etc/sysctl.conf and rebooted, but they persist for some reason.

Pelochus commented 2 months ago

Don't worry. Test it whenever you can. If we confirm that method works for you, I will switch to the usual limits.conf method, adding entries for both the root user and the regular user (passed in as an argument, for example).