getumbrel / llama-gpt

A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
https://apps.umbrel.com/app/llama-gpt
MIT License

Memory Allocation Error #48

Open · VivaWolf opened 1 year ago

VivaWolf commented 1 year ago

Currently, I can only get the 7B model running, and it takes 15-20 seconds per token.

Docker Desktop shows the container's memory usage at only 600-800 MB of a 1.89 GB limit, with only 2 cores allocated.

I'm getting this error:

warning: failed to mlock 73728000-byte buffer (after previously locking 73744384 bytes): Cannot allocate memory
llama-gpt-llama-gpt-api-7b-1  | Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
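
For reference, a quick way to see the limit that warning refers to is to check the memlock ulimit inside the container (container name taken from the log lines above; yours may differ):

docker exec -it llama-gpt-llama-gpt-api-7b-1 sh -c 'ulimit -l'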

System specs:

- Installed Physical Memory (RAM): 64.0 GB
- Processor: 12th Gen Intel(R) Core(TM) i7-12700K, 3600 MHz, 12 Core(s), 20 Logical Processor(s)

Benchmark results:

llama-gpt-llama-gpt-api-7b-1  | llama_print_timings:        load time = 31486.69 ms
llama-gpt-llama-gpt-api-7b-1  | llama_print_timings:      sample time =    33.00 ms /    34 runs   (    0.97 ms per token,  1030.21 tokens per second)
llama-gpt-llama-gpt-api-7b-1  | llama_print_timings: prompt eval time = 31485.54 ms /    83 tokens (  379.34 ms per token,     2.64 tokens per second)
llama-gpt-llama-gpt-api-7b-1  | llama_print_timings:        eval time = 574080.73 ms /    33 runs   (17396.39 ms per token,     0.06 tokens per second)
llama-gpt-llama-gpt-api-7b-1  | llama_print_timings:       total time = 606068.42 ms
llama-gpt-llama-gpt-api-7b-1  |

Sorry if I'm missing something obvious; I've been troubleshooting for several hours now with no luck.

mayankchhabra commented 1 year ago

That generation speed is indeed very slow considering your hardware. Can you confirm whether you're running Docker directly on the host and not inside a VM? Can you also share the output of the following (while LlamaGPT is running):

docker exec -it llama-gpt-2-llama-gpt-api-7b-1 cat /proc/cpuinfo
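
If the container name differs on your install, docker ps will list the exact names. A quicker check of how many cores the container actually sees (assuming the image ships coreutils, which the stock one should) is:

docker exec -it llama-gpt-2-llama-gpt-api-7b-1 nproc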

VivaWolf commented 1 year ago

Not using a VM. Here's the output from that command:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 151
model name      : 12th Gen Intel(R) Core(TM) i7-12700K
stepping        : 2
microcode       : 0xffffffff
cpu MHz         : 3609.597
cache size      : 25600 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves gfni vaes vpclmulqdq rdpid fsrm flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips        : 7219.19
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 151
model name      : 12th Gen Intel(R) Core(TM) i7-12700K
stepping        : 2
microcode       : 0xffffffff
cpu MHz         : 3609.597
cache size      : 25600 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves gfni vaes vpclmulqdq rdpid fsrm flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips        : 7219.19
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

VivaWolf commented 1 year ago

Potential duplicate of #47, my bad.

mayankchhabra commented 1 year ago

That is weird. I haven't investigated why it's only detecting 2 cores and not 12. As a quick workaround, you might be able to speed things up by setting n_threads=20 in api/run.sh:

https://github.com/getumbrel/llama-gpt/blob/87dfbe265cf88e3da24f404bda9153cf344b4014/api/run.sh#L40
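
As a rough sketch of what that change looks like (not the literal file contents; the exact line varies between versions, and the model path here is just illustrative):

# api/run.sh -- pass a thread count through to the llama-cpp-python server it launches
python3 -m llama_cpp.server --model /models/llama-2-7b-chat.bin --n_threads 20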

Let us know if that works! If it's still pegging only 2 cores to 100% and leaving the others idle, it would be worth checking your Docker configuration to see if it's somehow limiting the CPUs available to containers.
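
One clue: the cpuinfo you pasted includes the hypervisor flag, which suggests the containers are running inside Docker Desktop's backend VM (Docker Desktop on Windows always uses one), so the 2-core / ~1.9 GB ceiling is likely the VM's rather than a per-container limit. If that's the case: with the Hyper-V backend the limits live under Docker Desktop's Settings > Resources, and with the WSL 2 backend they come from %UserProfile%\.wslconfig on the Windows host, e.g. (illustrative values):

[wsl2]
processors=20
memory=48GB

followed by wsl --shutdown and a restart of Docker Desktop.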

Re the memory allocation error: that's just a warning and can safely be ignored. If you want to get rid of it, we just added memlock support to the 7B model, so update your installation with git pull origin master and try running the model again.
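
If it helps, the full update sequence is roughly the following (assuming the stock docker compose deployment; adjust to however you launched it):

git pull origin master
docker compose down
docker compose up -d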