antimatter15 / alpaca.cpp

Locally run an Instruction-Tuned Chat-Style LLM
MIT License
10.25k stars · 910 forks

CPU limit slowing 30B, memory pool limit #154

Closed · clover1980 closed this 2 months ago

clover1980 commented 1 year ago

First of all, guys, I want to thank you for bringing such a great instrument into people's hands. Half the countries on the planet are already blocked from ChatGPT, and many people keep forgetting this. Secondly, I advise everyone not to waste time with the 7B and 13B models; the real ChatGPT experience starts only with the 30B model. It can hold a conversational pattern, has some short-term memory of things said earlier, and if you call out its mistakes (for example, it can't always determine the current time) it can turn it all into a joke (the 13B model can do none of this). I have to say it's incredibly well optimized; for comparison, I wasn't able to run the 1.5-billion-parameter GPT-2 model even on a GPU, only the 774-million one. In my opinion, 13B produces roughly the same amount of gibberish as the 774M GPT-2.

Now about the problems. There's definitely a CPU limit present, maybe intended for low-end hardware (since at higher speeds its RAM usage also grows faster; 30B grows to 24-25 GB)? On the 13B model it used 17% of the CPU, and on the 30B model it still uses only 17% CPU at most. This limit seriously ruins the whole experience with 30B, making it twice as slow as 13B in response time and even in typing speed (it writes like some ancient IBM machine). For powerful hardware the limit should be removable; I have plenty of resources, with 128 GB of RAM in quad-channel mode and a 14-core Xeon (on my machine, 30B together with Google Chrome uses 20% of RAM in total). But I don't see any way to remove the CPU limit, since your files are pure machine code. There's also some memory limit that makes 30B crash after a certain amount of work; it always ends the discussion abruptly, around the 5th-7th prompt, with this message: `ggml_new_tensor_impl: not enough space in the context's memory pool (needed 537269808, available 536870912)`
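For context on that error: ggml (the tensor library underneath alpaca.cpp) carves every tensor out of a single fixed-size arena that is reserved up front when the context is created, and 536870912 bytes is exactly 512 MB. Once the growing conversation needs more than fits in that arena, the next allocation fails. A minimal sketch of the mechanism, assuming a hard-coded 512 MB pool like the one the error message implies (this is illustrative, not the actual chat.cpp code):

```cpp
// Sketch only: how ggml's fixed memory pool produces this error.
#include <cstdio>
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 512u * 1024 * 1024,  // 536870912 bytes, the "available" in the error
        /*.mem_buffer =*/ NULL,                // let ggml allocate the arena itself
    };
    struct ggml_context * ctx = ggml_init(params);

    // Every tensor for the growing chat context comes out of this one arena,
    // and nothing is freed between prompts. Eventually a request exceeds what
    // is left and ggml_new_tensor_impl reports "not enough space in the
    // context's memory pool". (Debug builds may abort instead of returning NULL.)
    for (int i = 0; ; ++i) {
        struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4 * 1024 * 1024);
        if (t == NULL) {
            printf("allocation %d failed: arena exhausted\n", i);
            break;
        }
    }

    ggml_free(ctx);
    return 0;
}
```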

Path-Seeker commented 1 year ago

Run it with the option -t and the number of threads you want. Yesterday I had the same issue with my 13700K, but after running it with 20 threads it's actually a lot faster.
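For example, something like this (assuming the -m model flag works as in llama.cpp; the filename is just a placeholder for whatever your weights file is called):

```sh
./chat -m ggml-alpaca-30b-q4.bin -t 20
```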

Terristen commented 1 year ago

I don't know what you mean by a CPU limit, but I have also had identical memory crashes with the error you listed. I'm running 64 GB of RAM, and during a session I'm seeing 90%+ memory utilization. On the CPU side, using 20 of my 24 available threads made 30B more usable. There's still delay, for sure.

Allocating more swap file space did not change the memory issue. (Though I didn't expect it to.)

I wonder if people with more base RAM can have longer sessions before the memory exit.

In the spirit of the OP's question: is there any way to run these models locally on GPUs instead of the CPU?

Patjwmiller commented 1 year ago

> In the spirit of the OP's question: is there any way to run these models locally on GPUs instead of the CPU?

As of right now, the only repo I know of that supports the GPU is: https://github.com/tloen/alpaca-lora

clover1980 commented 1 year ago

> Run it with the option -t and the number of threads you want. Yesterday I had the same issue with my 13700K, but after running it with 20 threads it's actually a lot faster.

Thanks, -t 20 helps with the speed; it now uses 81% of the CPU and is much faster in reaction and writing. (The CPU temp is now 80 degrees, but that's normal for a Xeon and my 8-pipe radiator.) 30B is gold; it's mocking OpenAI by telling me to contact their support about problems, and it gives quite interesting info about Microsoft's XiaoIce :)

The only thing left is the memory pool problem: on the 18th prompt it crashes with `ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536905968, available 536870912)`

30B's answer: "You can increase your memory pool by using a larger GPU."

My specs: Win 10 1903 x64, all defaults. ASRock X99, Intel Xeon 2.20 GHz, 14 cores / 28 threads, engineering sample from AliExpress, 128 GB RAM (4 × 32 GB), RTX 2070 Super, 2 × some 256 GB SSDs.

Path-Seeker commented 1 year ago

> Thanks, -t 20 helps with the speed; it now uses 81% of the CPU and is much faster in reaction and writing. (The CPU temp is now 80 degrees, but that's normal for a Xeon and my 8-pipe radiator.) 30B is gold; it's mocking OpenAI by telling me to contact their support about problems, and it gives quite interesting info about Microsoft's XiaoIce :)

You can specify any number of threads. I have a 12-core CPU with 24 threads, so I'm using -t 20 (to leave 4 threads for the system and other apps). With your 28 available threads you can probably use 24 threads for Alpaca (or even 28, but I haven't tried the maximum amount, so I'm not sure what will happen), which should increase the performance (and the CPU load).
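As a small illustration of that "leave a few threads for the OS" rule of thumb, a program can derive a default like this (this is not alpaca.cpp's actual logic, just a sketch):

```cpp
// Illustrative only: pick a thread count, leaving headroom for the OS.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    // Logical threads the OS reports (e.g. 28 on a 14-core Xeon with SMT).
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 4;  // the call may return 0 if it can't tell; fall back

    // Rule of thumb from this thread: reserve ~4 threads for the system.
    unsigned n_threads = std::max(1u, hw > 4 ? hw - 4 : 1u);
    printf("hardware threads: %u, suggested -t %u\n", hw, n_threads);
    return 0;
}
```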

> The only thing left is the memory pool problem: on the 18th prompt it crashes with `ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536905968, available 536870912)`

I faced this problem as well but never dug deeper into it. I'm not sure, but these pull requests could possibly be a fix: https://github.com/antimatter15/alpaca.cpp/pull/142 https://github.com/antimatter15/alpaca.cpp/pull/126
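I haven't verified what those PRs actually change, but the error itself points at the fixed arena handed to ggml_init, so a fix in that spirit would simply reserve a larger pool. A purely hypothetical sketch (the 2x factor is an assumption, not the PRs' actual code):

```cpp
// Hypothetical sketch of enlarging the context pool; not the actual PR code.
struct ggml_init_params params = {
    /*.mem_size   =*/ 2ull * 512 * 1024 * 1024,  // double the old 512 MB arena
    /*.mem_buffer =*/ NULL,
};
struct ggml_context * ctx = ggml_init(params);
```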

jeffwadsworth commented 1 year ago

Hmm. For the GPU version mentioned above, wouldn't you have to have one of those A100s with 80 GB to utilize the 30B model? If not, that would be incredible. The speed using my 12-core AMD is fine, though. It has a short memory, but its reasoning skills and storytelling are amazing. It doesn't hallucinate as much as the 7B and 13B do, which is nice. It also has a keen sense of humor with the smiley faces if asked a quirky question. Can't wait to see what they come up with in a year.

a904guy commented 1 year ago

From llama.cpp

Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| model | original size | quantized size (4-bit) |
|-------|---------------|------------------------|
| 7B    | 13 GB         | 3.9 GB                 |
| 13B   | 24 GB         | 7.8 GB                 |
| 30B   | 60 GB         | 19.5 GB                |
| 65B   | 120 GB        | 38.5 GB                |
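Those quantized sizes roughly line up with how ggml's q4_0 format packs weights: each block of 32 weights is stored as 16 bytes of 4-bit values plus a 4-byte float scale, i.e. 20 bytes per 32 weights (0.625 bytes per weight). A quick sanity check (parameter counts here are the nominal 7/13/30/65 billion, so results are approximate; real files differ slightly because some tensors stay unquantized and the true parameter counts differ):

```cpp
// Rough sanity check of the 4-bit sizes in the table above.
// Assumes ggml q4_0 blocks: 32 weights -> 16 bytes of nibbles + 4-byte scale.
#include <cstdio>

int main() {
    const double bytes_per_weight = 20.0 / 32.0;  // 0.625 bytes per weight
    const double params_b[] = {7, 13, 30, 65};    // nominal parameter counts, billions
    for (double p : params_b) {
        double gb = p * 1e9 * bytes_per_weight / (1024.0 * 1024.0 * 1024.0);
        printf("%4.0fB: ~%.1f GB quantized\n", p, gb);  // e.g. 30B: ~17.5 GB
    }
    return 0;
}
```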