SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

How to accelerate running speed in CPU environment? #562

Open dfengpo opened 6 months ago

dfengpo commented 6 months ago

How to accelerate running speed in CPU environment?

I am running on 32 CPUs and 64 GiB of memory, and it is very slow. However, CPU utilization never reaches 100%.

AsakusaRinne commented 6 months ago

Are you using 32 CPU or a CPU with 32 cores?

dfengpo commented 6 months ago

32 CPU

My server and system: 32 vCPUs | 64 GiB | c6s.8xlarge.2 | Ubuntu 22.04 server 64-bit

AsakusaRinne commented 6 months ago

What model are you using, and could you please provide performance data, including latency and generation speed?

martindevans commented 6 months ago

What type of memory does your server have? Language models are usually limited mostly by memory bandwidth.

zsogitbe commented 6 months ago

That looks like a shared single CPU with 4 cores and 8 threads to me (= 32 vCPU). That is not enough for LLMs. Utilization will also depend on some settings of the virtual environment and, of course, on the type of CPU... If you want to stay on CPU (for example, for budget reasons), then try more vCPUs.

Here is how to calculate it: n threads x m cores x k CPUs = vCPU. For example, 16 cores with 32 threads and 1 CPU = 512 vCPU (a decent setup for LLMs). With 32 it is normal that it is slow.

zsogitbe commented 6 months ago

I have just run into another problem related to your question. When using the CPU, the number of spawned threads is about 50% of the available threads. The reason is probably a bug in the automatic thread-count selection when Threads = null or 0 in ModelParams. If you set the number of threads manually, then all threads are used (CPU utilization can reach 100%). See ModelParams for more details.
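The workaround described above can be sketched as follows, assuming the ModelParams type from LLamaSharp; the model path and thread count are placeholder values:

```csharp
using LLama.Common;

// Sketch: set Threads explicitly rather than leaving it null/0,
// so the automatic default is not used.
// "model.gguf" and 18 are placeholders for your own model path
// and physical core count.
var parameters = new ModelParams("model.gguf")
{
    Threads = 18
};
```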

AsakusaRinne commented 6 months ago

@zsogitbe That's a good catch! Did you set the thread count to the same as the number of vCPUs? Maybe there's a way to detect the best configuration automatically?

martindevans commented 6 months ago

That's actually intentional; it is an approximation copied from llama.cpp. CPU utilisation isn't the right thing to measure; you need to look at tokens per second. If you're memory bound, adding more threads will just waste more CPU time!
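The metric suggested here can be sketched generically with the standard library; SimulateToken below is a stand-in (an assumption, not the LLamaSharp API) for a real token-generation step:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Sketch: measure throughput in tokens/second instead of CPU utilisation.
// SimulateToken is a placeholder for one step of real model inference.
static void SimulateToken() => Thread.Sleep(1);

var sw = Stopwatch.StartNew();
int tokens = 0;
while (tokens < 100)
{
    SimulateToken();
    tokens++;
}
sw.Stop();
double tokensPerSecond = tokens / sw.Elapsed.TotalSeconds;
Console.WriteLine($"{tokensPerSecond:F1} tokens/s");
```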

zsogitbe commented 6 months ago

Unfortunately, that is not good. You cannot be memory bound with today's computers; 20 GB of memory is already enough, and most machines have 64 GB. Or you would need to test for the amount of memory... It is a bug.

martindevans commented 6 months ago

Sorry, by "memory bound" I didn't mean quantity; it would have been more correct to say memory-bandwidth bound. That's usually the limiting factor for LLMs.

zsogitbe commented 6 months ago

No problem. To be sure, I have tested your assumption. I used the default Threads = null setting and then set Threads to the maximum number of cores on the computer (18 cores). The second setup was twice as fast as the first one, which used only 9 cores. It is clear that the default number of Threads is not set well (bug). (screenshot: Screenshot 2024-03-09 152729)

martindevans commented 6 months ago

The problem is that this is extremely hardware dependent. For example, on my own PC (16 physical cores with hyperthreading, so 32 logical cores):

threads  time
1        72s
4        19s
8        11s
16       11s
32       14s

As you can see, it scales a bit as you add more threads, hits a sweet spot at around half the core count, and then actually gets slower past that point. It's still completely maxing out all the cores, so it's wasting a massive amount of cycles.

The current guesstimate (and it definitely is just that, this is why it's configurable!) we're using for the default comes from here.

AsakusaRinne commented 6 months ago

Thanks for the clarification, Martin. I think a method to automatically detect the best configuration could be useful, though it should not be of high priority for us now.

martindevans commented 6 months ago

For reference (if anyone wants to modify it) the default is implemented here: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Extensions/IContextParamsExtensions.cs#L53

zsogitbe commented 6 months ago

I am sorry, Martin, but we do not agree on this. Your results are strange for 8 and 16! There must be some strange constraint on your PC. Going beyond 16 makes no sense; don't do it! The link you provided actually says the same thing as me (maybe they changed it recently): "-t N, --threads N: Set the number of threads to use during generation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has..." The same applies for batch processing...

martindevans commented 6 months ago

To be clear I have 32 logical cores (i.e. Environment.ProcessorCount == 32), so that's why I tested all the way to 32 (I'm using a Ryzen 7950X).

For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has

(emphasis mine)

Note that Environment.ProcessorCount is logical cores, not physical cores.

Unfortunately as far as I'm aware dotnet doesn't have a way to get physical cores directly, so the Environment.ProcessorCount / 2 default is just a best guess at the number of physical cores (most CPUs have 2x hyperthreading, so divide by 2).
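That guess can be sketched with nothing but the standard library; the 2x SMT factor is an assumption that does not hold on every CPU:

```csharp
using System;

// Environment.ProcessorCount reports logical cores. Assuming 2-way
// hyperthreading/SMT, halving it approximates the physical core count;
// Math.Max guards against single-logical-core machines.
int logicalCores = Environment.ProcessorCount;
int assumedPhysicalCores = Math.Max(1, logicalCores / 2);
Console.WriteLine($"{logicalCores} logical -> {assumedPhysicalCores} assumed physical");
```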

I did some searching around, but it looks like it's not easy to get physical core count (especially once you start thinking about things like processor affinity).

zsogitbe commented 6 months ago

In that case we have found the problem and actually we were agreeing all the time about the issue :).

Here is how to do it:

// Requires a reference to System.Management (Windows only)
using System.Management;

ManagementObjectSearcher searcher = new ManagementObjectSearcher("SELECT * FROM Win32_Processor");
List<int> coresCPUs = new List<int>();
foreach (ManagementObject mo in searcher.Get())
{
    coresCPUs.Add(Convert.ToInt32(mo.Properties["NumberOfCores"].Value));
}

The list is only needed if you expect more than one processor on the system.

martindevans commented 6 months ago

That's interesting! Definitely looks like it could be close to what we want.

Do you know how this behaves on Linux/MacOS (i.e. does it run but return no results, or does it simply not compile)?

zsogitbe commented 6 months ago

I have no idea; I am not using Linux/MacOS. But even if it does not work directly, the .NET framework should have something similar there too, I guess. In any case, you have it for Windows already.
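For reference, a hedged cross-platform sketch that avoids third-party packages: on Linux it counts distinct (physical id, core id) pairs in /proc/cpuinfo, and elsewhere (or when those fields are absent) it falls back to the logical-cores/2 guess discussed above. This is a sketch under those assumptions, not a definitive implementation.

```csharp
using System;
using System.IO;
using System.Linq;

// Estimate the physical core count without System.Management.
static int EstimatePhysicalCores()
{
    const string cpuInfoPath = "/proc/cpuinfo";
    if (File.Exists(cpuInfoPath))
    {
        // Processor entries are separated by blank lines; each x86 entry
        // carries "physical id" and "core id" fields identifying the core.
        var blocks = File.ReadAllText(cpuInfoPath)
            .Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);
        int cores = blocks
            .Select(block =>
            {
                string Get(string key) => block.Split('\n')
                    .FirstOrDefault(l => l.StartsWith(key))?
                    .Split(':').Last().Trim() ?? "";
                return (Socket: Get("physical id"), Core: Get("core id"));
            })
            .Where(id => id.Core != "") // e.g. some ARM kernels omit these fields
            .Distinct()
            .Count();
        if (cores > 0) return cores;
    }
    // Fallback: assume 2-way SMT, as the library default does.
    return Math.Max(1, Environment.ProcessorCount / 2);
}

Console.WriteLine(EstimatePhysicalCores());
```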