Open dfengpo opened 6 months ago
Are you using 32 CPU or a CPU with 32 cores?
32 CPU
My server and system: 32 vCPUs | 64 GiB | c6s.8xlarge.2 | Ubuntu 22.04 server 64-bit
What model are you using, and could you please provide performance data, including latency and generation speed?
What type of memory does your server have? Language models are usually limited mostly by memory bandwidth.
That looks to me like a shared single CPU with 4 cores and 8 threads (= 32 vCPU). This is not enough for LLMs. The utilization will also depend on some settings of the virtual environment and, of course, the type of CPU... If you want to stay on CPU (for example, because of budget), then try more vCPUs.
Here is how to calculate: n threads × m cores × k CPUs = vCPU. For example, 32 threads × 16 cores × 1 CPU = 512 vCPU (a decent setup for LLMs). With 32 it is normal that it is slow.
I have just experienced another problem related to your question. When using the CPU, the number of spawned threads is about 50% of the available threads. The reason is probably a bug in the automatic setting of the thread count when Threads is null or 0 in ModelParams. If you set the number of threads manually, then all threads are allocated (CPU utilization can reach 100%). See ModelParams for more details.
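A minimal sketch of the workaround described above, assuming LLamaSharp's `ModelParams` type with a `Threads` property as mentioned in this thread (check your LLamaSharp version for the exact property name and type):

```csharp
// Workaround sketch: set Threads explicitly instead of leaving it null/0.
// The model path is a placeholder; the Threads value here is just the usual
// "physical cores" guess (ProcessorCount counts logical cores).
var parameters = new ModelParams("path/to/model.gguf")
{
    Threads = (uint)Math.Max(Environment.ProcessorCount / 2, 1)
};
```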
@zsogitbe That's a good catch! Did you set the thread count to the number of vCPUs? Maybe there's a way to detect the best configuration for it?
That's actually intentional, an approximation copied from llama.cpp. CPU utilisation isn't the right thing to measure; you need to look at tokens per second. If you're memory bound, adding more threads will just waste more CPU time!
Unfortunately, that is not right. You cannot be memory bound with today's computers. 20 GB of memory is already enough, and most machines have 64 GB. Or you would need to test for the amount of memory... It is a bug.
Sorry, by "memory bound" I didn't mean quantity; it would have been more correct to say memory-bandwidth bound. That's usually the limiting factor for LLMs.
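To make "memory-bandwidth bound" concrete, here is a rough back-of-envelope calculation (all numbers are illustrative assumptions, not measurements): each generated token has to stream roughly the entire set of model weights through memory once, so bandwidth divided by model size gives an upper bound on tokens per second, no matter how many cores you add.

```csharp
// Illustrative upper bound on generation speed (assumed numbers):
double modelSizeGB = 4.0;    // e.g. a ~7B-parameter model with 4-bit quantization (assumption)
double bandwidthGBs = 50.0;  // e.g. dual-channel DDR4-3200 (assumption)
double maxTokensPerSec = bandwidthGBs / modelSizeGB;
Console.WriteLine(maxTokensPerSec); // 12.5
```

Once enough threads are running to saturate that bandwidth, extra threads only burn CPU cycles waiting on memory, which is consistent with the benchmark table below.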
No problem. To be sure, I have tested your assumption. I used the default Threads=null setting and then set Threads to the maximum number of cores on the computer (18 cores). The second setup was twice as fast as the first one, which used only 9 cores. It is clear that the default number of Threads is not set well (a bug).
The problem is that this is extremely hardware dependent. For example, on my own PC (16 physical cores with hyperthreading, so 32 logical cores):
| threads | time |
|---|---|
| 1 | 72s |
| 4 | 19s |
| 8 | 11s |
| 16 | 11s |
| 32 | 14s |
As you can see it scales a bit as you add more threads, hits a sweet spot at around half the core count, and then actually gets slower past that point. It's still completely maxing out all the cores, so it's wasting a massive amount of cycles.
The current guesstimate (and it definitely is just that, this is why it's configurable!) we're using for the default comes from here.
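A sketch of how a sweep like the table above could be produced. The workload here is a dummy CPU-bound sum standing in for token generation (which you would do via your actual generation call instead); `TimeWorkload` is an illustrative helper, not part of any library:

```csharp
using System.Diagnostics;

// Time a fixed CPU-bound workload at a given degree of parallelism.
static double TimeWorkload(int threads)
{
    var sw = Stopwatch.StartNew();
    var opts = new ParallelOptions { MaxDegreeOfParallelism = threads };
    Parallel.For(0, 1_000, opts, _ =>
    {
        double x = 0;
        for (int i = 0; i < 100_000; i++) x += Math.Sqrt(i);
    });
    sw.Stop();
    return sw.Elapsed.TotalSeconds;
}

foreach (var t in new[] { 1, 4, 8, 16, 32 })
    Console.WriteLine($"{t} threads: {TimeWorkload(t):F2}s");
```

For a real model you would measure tokens per second rather than wall time of a synthetic loop, since that is the metric that actually matters here.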
Thanks for the clarification Martin. I think a method to automatically detect best configuration may be something useful, though it should not be of high priority for us now.
For reference (if anyone wants to modify it) the default is implemented here: https://github.com/SciSharp/LLamaSharp/blob/master/LLama/Extensions/IContextParamsExtensions.cs#L53
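For readers who don't want to follow the link, the default boils down to roughly this (a paraphrase of the linked logic, not the exact source):

```csharp
// Approximate physical cores by halving the logical core count
// (assumes 2-way SMT/hyperthreading), never going below 1 thread.
static int DefaultThreadCount() => Math.Max(Environment.ProcessorCount / 2, 1);

Console.WriteLine(DefaultThreadCount());
```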
I am sorry Martin, but we do not agree on this. Your results are strange for 8 and 16! There must be some strange constraint on your PC. More than 16 makes no sense; don't do it! The link you provided actually says the same thing as I did (maybe they changed it recently): "-t N, --threads N: Set the number of threads to use during generation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has..." The same applies for batch processing...
To be clear, I have 32 logical cores (i.e. `Environment.ProcessorCount == 32`), so that's why I tested all the way to 32 (I'm using a Ryzen 7950X).
> For optimal performance, it is recommended to set this value to the number of **physical** CPU cores your system has

(emphasis mine)
Note that `Environment.ProcessorCount` is logical cores, not physical cores. Unfortunately, as far as I'm aware, dotnet doesn't have a way to get physical cores directly, so the `Environment.ProcessorCount / 2` default is just a best guess at the number of physical cores (most CPUs have 2x hyperthreading, so divide by 2).
I did some searching around, but it looks like it's not easy to get physical core count (especially once you start thinking about things like processor affinity).
In that case we have found the problem, and we were actually agreeing about the issue all along :).
Here is how to do it:

```csharp
using System.Management; // on modern .NET this requires the System.Management NuGet package

// Query WMI for each physical processor and read its physical core count.
var searcher = new ManagementObjectSearcher("SELECT * FROM Win32_Processor");
var coresCPUs = new List<int>();
foreach (ManagementObject mo in searcher.Get())
{
    coresCPUs.Add(Convert.ToInt32(mo.Properties["NumberOfCores"].Value));
}
```
The list is only needed if you expect more than one processor on the system.
That's interesting! Definitely looks like it could be close to what we want.
Do you know how this behaves on Linux/MacOS (i.e. does it run but return no results, or does it simply not compile)?
I have no idea; I'm not using Linux/MacOS. But even if it does not work directly, the .NET framework should have something similar there too, I guess. In any case, you have it for Windows already.
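On Linux, one common approach (a sketch under the assumption that `/proc/cpuinfo` is available; this is not something LLamaSharp does, and macOS would need a different mechanism, e.g. the `hw.physicalcpu` sysctl) is to count distinct (physical id, core id) pairs:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Count physical cores by parsing /proc/cpuinfo: each logical CPU block lists
// its package ("physical id") and core ("core id"); hyperthread siblings share
// the same pair, so the number of distinct pairs is the physical core count.
static int PhysicalCoreCount(string cpuinfoPath = "/proc/cpuinfo")
{
    var cores = new HashSet<(string Package, string Core)>();
    string physicalId = "0"; // some single-socket systems omit "physical id"
    foreach (var line in File.ReadLines(cpuinfoPath))
    {
        var parts = line.Split(':', 2);
        if (parts.Length != 2) continue;
        var key = parts[0].Trim();
        var value = parts[1].Trim();
        if (key == "physical id") physicalId = value;
        else if (key == "core id") cores.Add((physicalId, value));
    }
    // Fall back to the logical count if the file listed no core ids at all.
    return cores.Count > 0 ? cores.Count : Environment.ProcessorCount;
}
```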
How to accelerate running speed in a CPU environment?
I am running on 32 CPUs and 64 GiB of memory, and it is very slow. But the CPU utilization rate has not reached 100%.