b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

How about the multi-core support of stand-alone dual-socket motherboards? #19

Open win10ogod opened 3 months ago

win10ogod commented 3 months ago

How about the multi-core support of stand-alone dual-socket motherboards?

b4rtaz commented 3 months ago

Hello @win10ogod! I've never worked with dual-socket motherboards. How are these processors visible to the system? If the system can synchronize the threads of a single application across two processors without any special adjustments, it should work with no problem. If not, you can create two pods/containers, each assigned to one processor, and connect the Distributed Llama instances inside these containers via a local network.
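Something along these lines could pin each container to one socket (a rough sketch only; the CPU ranges, image name, and dllama arguments below are assumptions, so adjust them to your topology and the current README):

```sh
# Sketch: pin each container's threads AND memory allocations to one NUMA
# node. CPU ranges (0-13 / 14-27), the "distributed-llama" image name and
# the dllama flags are placeholders; check `numactl --hardware` first.

# Socket 1: worker
docker run -d --network host \
  --cpuset-cpus="14-27" --cpuset-mems="1" \
  distributed-llama ./dllama worker --port 9998 --nthreads 14

# Socket 0: root node, reaching the worker over the loopback network
docker run -it --network host \
  --cpuset-cpus="0-13" --cpuset-mems="0" \
  distributed-llama ./dllama inference --model dllama_model.m \
    --tokenizer dllama_tokenizer.t --workers 127.0.0.1:9998 --nthreads 14
```

The key detail is `--cpuset-mems`: pinning CPUs alone is not enough, since allocations could still land on the remote node's memory.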

intari commented 1 month ago

@b4rtaz Dual-socket motherboards usually present the CPUs as 2 NUMA nodes (or more, if the CPUs themselves are high-core-count parts). This can sometimes be configured in the BIOS (look for Cluster-on-Die and similar options). The only catch is that it's...unwise... to access memory attached to the "other" NUMA node if it can be avoided: because of how the memory controllers work, it is slower than direct access to local memory, as in "10 GB/s instead of 100 GB/s" in the worst cases. The same thing can matter for really high-core-count CPUs on single-socket boards (like some Threadrippers), since they also expose several NUMA nodes.
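To see how this looks on a given box, `numactl --hardware` prints the node layout and a distance matrix; the larger remote distances correspond to exactly the penalty described above. A hypothetical dual-socket example (your output will differ):

```sh
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 ... 13 28 29 ... 41
# node 0 size: 131072 MB
# node 1 cpus: 14 15 16 ... 27 42 43 ... 55
# node 1 size: 131072 MB
# node distances:
# node   0   1
#   0:  10  21
#   1:  21  10
```

In this made-up table, a distance of 21 vs 10 means a remote access costs roughly twice as much as a local one.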

I do have a dual-socket MB with 2x Xeon E5-2680 v4 (28 cores / 56 threads); I got the MB and CPUs rather cheap from AliExpress. I use it as a Proxmox node with 220 GB of RAM (it will be 256 GB once I replace the 8th DIMM).

DifferentialityDevelopment commented 1 month ago

I'm actually curious here: say you have two VMs on the same machine, each with a dedicated CPU. Would the memory also be distributed so that each VM gets the memory it has the fastest access to?

intari commented 1 month ago

@DifferentialityDevelopment The hypervisor should take this into account. If the size of the VM (memory + number of cores) fits within one NUMA node, there is no problem. If it doesn't fit into one NUMA node, it's usually better to ask the hypervisor to pass the NUMA layout through to the VM and let the guest sort it out. How you populate the DIMM slots on such a board also matters if you don't fill all of them.

If you ignore all of this, it will still work, but it may run slower than it could.
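For a bare-metal process (no hypervisor), the same locality rule can be applied directly with numactl. A minimal sketch, assuming a dllama worker per socket as discussed earlier in this thread (flags are placeholders, see the project README for the exact invocation):

```sh
# Bind both the threads and the memory allocations of each worker to its
# own NUMA node, so neither process pays the remote-access penalty.
numactl --cpunodebind=0 --membind=0 ./dllama worker --port 9998 --nthreads 14 &
numactl --cpunodebind=1 --membind=1 ./dllama worker --port 9999 --nthreads 14 &
```

`--membind` is the important part: it forces allocations onto the local node instead of letting them spill to the remote one.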