Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: NUMA support on Windows #502

Open BernoulliBox opened 2 months ago

BernoulliBox commented 2 months ago

Contact Details

git@nic.cix.co.uk

What happened?

Improved NUMA support on Windows

When using Llamafile on a NUMA (Non-Uniform Memory Access) Windows system, it is crucial for optimal performance that users can control which node Llamafile loads the model into, especially when loading multiple models simultaneously. Currently, Llamafile appears to override the /NODE option specified in the Windows start command, always filling the memory of node 0 before utilizing node 1, regardless of the specified node. There is no problem assigning CPU cores from other nodes; the problem is limited to loading the model into the RAM of a specified node.

Current Behavior

Llamafile ignores the /NODE option specified in the Windows start command. Memory is always filled on node 0 first, then node 1, regardless of the specified node. This behavior causes performance issues when running multiple models on NUMA systems.
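For what it's worth, the misplaced allocation can be confirmed from inside the process with QueryWorkingSetEx, which reports the NUMA node backing each resident page. A minimal diagnostic sketch (not Llamafile code; buf/len are placeholders for the mapped model buffer):

```c
/*
 * Sketch: print the NUMA node backing each resident page of a buffer.
 * buf/len are placeholders for the model buffer. Build with MSVC or
 * MinGW; link psapi.lib / -lpsapi for QueryWorkingSetEx.
 */
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

static void report_numa_placement(void *buf, size_t len) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (size_t off = 0; off < len; off += si.dwPageSize) {
        PSAPI_WORKING_SET_EX_INFORMATION wsx = {0};
        wsx.VirtualAddress = (char *)buf + off;
        /* A page reports Valid only once it is resident (touched). */
        if (QueryWorkingSetEx(GetCurrentProcess(), &wsx, sizeof(wsx)) &&
            wsx.VirtualAttributes.Valid) {
            printf("offset %zu -> NUMA node %u\n", off,
                   (unsigned)wsx.VirtualAttributes.Node);
        }
    }
}
```

On the affected system, this should show every page of both instances on node 0, regardless of the /NODE value passed to start.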

Expected Behavior

Llamafile should respect the /NODE option specified in the Windows start command. Memory allocation should prioritize the specified node. This would allow users to effectively distribute model loads across NUMA nodes for optimal performance.

Example

Currently, when using these Windows commands:

start /NODE 0 llamafile.exe ... llm_1.gguf
start /NODE 1 llamafile.exe ... llm_2.gguf

Both instances will load into node 0's memory, which is not expected.

Impact

This issue significantly impacts performance when running multiple models on NUMA systems. It prevents full utilization of available cores due to the relatively slow interconnect between nodes when memory is not local to the executing cores.
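The interconnect penalty is straightforward to demonstrate. A rough microbenchmark along these lines pins the current thread to node 0's CPUs, then streams through a buffer allocated on each node in turn (a minimal sketch; the 256 MiB buffer size is an arbitrary choice, and the legacy GetNumaNodeProcessorMask API assumes at most 64 logical processors):

```c
#include <windows.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (256ull * 1024 * 1024) /* arbitrary 256 MiB test buffer */

static double pass_seconds(void *buf) {
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    memset(buf, 1, BUF_SIZE); /* stream once through the buffer */
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart;
}

int main(void) {
    ULONGLONG mask0 = 0;
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);
    GetNumaNodeProcessorMask(0, &mask0);  /* CPUs belonging to node 0 */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask0);

    for (ULONG node = 0; node <= highest; node++) {
        void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, BUF_SIZE,
                                       MEM_RESERVE | MEM_COMMIT,
                                       PAGE_READWRITE, node);
        if (!buf) continue;
        pass_seconds(buf); /* first touch commits the pages */
        printf("memory on node %lu: %.3f s per pass\n", node,
               pass_seconds(buf));
        VirtualFree(buf, 0, MEM_RELEASE);
    }
    return 0;
}
```

On a dual-socket Broadwell system like the one described below, the remote-node pass would be expected to run noticeably slower than the local one.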

Proposed Solution

Either:

1. Modify Llamafile to respect the Windows start command's /NODE option, or
2. Implement a command-line option for Llamafile (e.g., --numa-node) to specify the preferred NUMA node directly.

In either case, memory allocation should prioritize the specified node before utilizing other nodes. A sketch of option (2) follows.
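One way option (2) could look, as a hedged sketch rather than a definitive implementation: the --numa-node flag is the hypothetical option proposed above, and alloc_model_buffer is an illustrative helper, not an existing Llamafile function. VirtualAllocExNuma passes a preferred node to the kernel; physical pages are taken from that node when possible, with fallback to other nodes if it is exhausted.

```c
/*
 * Sketch of option (2), assuming a hypothetical --numa-node flag.
 * Allocates the model buffer with an explicit preferred NUMA node.
 */
#include <windows.h>
#include <stdio.h>

/* numa_node would come from parsing the proposed --numa-node flag. */
static void *alloc_model_buffer(size_t size, DWORD numa_node) {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest) || numa_node > highest) {
        fprintf(stderr, "--numa-node %lu out of range (0..%lu)\n",
                (unsigned long)numa_node, (unsigned long)highest);
        return NULL;
    }
    return VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                              MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE, numa_node);
}
```

If the model is memory-mapped from the .gguf file rather than read into an anonymous buffer, the analogous NUMA-aware Windows APIs would be CreateFileMappingNuma and MapViewOfFileExNuma.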

Additional Context

This change would be a significant improvement for users running multiple LLM instances on multi-socket workstations or servers running Windows, enabling better distribution of workload and fuller utilization of hardware.

It aligns with Llamafile's goal of providing efficient, flexible LLM deployment options.

Environment

OS: Windows 10 LTSC
Llamafile version: 0.8.9
Hardware: Dual-socket workstation with Intel Xeon (Broadwell) processors, 256 GB RAM per node

Version

Llamafile version: 0.8.9

What operating system are you seeing the problem on?

No response

Relevant log output

No response