ggerganov / llama.cpp

LLM inference in C/C++
MIT License
61.35k stars 8.77k forks

OpenSHMEM support #4570

Closed ct-clmsn closed 3 months ago

ct-clmsn commented 6 months ago

Feature Description

Add OpenSHMEM support for distributed memory deployments. The implementation can be found in #4571.

Motivation

OpenSHMEM implements a Partitioned Global Address Space (PGAS) across a cluster of machines. Its cluster abstraction is the Parallel Random Access Machine (PRAM), so applications are written in the Single-Program-Multiple-Data (SPMD) style, similar to MPI.

OpenSHMEM presents a cluster as a shared-memory machine, so distributed communications operate like memory copies (memcpy). The receiver does not get a "notification" that a communication event has occurred; senders and receivers can "put" to and "get" from remote memory at will. This one-sided communication model tends to yield improved performance for certain applications, subject to the underlying hardware, software, and communication protocol layers.
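The memcpy-like behaviour described above can be illustrated with a small single-process sketch. This is not OpenSHMEM code: the names `sim_heap`, `sim_put`, `sim_get`, `SIM_NPES`, and `SIM_HEAP_BYTES` are hypothetical, and each PE's symmetric memory is modelled as a plain array. In a real program the analogous routines would be `shmem_putmem` and `shmem_getmem`, which additionally need ordering and synchronization (for example `shmem_quiet` or a barrier) before the target can safely read the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_NPES 4         /* hypothetical number of simulated PEs */
#define SIM_HEAP_BYTES 64  /* hypothetical symmetric heap size per PE */

/* One simulated symmetric heap per PE; same size and layout on every PE. */
static unsigned char sim_heap[SIM_NPES][SIM_HEAP_BYTES];

/* One-sided write: copy len bytes from src into target_pe's heap at offset.
 * The target PE executes nothing and receives no notification. */
static void sim_put(int target_pe, size_t offset, const void *src, size_t len) {
    assert(target_pe >= 0 && target_pe < SIM_NPES);
    assert(offset <= SIM_HEAP_BYTES && len <= SIM_HEAP_BYTES - offset);
    memcpy(&sim_heap[target_pe][offset], src, len);
}

/* One-sided read: copy len bytes out of source_pe's heap at offset into dst. */
static void sim_get(void *dst, int source_pe, size_t offset, size_t len) {
    assert(source_pe >= 0 && source_pe < SIM_NPES);
    assert(offset <= SIM_HEAP_BYTES && len <= SIM_HEAP_BYTES - offset);
    memcpy(dst, &sim_heap[source_pe][offset], len);
}
```

Because the same offset is valid on every PE, the caller only needs a PE number and an offset to address remote memory; no matching receive call exists on the other side.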

OpenSHMEM operates best when the communication protocol is "fire and forget" (similar to UDP) and on systems with remote-direct-memory-access (RDMA) enabled network interface cards (NICs). It can also work over a commodity Ethernet cluster, or on a single machine using a shared-memory backend.

Inferencing with OpenSHMEM will not necessarily yield significant strong scaling effects. This is a byproduct of the sequential nature of inferencing.

OpenSHMEM lets you run models larger than would fit into the memory (RAM) of a single machine by using a cluster configured as a parallel machine. Communications can also potentially be faster over both commodity Ethernet and RDMA NICs.

Possible Implementation

The implementation would allocate a slice of symmetric memory from the OpenSHMEM symmetric heap to use as a communication buffer between machines. The slice will be N bytes * the number of processing elements (PEs) available; a PE is similar to an MPI rank. This creates symmetric memory segments that a PE can use to write data to other PEs: each PE gets a segment of its own on all PEs. This choice simplifies the design and implementation and removes a potential point of contention. The design would follow the existing MPI support implementation, replacing the MPI calls with OpenSHMEM calls.
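The per-sender segment layout described above can be sketched in plain C as follows. This is a single-process illustration under stated assumptions, not the code from #4571: `NPES`, `SLOT_BYTES` (the "N bytes"), `comm_buf`, `slot_offset`, `send_to`, and `slot_from` are all hypothetical names and values. In an OpenSHMEM program the buffer would presumably come from `shmem_malloc` with a size of N times `shmem_n_pes()`, and the copy would be a `shmem_putmem` to the receiving PE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPES 4         /* hypothetical number of PEs */
#define SLOT_BYTES 16  /* hypothetical value for the "N bytes" per sender */

/* Each PE's communication buffer holds one SLOT_BYTES segment per sender. */
static unsigned char comm_buf[NPES][NPES * SLOT_BYTES];

/* Offset of sender_pe's private segment; identical on every receiver. */
static size_t slot_offset(int sender_pe) {
    assert(sender_pe >= 0 && sender_pe < NPES);
    return (size_t)sender_pe * SLOT_BYTES;
}

/* sender_pe writes only into its own segment on receiver_pe, so two
 * senders never touch the same bytes on a given receiver. */
static void send_to(int sender_pe, int receiver_pe, const void *data, size_t len) {
    assert(receiver_pe >= 0 && receiver_pe < NPES);
    assert(len <= SLOT_BYTES);
    memcpy(&comm_buf[receiver_pe][slot_offset(sender_pe)], data, len);
}

/* Receiver-side view of the segment owned by sender_pe. */
static const unsigned char *slot_from(int receiver_pe, int sender_pe) {
    assert(receiver_pe >= 0 && receiver_pe < NPES);
    return &comm_buf[receiver_pe][slot_offset(sender_pe)];
}
```

Since every sender owns a disjoint segment on each receiver, concurrent writers need no locking on the buffer itself, which is the contention point the design avoids. The cost is memory that grows linearly with the number of PEs.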

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.