The current Elasticsearch documentation states that throughput can be scaled by adding more allocations to a deployment, allowing more inference requests to run in parallel, and that all allocations assigned to a node share the same copy of the model in memory.
> Throughput can be scaled by adding more allocations to the deployment; it increases the number of inference requests that can be performed in parallel. All allocations assigned to a node share the same copy of the model in memory. The model is loaded into memory in a native process that encapsulates libtorch, which is the underlying machine learning library of PyTorch. The number of allocations setting affects the amount of model allocations across all the machine learning nodes. Model allocations are distributed in such a way that the total number of used threads does not exceed the allocated processors of a node.
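For context, the number of allocations is the setting changed through the update trained model deployment API; a request like the following (the model ID `my-model` is a placeholder) is the operation whose memory impact is described below:

```
POST _ml/trained_models/my-model/deployment/_update
{
  "number_of_allocations": 4
}
```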
However, in practice, each additional allocation requires extra memory, and this increase appears to be linear in the number of allocations. As a result, scaling up allocations eventually runs into the node's memory limit.
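The observed behavior can be summarized with a simple linear model: one shared copy of the model per node, plus a roughly constant overhead per allocation. The function name and the overhead figure below are illustrative assumptions, not measured Elasticsearch values:

```python
def deployment_memory_bytes(model_size_bytes: int,
                            n_allocations: int,
                            per_allocation_overhead_bytes: int) -> int:
    """Estimate per-node memory use for a deployment: a single shared
    model copy plus a constant per-allocation overhead (assumed linear)."""
    return model_size_bytes + n_allocations * per_allocation_overhead_bytes

# Illustrative numbers only: a 1 GB model with a hypothetical 50 MB
# overhead per allocation. Memory grows linearly with allocations,
# so it eventually exceeds the node's limit.
for n in (1, 4, 8):
    print(n, deployment_memory_bytes(1_000_000_000, n, 50_000_000))
```

Under this model, doubling the allocation count does not double memory use, but the per-allocation term grows without bound, which matches the reported behavior.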