elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Inference Request Batching #104932

Open jonathan-buttner opened 8 months ago

jonathan-buttner commented 8 months ago

Description

The Inference API currently sends each request individually as it is received. There is an adjustable limit on how many requests can be sent concurrently, set to 20 by default. To increase performance, we can batch requests together.

Batching requires that the individual requests be routed to the same destination and contain the same user-identifiable information: fields like the API key, organization (for OpenAI), and user (for OpenAI).
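
To illustrate, a minimal sketch of what such a grouping key might look like; the field names below are assumptions for illustration, not the actual implementation:

```java
// Hypothetical grouping key: requests are only eligible for the same batch when
// they share a destination and the same user-identifiable fields. A record gives
// value-based equals/hashCode, so it can key a Map<BatchGroupingKey, List<...>>.
record BatchGroupingKey(
    String service,       // e.g. "openai" or "cohere"
    String modelId,       // destination model/endpoint
    String apiKey,        // user-identifiable: API key
    String organization,  // user-identifiable: OpenAI organization (may be null)
    String user           // user-identifiable: OpenAI user (may be null)
) {}
```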

The batching service will buffer requests up to a configurable size. If the buffer reaches the configured size, the requests are batched and sent; if a configurable wait period elapses first, whatever is in the buffer is batched and sent.
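
As a rough sketch of that buffer-and-flush behavior (the `RequestBatcher` class and its names below are illustrative, not the actual Elasticsearch code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Minimal sketch of the buffer-and-flush behavior described above.
final class RequestBatcher<T> {
    private final int maxBatchSize;          // configurable batch size
    private final long waitPeriodMillis;     // configurable wait period
    private final Consumer<List<T>> sender;  // sends one batched request downstream
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    private final List<T> buffer = new ArrayList<>();
    private ScheduledFuture<?> pendingFlush;

    RequestBatcher(int maxBatchSize, long waitPeriodMillis, Consumer<List<T>> sender) {
        this.maxBatchSize = maxBatchSize;
        this.waitPeriodMillis = waitPeriodMillis;
        this.sender = sender;
    }

    synchronized void add(T request) {
        buffer.add(request);
        if (buffer.size() >= maxBatchSize) {
            flush();  // buffer reached the configured size: send now
        } else if (pendingFlush == null) {
            // first request in an empty buffer starts the wait period
            pendingFlush = scheduler.schedule(this::flush, waitPeriodMillis, TimeUnit.MILLISECONDS);
        }
    }

    private synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        if (pendingFlush != null) {
            pendingFlush.cancel(false);
            pendingFlush = null;
        }
        sender.accept(new ArrayList<>(buffer));  // batch everything currently buffered
        buffer.clear();
    }
}
```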

Configurations

Ideally users would be able to configure the batch size and wait period per inference configuration. This is useful because Cohere limits the batch size to 96 and OpenAI does not have a known limit. The current design prevents this because a batching service is created per inference service (one for OpenAI, one for Cohere, etc.).

The initial implementation will allow adjusting the batch size and wait period for all inference configurations using a cluster setting.
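
For illustration, such cluster settings could be defined roughly as follows; the setting keys and defaults here are placeholders, not the ones that will actually ship:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

// Sketch of dynamic cluster settings for the initial implementation.
public final class InferenceBatchingSettings {

    // maximum number of requests combined into a single upstream request
    public static final Setting<Integer> BATCH_SIZE = Setting.intSetting(
        "xpack.inference.batching.batch_size",  // placeholder key
        96,                                     // placeholder default
        1,                                      // minimum
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    // how long a partially filled buffer waits before it is flushed anyway
    public static final Setting<TimeValue> WAIT_PERIOD = Setting.timeSetting(
        "xpack.inference.batching.wait_period",  // placeholder key
        TimeValue.timeValueMillis(100),          // placeholder default
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    private InferenceBatchingSettings() {}
}
```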

A follow-up improvement will allow adjusting the batch size and wait period per inference service.

Only batch for ingest

The initial implementation will likely batch all requests regardless of whether they're for ingest or search. A follow-up improvement will add separate logic for search requests so they are sent immediately instead of waiting in the buffer.
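
A sketch of what that follow-up could look like, reusing the hypothetical `RequestBatcher` from above; the `InferenceDispatcher`, `InputType`, and `InferenceRequest` names are again made up for illustration:

```java
// Illustrative only: search requests skip the buffer, ingest requests are batched.
final class InferenceDispatcher {
    enum InputType { SEARCH, INGEST }
    record InferenceRequest(String input, InputType type) {}

    private final RequestBatcher<InferenceRequest> batcher;

    InferenceDispatcher(RequestBatcher<InferenceRequest> batcher) {
        this.batcher = batcher;
    }

    void route(InferenceRequest request) {
        if (request.type() == InputType.SEARCH) {
            sendNow(request);      // search is latency-sensitive: don't wait for a batch
        } else {
            batcher.add(request);  // ingest: buffer and send as part of a batch
        }
    }

    private void sendNow(InferenceRequest request) {
        // send a single-item request straight to the external service (omitted)
    }
}
```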

elasticsearchmachine commented 8 months ago

Pinging @elastic/ml-core (Team:ML)

jimczi commented 8 months ago

Thanks for sharing @jonathan-buttner. I agree that batching is necessary, but I wonder if it should be the sole responsibility of the inference service. Our main entry point for ingestion is the _bulk API, an API for batching. So we already have a way to group documents that could be batched together in the inference service. By allowing only individual requests in the inference service we're losing this information. Why not expose a bulk API in the inference service first? How a bulk request is split for optimal batch size and parallel execution would still be an implementation detail of the service, but bounded by the size of the bulk. We can also imagine optimising this model later by allowing batching of independent bulk requests if we realise that the natural batching of the ingestion path is not enough. The wait period proposed in this issue seems to be related to a lack of grouping on the client side; I don't think we have that problem.

jonathan-buttner commented 8 months ago

Thanks for your feedback, @jimczi!

Why not expose a bulk API in the inference service first?

I think the piece I'm misunderstanding here is how a new bulk API in the Inference service would work with the Inference Ingest Processor. My understanding is that the Inference Ingest Processor only receives one document at a time and then we forward that request to the Inference service. If we create a new bulk API in the Inference service, we'd need to leverage it within the Inference Ingest Processor. That would move the batching logic to the Ingest Processor, right?

I'm not opposed to moving it there but I just wanted to make sure I'm understanding the flow correctly.

jimczi commented 8 months ago

My understanding is that the Inference Ingest Processor only receives one document at a time and then we forward that request to the Inference service.

Right, I was thinking more of the integration with the new semantic_text field. There we control the entire bulk and can decide how to group documents before sending them to the inference service. With this new capability we won't need to leverage the ingest pipeline to perform the inference, so we should take this opportunity to design the optimal flow. Note that batching for the inference processor could also be implemented at the node level. That's what the enrich processor does and was proposed in https://github.com/elastic/elasticsearch/issues/103665. Batching client-side is also a way to ensure that we don't overload the inference service with independent requests within a single bulk.

jonathan-buttner commented 8 months ago

Right, I was thinking more of the integration with the new semantic_text field.

Ah I see. The inference API currently supports sending an array of input text. I wonder if that would satisfy what the semantic_text field would need for a bulk operation.

The current implementation (that accepts an array of input text) requires that the request specify whether it is for search or ingest. Do you think the semantic_text flow would want to batch together search requests and ingest requests? Or is there another piece that the array of input text doesn't satisfy?

If we did want to combine search and ingest requests, a new bulk inference API that accepts both could be useful. Internally, whether the bulk request would need to be split up would depend on the model. For example, Cohere would require separate requests for search and ingest, but I don't think that's the case for OpenAI.
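
To make that concrete, a rough sketch of the kind of provider-dependent split described here; the types and the provider check are assumptions for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: a mixed bulk of search and ingest inputs has to be split
// for providers like Cohere that take one input type per request, while a
// provider without that requirement could take the whole batch in one request.
final class BulkSplitter {

    enum InputType { SEARCH, INGEST }

    record InferenceInput(String text, InputType type) {}

    static List<List<InferenceInput>> split(String provider, List<InferenceInput> inputs) {
        if ("cohere".equals(provider)) {
            // one request per input type
            Map<InputType, List<InferenceInput>> byType =
                inputs.stream().collect(Collectors.groupingBy(InferenceInput::type));
            return List.copyOf(byType.values());
        }
        // assume the provider accepts a mixed batch in a single request
        return List.of(inputs);
    }
}
```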

Note that batching for the inference processor could also be implemented at the node level. That's what the enrich processor does and was proposed in https://github.com/elastic/elasticsearch/issues/103665.

Ah, thanks for that; I didn't realize that's what the enrich processor does.