Large documents need to be chunked; otherwise, any tokens beyond the model's input limit are dropped and never processed.
MVP for the default word-based chunking strategy:
- Use a sliding window approach.
- Chunk into 200-word windows.
- Prefer splitting on whitespace or newlines where possible.
- Fall back to splitting on character length alone if needed (no searching for whitespace/newlines required, which also sidesteps languages that don't use whitespace); a sketch follows this list.
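A minimal sketch of this default strategy, assuming Python and a plain list-of-strings output. The `overlap` parameter and the 1,000-character fallback size are assumptions; the MVP above only fixes the 200-word window.

```python
def chunk_words(text, chunk_size=200, overlap=0):
    """Split text into ~chunk_size-word chunks using a sliding window.

    `overlap` (words shared between consecutive chunks) is an assumed knob;
    the MVP only specifies the 200-word window size.
    """
    words = text.split()  # splits on any whitespace, including newlines
    if not words:
        return []
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def chunk_chars(text, chunk_size=1000):
    """Fallback: fixed character-length chunks for text with no usable whitespace."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk(text, word_chunk_size=200, char_chunk_size=1000):
    """Prefer word-based chunking; fall back to character-based splitting
    when the text contains no whitespace at all (e.g. some CJK-only text)."""
    if any(ch.isspace() for ch in text):
        return chunk_words(text, word_chunk_size)
    return chunk_chars(text, char_chunk_size)
```

Note that rejoining words on single spaces normalizes the original whitespace; whether chunks must preserve exact offsets into the source text is left open here.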
MVP for configurable chunking settings:
- Allow users to provide a chunking configuration when creating an inference endpoint, selecting between word-based and sentence-based chunking strategies.
- Fall back to the default word-based chunking strategy above when no configuration is provided; see the sketch after this list.
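A sketch of how the configurable layer might dispatch between strategies, continuing in Python. The `ChunkingSettings` shape and its field names are illustrative assumptions, not a real API surface, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkingSettings:
    """Hypothetical endpoint-level chunking configuration; field names
    are illustrative only."""
    strategy: str = "word"      # "word" or "sentence"
    max_chunk_size: int = 200   # words per chunk for the word strategy


def default_word_chunks(text: str, size: int = 200) -> List[str]:
    """Default word-based strategy (mirrors the earlier sliding-window sketch)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def apply_chunking(text: str, settings: Optional[ChunkingSettings]) -> List[str]:
    # No settings, or an unrecognized strategy: fall back to the default
    # word-based strategy, per the MVP behavior above.
    if settings is None or settings.strategy not in ("word", "sentence"):
        return default_word_chunks(text)
    if settings.strategy == "sentence":
        # Naive sentence-boundary split for illustration only; a production
        # implementation would use a proper sentence-boundary detector
        # (e.g. ICU's BreakIterator).
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return default_word_chunks(text, settings.max_chunk_size)
```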
Post-MVP features for configurable chunking settings:
- Allow users to enable chunking when performing an inference call through the Inference API.
- Allow users to configure chunking as part of their ingestion pipeline; hypothetical payload shapes follow this list.
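Purely as a rough illustration of where these post-MVP options might attach: the payloads below sketch a per-request inference call and an ingest pipeline processor. Every chunking-related field here (`chunking_settings`, `strategy`, `max_chunk_size`) is an assumption, not the actual API.

```python
# Hypothetical per-request chunking option on an Inference API call.
inference_request = {
    "input": "a very long document ...",
    "chunking_settings": {"strategy": "sentence"},  # assumed field
}

# Hypothetical ingest pipeline configuration with chunking enabled on an
# inference processor; the chunking-related fields are illustrative only.
ingest_pipeline = {
    "processors": [
        {
            "inference": {
                "model_id": "my-embedding-endpoint",
                "chunking_settings": {"strategy": "word", "max_chunk_size": 200},
            }
        }
    ],
}
```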