elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Storage of Inverted Index Lists in Elasticsearch #106742

Open yangyujieqqcom opened 3 months ago

yangyujieqqcom commented 3 months ago

An inverted index, which maps each term to the list of documents that contain it, is a common data structure. In Elasticsearch (ES), the document identifier (_id) is a string; when it is auto-generated, it is a UUID-style string, not a numeric type. However, at the underlying level, each document also has a numeric identifier (the integer doc ID that Lucene assigns within a segment), which can be used for optimization techniques such as compression algorithms, RBM algorithms, or bitset mechanisms.
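To make the term-to-document mapping concrete, here is a minimal sketch in Python. It is not how Elasticsearch or Lucene implements its index; the function name `build_inverted_index`, the whitespace tokenizer, and the use of list positions as integer doc IDs are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of integer doc IDs (hypothetical helper)."""
    index = defaultdict(list)
    # Assumption: the position of a document in the input list is its doc ID.
    for doc_id, text in enumerate(docs):
        # set() ensures a doc ID is recorded once per term, even if the term repeats.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = ["quick brown fox", "lazy brown dog", "quick dog"]
index = build_inverted_index(docs)
print(index["brown"])  # [0, 1]
print(index["quick"])  # [0, 2]
```

Because doc IDs are handed out in increasing order, every list comes out sorted, which is the property that compression and bitset techniques rely on.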

Specifically, when a document is indexed without an explicit _id, Elasticsearch generates one automatically. This identifier is a GUID that is represented as a URL-safe Base64 string, not as a number. The _uid field found in older versions was likewise a string (a combination of the document type and _id), not a numeric value. The numeric identifiers that compression algorithms operate on are the integer doc IDs that Lucene assigns to documents within each segment.

For optimization techniques like compression algorithms, RBM algorithms, or bitset mechanisms, documents need to be identified by numbers rather than by strings. For instance, once each document is represented by an integer, the sorted list of integers for a term can be compressed or stored as a bitset. This helps reduce storage space and improve query performance.
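As a rough illustration of why numeric identifiers help, the sketch below delta-encodes a sorted list of doc IDs, packs the gaps as variable-length bytes, and also shows a bitset intersection. This is a simplified stand-in, not the actual encoding used by Elasticsearch or Lucene, whose on-disk format is more elaborate; all function names and the sample doc IDs are hypothetical.

```python
def delta_encode(doc_ids):
    """Replace each sorted doc ID with its gap from the previous one."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild the doc IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def varint_encode(values):
    """Pack non-negative ints 7 bits per byte; a set high bit means more bytes follow."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)

def varint_decode(data):
    """Inverse of varint_encode."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values

def to_bitset(doc_ids):
    """Represent a set of doc IDs as an int with bit d set for each doc ID d."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits

postings = [1000, 1003, 1010, 1024, 1100]  # made-up doc IDs for one term
encoded = varint_encode(delta_encode(postings))
print(len(encoded))  # 6 bytes, versus 20 bytes for five 32-bit integers
print(delta_decode(varint_decode(encoded)) == postings)  # True

# Intersecting two terms becomes a single bitwise AND.
both = to_bitset([0, 2]) & to_bitset([1, 2])
print(bin(both))  # 0b100, i.e. only doc 2 matches both terms
```

The gaps between consecutive IDs are small numbers, so most of them fit in a single byte. None of this would be possible on string identifiers directly, since gaps and bit positions are only defined for integers.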

In summary, while the document identifier (_id) in Elasticsearch is a string that defaults to an auto-generated UUID-style value, at the underlying level each document is additionally assigned a numeric identifier (the Lucene doc ID), which can be utilized for various compression algorithms and optimization techniques.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)