Closed bra-fsn closed 5 months ago
Agreed. I think one reasonable way to implement it would be to keep using a _source stored field, but organize it so that the first bytes describe the document structure. Since Lucene happens to compress documents into 16KB blocks (a single block can hold multiple docs), we would only pay the price for the document structure once per block if all documents in it have the same structure.
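A minimal sketch of that idea, not Lucene's actual stored-fields format: each block starts with the distinct document structures (field-name lists), and every document then carries only a structure index plus its values. The function names `encode_block` and `decode_block` are hypothetical, JSON is used only as a convenient container, and `zlib` stands in for whatever block compression the real codec applies.

```python
import json
import zlib


def encode_block(docs):
    """Encode flat dicts as one block: structures first, then per-doc values."""
    structures = []   # distinct field-name tuples, in first-seen order
    index = {}        # field-name tuple -> position in `structures`
    rows = []
    for doc in docs:
        keys = tuple(doc.keys())
        if keys not in index:
            index[keys] = len(structures)
            structures.append(keys)
        # A row is the structure index followed by the values in field order.
        rows.append([index[keys], *doc.values()])
    payload = json.dumps(
        {"structures": structures, "rows": rows}, separators=(",", ":")
    ).encode("utf-8")
    # zlib is an assumption here, used only to mimic block-level compression.
    return zlib.compress(payload)


def decode_block(blob):
    """Rebuild the original documents from an encoded block."""
    block = json.loads(zlib.decompress(blob))
    return [
        dict(zip(block["structures"][row[0]], row[1:]))
        for row in block["rows"]
    ]


docs = [{"user": f"u{i}", "status": 200, "bytes": 512 + i} for i in range(100)]
blob = encode_block(docs)
assert decode_block(blob) == docs
```

When every document in the block has the same fields, `structures` holds a single entry, so the field names are written once per block rather than once per document.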
I'd played around with doing something like this a while ago! I never finished it but it was a fun thing to look at.
Would it be a bad idea to factor out doc storage to a key-value database, which may be better suited to this task?
I don't think that would make things easier. We have quite complicated visibility guarantees that I think would be hard to meet without a built-in document store like the one we have today.
Implementing #27374 would partly solve this (with some restrictions coming from the doc values storage).
Pinging @elastic/es-distributed (Team:Distributed)
In version 8.4, Elasticsearch introduced a new feature called synthetic _source that can significantly reduce index size by rebuilding the _source of documents from doc values.
There are of course trade-offs and restrictions compared with keeping the original source of documents, but that is the best compaction of document sources Elasticsearch can provide.
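For reference, a sketch of how synthetic _source was enabled in mappings around the 8.4 release, written as a Python dict so the request body can be inspected without a cluster. The index name and the field list are hypothetical, and later releases may configure this differently, so check the documentation for the version in use.

```python
import json

# Hypothetical index and fields; only the "_source" entry is the point here.
index_name = "logs-synthetic"
create_index_body = {
    "mappings": {
        # Ask Elasticsearch to rebuild _source from doc values on read
        # instead of storing the original JSON.
        "_source": {"mode": "synthetic"},
        "properties": {
            "@timestamp": {"type": "date"},
            "host": {"type": "keyword"},
            "status": {"type": "integer"},
        },
    }
}

# Print the request that would be sent to create the index.
print(f"PUT /{index_name}")
print(json.dumps(create_index_body, indent=2))
```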
I'm going to close this issue.
AFAIK, Elasticsearch stores "whole" JSON documents on disk, meaning an integer is represented by its string value and every field name takes up space in every stored document. Other databases use more compact storage formats and various other tricks. For example, see ArangoDB: https://www.arangodb.com/2012/07/collection-disk-usage-arangodb/ "ArangoDB separates the document structure and the actual document data when saving a document. Document structure information, consisting of attribute names and attribute data types, is stored as so-called “shapes”. The document data stored will only contain a shape-id (a reference to an existing shape), and multiple documents can point to the same shapes. This helps in reducing disk usage when many or even all documents in a collection have the same structure." And: https://github.com/arangodb/velocypack.
Given that Elasticsearch has a pretty fixed "schema" (defined by mappings), using these techniques could lower storage needs significantly, possibly opening the way for new use cases (for example, storing blobs more efficiently in integer lists).
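To make the "shapes" idea concrete, here is a toy sketch. It is not ArangoDB's or VelocyPack's actual format; the `ShapeStore` class and its byte layout are assumptions for illustration only. Attribute names and types are stored once per shape, and each record holds only a shape id followed by the binary values.

```python
import struct


class ShapeStore:
    """Toy shape registry: supports only int and str values."""

    def __init__(self):
        self.shapes = []   # shape id -> tuple of (name, type_name)
        self.ids = {}      # shape -> shape id

    def encode(self, doc):
        shape = tuple((name, type(value).__name__) for name, value in doc.items())
        if shape not in self.ids:
            self.ids[shape] = len(self.shapes)
            self.shapes.append(shape)
        out = [struct.pack("<I", self.ids[shape])]      # 4-byte shape id
        for value in doc.values():
            if type(value) is int:
                out.append(struct.pack("<q", value))    # 8-byte signed integer
            elif type(value) is str:
                raw = value.encode("utf-8")
                out.append(struct.pack("<I", len(raw)) + raw)  # length + bytes
            else:
                raise TypeError(f"unsupported type: {type(value).__name__}")
        return b"".join(out)

    def decode(self, record):
        (shape_id,) = struct.unpack_from("<I", record, 0)
        offset, doc = 4, {}
        for name, type_name in self.shapes[shape_id]:
            if type_name == "int":
                (doc[name],) = struct.unpack_from("<q", record, offset)
                offset += 8
            else:
                (size,) = struct.unpack_from("<I", record, offset)
                offset += 4
                doc[name] = record[offset:offset + size].decode("utf-8")
                offset += size
        return doc


store = ShapeStore()
doc = {"status": 200, "host": "web-1"}
record = store.encode(doc)
assert store.decode(record) == doc
```

The record contains no field names at all, so documents that share a shape pay for the names only once. A real format would likely use variable-length integers rather than a fixed 8 bytes per number.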