Closed jpountz closed 7 years ago
My vote would be for number three or four:

> Option 3: Add SORTED doc values to `_type` and `_id`
> Option 4: Add SORTED doc values to `_type` and BINARY to `_id`
With #14783 we already enable doc values for `_type`, so it makes sense to individually call out the `_id` as well. This also allows changes to happen to `_type` without necessarily breaking `_id`.
In my experience, most users do not use random sorting, but sorting on `_id` is not very common either. With only sorting in mind, I would expect to see `_id` used for non-random sorting a lot more than for random sorting. However, other use cases that reference the `_id` through fielddata (rarely, but sometimes in aggregations, as well as in scripts) may tip it in favor of being binary.
There is a huge difference between `_type` and `_id` when it comes to the expense of doc values. `_type` is a low-cardinality field. This means if you have only 2 unique values, `foo` and `bar`, Lucene will deduplicate this and write 1 bit per document. If you only have 1 unique type (also common), we will write 0 bits per document; each segment just records in its metadata "all docs have value `foo`", so it costs nothing. So for 10M documents with 2 types, doc values for this field cost a little over a megabyte.
On the other hand, unique IDs are high cardinality by definition: deduplication does nothing. Either choice is extremely costly in comparison. Let's consider 10M documents with IDs of 16 bytes each and make some guesses:
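The back-of-the-envelope arithmetic behind these numbers can be sketched like this (a rough model only; real Lucene encodings add per-segment metadata and block-packing overhead):

```python
import math

def low_cardinality_cost(num_docs, num_unique):
    # SORTED doc values deduplicate: each document stores only an ordinal,
    # packed at ceil(log2(num_unique)) bits per document.
    bits_per_doc = math.ceil(math.log2(num_unique)) if num_unique > 1 else 0
    return num_docs * bits_per_doc // 8  # bytes spent on per-doc ordinals

def unique_id_cost(num_docs, id_len_bytes):
    # Unique IDs: deduplication saves nothing, every document pays for its value.
    return num_docs * id_len_bytes

# 10M docs, 2 types -> 1_250_000 bytes: "a little over a megabyte"
print(low_cardinality_cost(10_000_000, 2))
# 10M docs, 1 type -> 0 bytes: the single value lives in segment metadata
print(low_cardinality_cost(10_000_000, 1))
# 10M docs, 16-byte unique ids -> 160_000_000 bytes before any compression
print(unique_id_cost(10_000_000, 16))
```

The two orders of magnitude between ~1 MB and ~160 MB is the apples-and-oranges gap described below.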
I just want to make it clear this is apples and oranges. The fact we turned on doc values for `_type` is irrelevant when it comes to unique IDs. We need very strong use cases and features, IMO, if we are going to incur this cost.
Very good info @rmuir, as usual. It makes me think that `_id` supporting doc values should exist (particularly in light of #15155), but it should be opt-in.
I think it's actually 20 bytes for ES's auto-generated IDs (15 fully binary bytes for the Flake ID, and 20 bytes once it's Base64 encoded) ... but, yeah, this would be a big cost ...
Why do we base64? This probably bloats the terms dict today.
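The bloat here is just the usual Base64 ratio of 4 output bytes per 3 input bytes. A quick check (assuming URL-safe Base64 without padding, per the 15-byte Flake ID mentioned above):

```python
import base64, os

raw = os.urandom(15)  # stand-in for a 15-byte Flake-style auto-generated ID
encoded = base64.urlsafe_b64encode(raw)  # 3 input bytes -> 4 output bytes; 15 % 3 == 0, so no padding
print(len(raw), len(encoded))  # 15 20
```

So every auto-generated ID carries 5 extra bytes into the terms dictionary compared to indexing the raw binary form.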
> It makes me think that _id supporting doc values should exist
Why can't a user store this in their own field if they want to do something crazy with it? I don't think we should add back configurability for metadata fields, even if it is just one. It was a lot of work to remove that (#8143), and these are our fields, for internal use by Elasticsearch. Edge cases like the one described in #15155 can be handled by a user field with doc values enabled, if they want to do such a crazy thing.
But edge cases like #15155 cannot be handled without some other special handling, because it's the access of the `_id` that is the slowdown. Adding a doc values field does not bypass that cost.
Hi all! @pickypg linked this issue to me because he knows it's near and dear to my heart.
My exact use case (shameless plug: @zombodb: https://github.com/zombodb/zombodb) is actually what y'all are describing as an "edge case" in #15155: ES is being used as a search index only (i.e., `store=false`, `_source` disabled), and an external "source of truth" (Postgres) is used to provide document data back to the user.
While @zombodb might be unique in implementation, I doubt that its general approach, providing `_id` values and using them to later look up records in an external source, is.
An implementation detail: through a REST endpoint plugin, @zombodb uses the SCAN+SCROLL API to retrieve all matching `_id` values, re-encodes them as 6-byte pairs, and streams them back as a binary blob.
Against ES v1.7 (and 1.6 and 1.5), benchmarking has shown that the overhead of simply retrieving the `_id` value completely swamps both the search itself and the String-to-byte encoding ZDB does, so I'm excited y'all are looking at ways to make this better.
(As an aside, I've actually spent quite a bit of time debugging this (against 1.5), and found that if a parent<->child mapping exists, using its cache to look up the `_id` by ordinal (bypassing Lucene stored-field decompression and decoding of the `_id`) is nearly an order of magnitude faster. I gave some patches to @pickypg a while back through my employer's support agreement, but we all kinda decided it wasn't worth the effort of integrating into ES because v2.0 was near and changed everything.)
The idea that such things can "be handled by a user field with doc values enabled" isn't really true, as @pickypg pointed out, because ES still does all the work to retrieve the `_id` value for each hit.
So a half-baked idea: what if retrieving the `_id` could be disabled on a search-by-search basis? Instead, the search request would specify a "user field with doc values enabled" that is a copy of the `_id` value. Maybe more generally: the ability to elide returning all the fields that are deemed "for internal use by elasticsearch"?
So I experimented with this idea (disabling returning _id and _type) against v1.7 (I'm not in a position to work with v2.x yet).
All I did was quickly hack `FetchPhase.java` to set the `fieldsVisitor` to null, guard against that in the places it's used, and hardcode both the "type" and "id" properties of the `SearchHit` to the empty string.
I then set up a little benchmark using @zombodb. With a query that returns 14k documents, retrieving all the "ids" in a SCAN+SCROLL loop:

Stock ES: 17 per second
Hacked version: 120 per second
Of course, all the ids were blank, so it's not very useful!
I then added a `doc_values=true` field to the index that contains a copy of the `_id` field. Against the hacked version, I was able to sustain 104 per second. That's about a 6x gain over stock. There's definitely quite a bit of overhead in uid decoding.
In case you care how I hacked FetchPhase.java: https://gist.github.com/eeeebbbbrrrr/9af88e6dc88943450c73
> With a query that returns 14k documents
You should just return the top-N instead. That is what lucene is designed to do.
> You should just return the top-N instead. That is what lucene is designed to do.
The point is that there's room for significant improvement around how `_uid` is handled. I was trying to show what the overhead is: on my test data, on my laptop, it's about 6x. If a reasonable way to improve this can be found, everyone wins.
Well, lucene just isn't designed to return 14k documents, and by the way, doc values aren't designed for that either. For such huge result sets a database is a better solution, as it is designed for those use cases.
Just like you wouldn't move your house with a sports car: it's a faster vehicle, but it's gonna be slower overall.
> Just like you wouldn't move your house with a sports car: it's a faster vehicle, but it's gonna be slower overall.
I don't know how this is relevant.
If y'all make progress towards improving `_uid` in whatever way, I'd be happy to help test and benchmark changes.
Hey, I stumbled upon this issue while trying to do something similar in Elasticsearch. I aimed (ambitiously) to retrieve ~1 million documents in under 1 second based on a simple filter query. Using the hot_threads API, I noticed that decompressing the '_id' field was taking a while (~8 seconds):
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:249)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:427)
org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:219)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:184)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:401)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:833)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:824)
So I wrote a plugin that stops retrieving the '_id' field and instead retrieves a secondary integer doc_values field from the document, specified in the query. I thought this would be super quick, but surprisingly it took almost the same amount of time, and now the hot_threads API showed:
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:249)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:70)
org.elasticsearch.search.lookup.SourceLookup.extractRawValues(SourceLookup.java:145)
plugin.retrievedocvalues.search.fetch.CustomFetchPhase.createSearchHit(CustomFetchPhase.java:256)
plugin.retrievedocvalues.search.fetch.CustomFetchPhase.execute(CustomFetchPhase.java:189)
plugin.retrievedocvalues.search.CustomSearchService.executeFetchPhase(CustomSearchService.java:500)
The query I'm using is against a custom endpoint and the body is:

{
  "sort": "_doc",
  "_source": false,
  "fields": ["foo"],
  "size": 1000000,
  "filter": {
    "bool": {
      "should": [
        { "term": { "foo": "bar" } },
        { "term": { "baz": "qux" } }
      ]
    }
  }
}
The field 'foo' is an integer field with doc_values enabled, on ES version 1.7.1. The weird thing is that an aggregation on the field is super quick, but retrieving the data itself is slow.
I guess the underlying point is that enabling doc_values on the '_id' field may not be that much faster, since I can't see much of an improvement, unless I'm missing something that someone here could point out?
@shamak you can use `fielddata_fields` in your search request to retrieve field values from doc values (or from in-memory fielddata). `fields` is meant to get stored fields, with a fallback to `_source` (that fallback was removed in 5.x as it is confusing):
GET _search
{
  "fielddata_fields": ["fieldname"]
}
Note though that getting 10K docs should be done with a scroll rather than fetching so many docs at once.
Since we now promote doc values as a possible data storage (next to `_source` and stored fields), I wonder if we should support a `doc_value_fields` entry in the search response. I think more and more people will expect it to be there. /cc @clintongormley @jpountz
> Since we now promote doc values as a possible data storage (next to _source and stored fields), I wonder if we should support a doc_value_fields entry in the search response. I think more and more people will expect it to be there.
That's essentially what `fielddata_fields` is. We were talking about not using the doc values terminology in favour of in-memory vs on-disk fielddata, although I don't think that's the right tradeoff either. The "fielddata" term has history, and referring to doc values as "on-disk" does them a disservice given that they're usually cached in RAM.
So yes, maybe we should add `doc_values_fields` (or just `doc_values`?) as a synonym for `fielddata_fields`?
Something else we could consider would be to store the id and type only in doc values, and not in stored fields, in order not to incur a large increase in index size. The benefit is that we would not need any new option on the mappings. However, the fetch phase would have to do 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.
> However the fetch phase would have to do 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.
I suppose then disabling `_source` would entirely skip stored fields, which is kind of cool.
I suspect the `_type` lookup is going to be cached super fast, especially if we ever decide to sort by `_type`. Many use cases have a single type per index, so the type lookup is just metadata. Either way, I suspect you'd see closer to 2 seeks than 3. Even still, 2 is much worse than 1.
Another question: do we really need to return the `_id` and `_type` all the time? I know I typically just want some portion of the `_source`: usually two or three fields plus a couple of highlights. Anyway, maybe we should allow those to be disabled.
I like the idea of not always returning those fields, as it's unnecessary information in a lot of cases, especially for the single-`_type` use case. We call it metadata, so maybe we should treat it like metadata and only return it when requested (defaulting to true).
I am fine with allowing some of those meta fields to not be returned, but I tend to like that they are returned by default: it is easy to forget that some things are not available if they are not returned by default, and it makes reindexing easier, since you don't have to think about which fields you might need: everything is there by default.
I ran some tests to check the cost of adding doc values to the `_id` field. I indexed 1M documents with one field (`_id`) in 3 configurations:
Configuration | Size | Indexing (docs/s) | Random access (docs/s) | Sequential access (docs/s)
---|---|---|---|---
Stored | 12 MB | 372,000 | 532,000 | 716,000
BinaryDV | 26 MB | 378,000 | 9,009,000 | 40,000,000
SortedDV | 13 MB | 255,000 | 4,608,000 | 16,129,000
Binary doc values double the size of the index because they don't use any compression. They are very fast for accessing any value, and indexing speed is almost the same as with the stored field. Sorted doc values have almost the same size as the stored field, thanks to the prefix compression they use to store the values. They are also quite fast for accessing any value, but indexing is slower (~30% slower).
Configuration | Size | Indexing (docs/s) | Random access (docs/s) | Sequential access (docs/s)
---|---|---|---|---
Stored | 49 MB | 332,000 | 719,000 | 1,751,000
BinaryDV | 46 MB | 358,000 | 8,695,000 | 38,461,000
SortedDV | 48 MB | 246,000 | 5,524,000 | 9,523,000
For the random-id case, the size of the index is almost the same for the 3 configurations, but the sorted doc values are still slower at indexing.
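A toy model shows why prefix compression closes the size gap for the first run but not for random IDs: in a sorted dictionary, sequential IDs share long prefixes with their neighbours while random IDs share almost none. (The 2-bytes-per-entry bookkeeping and the ID shapes below are assumptions for illustration, not Lucene's actual encoding.)

```python
import os

def shared_prefix_len(a, b):
    # number of leading bytes the two terms have in common
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def dict_size(terms):
    # Toy model of a prefix-compressed sorted dictionary: each entry stores
    # 2 bookkeeping bytes (prefix/suffix lengths) plus only the suffix bytes
    # that differ from the previous term.
    size, prev = 0, b""
    for t in sorted(terms):
        p = shared_prefix_len(prev, t)
        size += 2 + (len(t) - p)
        prev = t
    return size

n = 100_000
sequential = [b"%016d" % i for i in range(n)]    # flake-like: neighbours share long prefixes
random_ids = [os.urandom(16) for _ in range(n)]  # uuid-like: neighbours share almost nothing

print(dict_size(sequential))  # a small fraction of the raw n * 16 bytes
print(dict_size(random_ids))  # close to the raw n * 16 bytes
```

Under this model the sequential dictionary shrinks to a few bytes per entry, while the random one stays near its raw size, matching the 13 MB vs 48 MB SortedDV numbers above.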
I ran some benchmarks, and the extra cost during indexing for the sorted doc values is the sorting of the dictionary (in this case we need to do it twice: once for the terms dictionary of the postings and once for the sorted doc values).

Since each `_id` is unique, I tried to add a way to search on the sorted doc values directly. To do so, I just added a file that contains the docID for each `_id`. It's an extra cost of 4 bytes per document (for random access it's faster to use a full int rather than a vint or block compression), and to search for a docID it needs to retrieve the ordinal of the term first and then seek/read the docID.

For existing `_id` values the search is faster than the one that uses the postings, but it can be slower when the `_id` does not exist and does not share a prefix with the existing ones (the latter case is optimized in the terms dictionary of the postings). I don't know if this is something we want to explore, but I wanted to propose at least one option in case the extra cost of adding doc values to the `_id` field is prohibitive.
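The lookup path being described can be sketched like this (an in-memory stand-in only: a plain binary search replaces the terms-dictionary machinery, and names like `IdLookup` are made up for illustration):

```python
import bisect

class IdLookup:
    """Toy model: sorted-doc-values dictionary plus a per-ordinal docID table."""

    def __init__(self, id_by_doc):
        # Sort the unique _id values; remember, per ordinal, which docID
        # carries that value (ids are unique by definition).
        pairs = sorted((t, doc) for doc, t in enumerate(id_by_doc))
        self.terms = [t for t, _ in pairs]        # the sorted dictionary
        self.doc_by_ord = [d for _, d in pairs]   # the 4-bytes-per-doc side file

    def lookup(self, term):
        # Step 1: find the term's ordinal (binary search over the dictionary).
        # Step 2: one random-access read in the ordinal -> docID table.
        ord_ = bisect.bisect_left(self.terms, term)
        if ord_ < len(self.terms) and self.terms[ord_] == term:
            return self.doc_by_ord[ord_]
        return None  # absent ids fall out of the search

ids = ["kappa", "alpha", "mu", "delta"]  # ids in docID order
idx = IdLookup(ids)
print(idx.lookup("mu"))    # 2: the docID that indexed "mu"
print(idx.lookup("zeta"))  # None
```

The trade-off noted above falls out of this shape: a present term resolves with one dictionary probe plus one table read, while an absent term still pays for the full probe, which the postings terms dictionary can short-circuit on a mismatched prefix.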
Thanks for testing! The hybrid postings/doc-values idea sounds appealing, but it might be challenging to expose it cleanly (I haven't thought much about it). Otherwise I am wondering how much LUCENE-7299 would close the gap in terms of indexing speed with SORTED_SET doc values, and also whether we should implement some simple compression on binary doc values for such cases (e.g. based on the most common ngrams).
I don't think the idea of trying to use the postings dictionary for the term dictionary will work well (besides practical concerns). It will simply be too slow.
The problem is, they are different data structures (it is like trie versus tree, but the difference is important).
The terms dictionary is optimized for lookup by "String", but the docvalues dictionary is optimized for lookup by ordinal.
The docvalues lookup by term is much slower than the postings one, because it's not optimized for that. The inverse is true for lookup by ordinal: the entire data structure is built around doing this with as little overhead as possible: it can do random access within a block, etc.
Given that even a vint for prefix/suffix length is too costly for that case, I don't think we should introduce a per-byte branch with something like n-gram compression. I have run the numbers for that on several datasets (real data: not artificial crap like IDs) and it only saves something like 25% space for that data structure, depending on the text; in many cases lower than that.
It's important to keep seek-by-ord fast at the moment, because too much code uses sorted/sorted_set docvalues in an abusive fashion, with a seek-by-ord for every document to look up the text. Elasticsearch has gotten a little better by incorporating things like global ordinals, but it still has bad guys like its scripting support. There are similar cases for other lucene users, and even in some lucene modules. Historically, people wrote code expecting this to be "ok" and "fast" with fieldcache/fielddata, because that did no compression at all: not even prefix compression. A lot of this code was just ported to docvalues without addressing this, so we still have to keep it fast.
We don't want to add an option to metadata fields, and we don't want to make everyone pay the price for doc values on `_id`, so we will have to do without doc values on `_id`.
We already use fielddata on the `_uid` field today in order to implement random sorting. However, given that doc values are disabled on `_uid`, this uses an insane amount of memory to load the information, given that this field only has unique values.

Having better fielddata for `_uid` would also be useful in order to have a more consistent sort order when paginating or hitting different replicas: we could always add a tie-break on the value of the `_uid` field.

I think we have several options:

Option 1: Add SORTED doc values to `_uid`
Option 2: Add BINARY doc values to `_uid`
Option 3: Add SORTED doc values to `_type` and `_id`
Option 4: Add SORTED doc values to `_type` and BINARY to `_id`

Option 2 would probably be wasteful in terms of disk space, given that we don't have good compression available for binary doc values (and it's hard to implement, given that the values can store pretty much anything).

Options 3 and 4 have the benefit of not having to duplicate information if we also want to have doc values on `_type` and `_id`: we could even build a BINARY fielddata view for `_uid`.

Then the other question is whether we should rather use sorted or binary doc values, the former being better for sorting (useful for the consistent-sorting use case) and the latter being better for value lookups (useful for random sorting).
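The sorted-vs-binary trade-off can be made concrete with a toy contrast of the two encodings (not Lucene's actual layout; the sample values are made up for illustration):

```python
# The same column of per-document values, encoded both ways.
values = [b"charlie", b"alice", b"bob", b"alice"]

# BINARY doc values: the raw bytes, one entry per document.
# A value lookup (e.g. for random sorting) is one direct read.
binary_dv = list(values)
assert binary_dv[2] == b"bob"

# SORTED doc values: a sorted, deduplicated dictionary plus one
# small ordinal per document.
dictionary = sorted(set(values))                  # [b"alice", b"bob", b"charlie"]
ordinals = [dictionary.index(v) for v in values]  # [2, 0, 1, 0]

# Sorting documents compares small ints and never touches the bytes,
# which is what makes SORTED attractive for consistent sort orders.
docs_sorted = sorted(range(len(values)), key=lambda d: ordinals[d])
print(docs_sorted)  # [1, 3, 2, 0]
```

For a field where every value is unique, like `_uid`, the dictionary holds every value anyway, so the choice hinges on sort-by-ordinal vs direct value reads rather than on space savings.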