Closed jpountz closed 7 years ago
My vote would be for number three or four:

> Option 3: Add SORTED doc values to `_type` and `_id`
> Option 4: Add SORTED doc values to `_type` and BINARY to `_id`
With #14783 we already enable doc values for `_type`, so it makes sense to individually call out the `_id` as well. This also allows changes to happen to `_type` without necessarily breaking `_id`.
In my experience, most users do not use random sorting, but sorting on `_id` is not very common either. With only sorting in mind, I would expect to see `_id` used for non-random sorting a lot more than for random sorting. However, other use cases that reference the `_id` through fielddata (rarely, but sometimes in aggregations, as well as in scripts) may tip it in favor of being binary.
There is a huge difference between `_type` and `_id` when it comes to the expense of doc values. `_type` is a low-cardinality field. This means if you have only 2 unique values, `foo` and `bar`, Lucene will deduplicate this and write 1 bit per document. If you only have 1 unique type (also common), we will write 0 bits per document; each segment just records in its metadata "all docs have value `foo`", so it costs nothing. So for 10M documents with 2 types, doc values for this field cost a little over a megabyte.
On the other hand, unique IDs are high cardinality by definition: deduplication does nothing. Either choice is extremely costly in comparison. Let's consider 10M documents with IDs of 16 bytes each and make some guesses:
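The back-of-the-envelope arithmetic behind these numbers can be sketched like this (a rough model only; real Lucene encodings add per-segment metadata and block-packing overhead):

```python
import math

def low_cardinality_cost(num_docs, num_unique):
    # SORTED doc values deduplicate: each document stores only an ordinal,
    # packed at ceil(log2(num_unique)) bits per document.
    bits_per_doc = math.ceil(math.log2(num_unique)) if num_unique > 1 else 0
    return num_docs * bits_per_doc // 8  # bytes spent on per-doc ordinals

def unique_id_cost(num_docs, id_len_bytes):
    # Unique IDs: deduplication saves nothing, every document pays for its value.
    return num_docs * id_len_bytes

# 10M docs, 2 types -> 1_250_000 bytes: "a little over a megabyte"
print(low_cardinality_cost(10_000_000, 2))
# 10M docs, 1 type -> 0 bytes: the single value lives in segment metadata
print(low_cardinality_cost(10_000_000, 1))
# 10M docs, 16-byte unique ids -> 160_000_000 bytes before any compression
print(unique_id_cost(10_000_000, 16))
```

The two orders of magnitude between ~1 MB and ~160 MB is the apples-and-oranges gap described below.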
I just want to make it clear this is apples and oranges. The fact we turned on doc values for `_type` is irrelevant when it comes to unique IDs. We need very strong use cases and features, IMO, if we are going to incur this cost.
Very good info @rmuir, as usual. It makes me think that `_id` supporting doc values should exist (particularly in light of #15155), but it should be opt-in.
I think it's actually 20 bytes for ES's auto-generated IDs (15 fully binary bytes for the Flake ID, and 20 bytes once it's Base64 encoded) ... but, yeah, this would be a big cost ...
Why do we base64? This probably bloats the terms dict today.
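The bloat here is just the usual Base64 ratio of 4 output bytes per 3 input bytes. A quick check (assuming URL-safe Base64 without padding, per the 15-byte Flake ID mentioned above):

```python
import base64, os

raw = os.urandom(15)  # stand-in for a 15-byte Flake-style auto-generated ID
encoded = base64.urlsafe_b64encode(raw)  # 3 input bytes -> 4 output bytes; 15 % 3 == 0, so no padding
print(len(raw), len(encoded))  # 15 20
```

So every auto-generated ID carries 5 extra bytes into the terms dictionary compared to indexing the raw binary form.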
> It makes me think that _id supporting doc values should exist
Why can't a user store this in their own field if they want to do something crazy with it? I don't think we should add back configurability for metadata fields, even if it is just one. It was a lot of work to remove that (#8143), and these are our fields, for internal use by Elasticsearch. Edge cases like the one described in #15155 can be handled by a user field with doc values enabled, if they want to do such a crazy thing.
But edge cases like #15155 cannot be handled without some other special handling, because it's the access of the `_id` that is the slowdown. Adding a doc values field does not bypass that cost.
Hi all! @pickypg linked this issue to me because he knows it's near and dear to my heart.
My exact use case (shameless plug: @zombodb: https://github.com/zombodb/zombodb) is actually what y'all are describing as an "edge case" in #15155: ES is being used as a search index only (i.e., `store=false`, `_source` disabled), and an external "source of truth" (Postgres) is used to provide document data back to the user.
While @zombodb might be unique in implementation, I doubt that its general approach, providing `_id` values and using them to later look up records in an external source, is.
An implementation detail: through a REST endpoint plugin, @zombodb uses the SCAN+SCROLL API to retrieve all matching `_id` values, re-encodes them as 6-byte pairs, and streams them back as a binary blob.
Against ES v1.7 (and 1.6 and 1.5), benchmarking has shown that the overhead of simply retrieving the `_id` value completely swamps both the search itself and the String-to-byte encoding ZDB does, so I'm excited y'all are looking at ways to make this better.
(As an aside, I've actually spent quite a bit of time debugging this (against 1.5), and found that if a parent<->child mapping exists, using its cache to look up the `_id` by ordinal (bypassing Lucene stored-field decompression and decoding of the `_id`) is nearly an order of magnitude faster. I gave some patches to @pickypg a while back through my employer's support agreement, but we all kinda decided it wasn't worth the effort of integrating into ES because v2.0 was near and changed everything.)
The idea that such things can "be handled by a user field with doc values enabled" isn't really true, as @pickypg pointed out, because ES still does all the work to retrieve the `_id` value for each hit.
So a half-baked idea: what if retrieving the `_id` could be disabled on a search-by-search basis? Instead, the search request would specify a "user field with doc values enabled" that is a copy of the `_id` value. Maybe more generally: the ability to elide returning all the fields that are deemed "for internal use by elasticsearch"?
So I experimented with this idea (disabling returning _id and _type) against v1.7 (I'm not in a position to work with v2.x yet).
All I did was quickly hack `FetchPhase.java` to set the `fieldsVisitor` to null, guard against that in the places it's used, and hardcode both the "type" and "id" properties of the `SearchHit` to the empty string.
I then set up a little benchmark using @zombodb. With a query that returns 14k documents, retrieving all the "ids" in a SCAN+SCROLL loop:

Stock ES: 17 per second
Hacked version: 120 per second
Of course, all the ids were blank, so it's not very useful!
I then added a `doc_values=true` field to the index that contains a copy of the `_id` field. Against the hacked version, I was able to sustain 104 per second. That's about a 6x gain over stock. There's definitely quite a bit of overhead in uid decoding.
In case you care how I hacked FetchPhase.java: https://gist.github.com/eeeebbbbrrrr/9af88e6dc88943450c73
> With a query that returns 14k documents
You should just return the top-N instead. That is what lucene is designed to do.
> You should just return the top-N instead. That is what lucene is designed to do.
The point is that there's room for significant improvement around how `_uid` is handled. I was trying to show what the overhead is: on my test data, on my laptop, it's about 6x. If a reasonable way to improve this can be found, everyone wins.
Well, lucene just isn't designed to return 14k documents, and by the way, doc values aren't designed for that either. For such huge result sets a database is a better solution, as it is designed for those use cases.
Just like you wouldn't move your house with a sports car: it's a faster vehicle, but it's gonna be slower overall.
> Just like you wouldn't move your house with a sports car: it's a faster vehicle, but it's gonna be slower overall.
I don't know how this is relevant.
If y'all make progress towards improving `_uid` in whatever way, I'd be happy to help test and benchmark changes.
Hey, I stumbled upon this issue while trying to do something similar in Elasticsearch. I aimed (ambitiously) to retrieve ~1 million documents in under 1 second based on a simple filter query. Using the hot_threads API, I noticed that decompressing the '_id' field was taking a while (~8 seconds):
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:249)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:427)
org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:219)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:184)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:401)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:833)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:824)
So I wrote a plugin that stops retrieving the '_id' field and instead retrieves a secondary integer doc_values field from the document, specified in the query. I thought this would be super quick, but surprisingly it took almost the same amount of time, and now the hot_threads API showed:
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:249)
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:335)
org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:70)
org.elasticsearch.search.lookup.SourceLookup.extractRawValues(SourceLookup.java:145)
plugin.retrievedocvalues.search.fetch.CustomFetchPhase.createSearchHit(CustomFetchPhase.java:256)
plugin.retrievedocvalues.search.fetch.CustomFetchPhase.execute(CustomFetchPhase.java:189)
plugin.retrievedocvalues.search.CustomSearchService.executeFetchPhase(CustomSearchService.java:500)
The query I'm using is against a custom endpoint and the body is:

{
  "sort": "_doc",
  "_source": false,
  "fields": ["foo"],
  "size": 1000000,
  "filter": {
    "bool": {
      "should": [
        { "term": { "foo": "bar" } },
        { "term": { "baz": "qux" } }
      ]
    }
  }
}
The field 'foo' is an integer field with doc_values enabled, on ES version 1.7.1. The weird thing is that an aggregation on the field is super quick, but retrieving the data itself is slow.
I guess the underlying point is that enabling doc_values on the '_id' field may not be that much faster, since I can't see much of an improvement, unless I'm missing something that someone here could point out?
@shamak you can use `fielddata_fields` in your search request to retrieve field values from doc values (or from in-memory fielddata). `fields` is meant to get stored fields, with a fallback to `_source` (that fallback was removed in 5.x as it is confusing):
GET _search
{
  "fielddata_fields": ["fieldname"]
}
Note though that getting 10K docs should be done with a scroll rather than fetching so many docs at once.
Since we now promote doc values as a possible data storage (next to `_source` and stored fields), I wonder if we should support a `doc_value_fields` entry in the search response. I think more and more people will expect it to be there. /cc @clintongormley @jpountz
> Since we now promote doc values as a possible data storage (next to _source and stored fields), I wonder if we should support a doc_value_fields entry in the search response. I think more and more people will expect it to be there.
That's essentially what `fielddata_fields` is. We were talking about not using the doc values terminology in favour of in-memory vs on-disk fielddata, although I don't think that's the right tradeoff either. The "fielddata" term has history, and referring to doc values as "on-disk" does them a disservice given that they're usually cached in RAM.
So yes, maybe we should add `doc_values_fields` (or just `doc_values`?) as a synonym for `fielddata_fields`?
Something else we could consider would be to store the id and type only in doc values, and not in stored fields, in order not to incur a large increase in index size. The benefit is that we would not need any new option on the mappings. However, the fetch phase would have to do 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.
> However the fetch phase would have to do 3 random seeks instead of 1, which could hurt if the index size is much larger than the fs cache.
I suppose then disabling `_source` would entirely skip stored fields, which is kind of cool.
I suspect the `_type` lookup is going to be cached super fast, especially if we ever decide to sort by `_type`. Many use cases have a single type per index, so the type lookup is just metadata. Either way, I suspect you'd see closer to 2 seeks than 3. Even still, 2 is much worse than 1.
Another question: do we really need to return the `_id` and `_type` all the time? I know I typically just want some portion of the `_source`: usually two or three fields plus a couple of highlights. Anyway, maybe we should allow those to be disabled.
I like the idea of not always returning those fields, as it's unnecessary information in a lot of cases, especially for the single-`_type` use case. We call it metadata, so maybe we should treat it like metadata and only return it when requested (defaulting to true).
I am fine with allowing some of those meta fields to not be returned, but I tend to like that they are returned by default: it is easy to forget that some things are not available if they are not returned by default, and it makes reindexing easier, since you don't have to think about which fields you might need: everything is there by default.
I ran some tests to check the cost of adding doc values to the `_id` field. I indexed 1M documents with one field (`_id`) in 3 configurations:
Configuration | Size | Indexing (docs/s) | Random access (docs/s) | Sequential access (docs/s)
---|---|---|---|---
Stored | 12 MB | 372,000 | 532,000 | 716,000
BinaryDV | 26 MB | 378,000 | 9,009,000 | 40,000,000
SortedDV | 13 MB | 255,000 | 4,608,000 | 16,129,000
Binary doc values double the size of the index because they don't use any compression. They are very fast for accessing any value, and indexing speed is almost the same as with the stored field. Sorted doc values have almost the same size as the stored field, thanks to the prefix compression they use to store the values. They are also quite fast for accessing any value, but indexing is slower (~30% slower).
Configuration | Size | Indexing (docs/s) | Random access (docs/s) | Sequential access (docs/s)
---|---|---|---|---
Stored | 49 MB | 332,000 | 719,000 | 1,751,000
BinaryDV | 46 MB | 358,000 | 8,695,000 | 38,461,000
SortedDV | 48 MB | 246,000 | 5,524,000 | 9,523,000
For the random-id case, the size of the index is almost the same for the 3 configurations, but the sorted doc values are still slower at indexing.
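A toy model shows why prefix compression closes the size gap for the first run but not for random IDs: in a sorted dictionary, sequential IDs share long prefixes with their neighbours while random IDs share almost none. (The 2-bytes-per-entry bookkeeping and the ID shapes below are assumptions for illustration, not Lucene's actual encoding.)

```python
import os

def shared_prefix_len(a, b):
    # number of leading bytes the two terms have in common
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def dict_size(terms):
    # Toy model of a prefix-compressed sorted dictionary: each entry stores
    # 2 bookkeeping bytes (prefix/suffix lengths) plus only the suffix bytes
    # that differ from the previous term.
    size, prev = 0, b""
    for t in sorted(terms):
        p = shared_prefix_len(prev, t)
        size += 2 + (len(t) - p)
        prev = t
    return size

n = 100_000
sequential = [b"%016d" % i for i in range(n)]    # flake-like: neighbours share long prefixes
random_ids = [os.urandom(16) for _ in range(n)]  # uuid-like: neighbours share almost nothing

print(dict_size(sequential))  # a small fraction of the raw n * 16 bytes
print(dict_size(random_ids))  # close to the raw n * 16 bytes
```

Under this model the sequential dictionary shrinks to a few bytes per entry, while the random one stays near its raw size, matching the 13 MB vs 48 MB SortedDV numbers above.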
I ran some benchmarks, and the extra cost during indexing for the sorted doc values is the sorting of the dictionary (in this case we need to do it twice: once for the terms dictionary of the postings and once for the sorted doc values).

Since each `_id` is unique, I tried to add a way to search on the sorted doc values directly. To do so, I just added a file that contains the docID for each `_id`. It's an extra cost of 4 bytes per document (for random access it's faster to use a full int rather than a vint or block compression), and to search for a docID it needs to retrieve the ordinal of the term first and then seek/read the docID.

For existing `_id` values the search is faster than the one that uses the postings, but it can be slower when the `_id` does not exist and does not share a prefix with the existing ones (the latter case is optimized in the terms dictionary of the postings). I don't know if this is something we want to explore, but I wanted to propose at least one option in case the extra cost of adding doc values to the `_id` field is prohibitive.
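The lookup path being described can be sketched like this (an in-memory stand-in only: a plain binary search replaces the terms-dictionary machinery, and names like `IdLookup` are made up for illustration):

```python
import bisect

class IdLookup:
    """Toy model: sorted-doc-values dictionary plus a per-ordinal docID table."""

    def __init__(self, id_by_doc):
        # Sort the unique _id values; remember, per ordinal, which docID
        # carries that value (ids are unique by definition).
        pairs = sorted((t, doc) for doc, t in enumerate(id_by_doc))
        self.terms = [t for t, _ in pairs]        # the sorted dictionary
        self.doc_by_ord = [d for _, d in pairs]   # the 4-bytes-per-doc side file

    def lookup(self, term):
        # Step 1: find the term's ordinal (binary search over the dictionary).
        # Step 2: one random-access read in the ordinal -> docID table.
        ord_ = bisect.bisect_left(self.terms, term)
        if ord_ < len(self.terms) and self.terms[ord_] == term:
            return self.doc_by_ord[ord_]
        return None  # absent ids fall out of the search

ids = ["kappa", "alpha", "mu", "delta"]  # ids in docID order
idx = IdLookup(ids)
print(idx.lookup("mu"))    # 2: the docID that indexed "mu"
print(idx.lookup("zeta"))  # None
```

The trade-off noted above falls out of this shape: a present term resolves with one dictionary probe plus one table read, while an absent term still pays for the full probe, which the postings terms dictionary can short-circuit on a mismatched prefix.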
Thanks for testing! The hybrid postings/doc-values idea sounds appealing, but it might be challenging to expose it cleanly (I haven't thought much about it). Otherwise I am wondering how much LUCENE-7299 would close the gap in terms of indexing speed with SORTED_SET doc values, and also whether we should implement some simple compression on binary doc values for such cases (e.g. based on the most common ngrams).
I don't think the idea of trying to use the postings dictionary for the term dictionary will work well (besides practical concerns). It will simply be too slow.
The problem is, they are different data structures (it is like trie versus tree, but the difference is important).
The terms dictionary is optimized for lookup by "String", but the docvalues dictionary is optimized for lookup by ordinal.
The docvalues lookup by term is much slower than the postings one, because it's not optimized for that. The inverse is true for lookup by ordinal: the entire data structure is built around doing this with as little overhead as possible: it can do random access within a block, etc.
Given that even a vint for prefix/suffix length is too costly for that case, I don't think we should introduce a per-byte branch with something like n-gram compression. I have run the numbers for that on several datasets (real data: not artificial crap like IDs) and it only saves something like 25% space for that data structure, depending on the text; in many cases lower than that.
It's important to keep seek-by-ord fast at the moment, because too much code uses sorted/sorted_set docvalues in an abusive fashion, with a seek-by-ord for every document to look up the text. Elasticsearch has gotten a little better by incorporating things like global ordinals, but it still has bad guys like its scripting support. There are similar cases for other lucene users, and even in some lucene modules. Historically, people wrote code expecting this to be "ok" and "fast" with fieldcache/fielddata, because that did no compression at all: not even prefix compression. A lot of this code was just ported to docvalues without addressing this, so we still have to keep it fast.
We don't want to add an option to metadata fields, and we don't want to make everyone pay the price for doc values on `_id`, so we will have to do without doc values on `_id`.
We already use fielddata on the `_uid` field today in order to implement random sorting. However, given that doc values are disabled on `_uid`, this uses an insane amount of memory to load the information, given that this field only has unique values.

Having better fielddata for `_uid` would also be useful in order to have a more consistent sort order when paginating or hitting different replicas: we could always add a tie-break on the value of the `_uid` field.

I think we have several options:

Option 1: Add SORTED doc values to `_uid`
Option 2: Add BINARY doc values to `_uid`
Option 3: Add SORTED doc values to `_type` and `_id`
Option 4: Add SORTED doc values to `_type` and BINARY to `_id`

Option 2 would probably be wasteful in terms of disk space, given that we don't have good compression available for binary doc values (and it's hard to implement, given that the values can store pretty much anything).

Options 3 and 4 have the benefit of not having to duplicate information if we also want to have doc values on `_type` and `_id`: we could even build a BINARY fielddata view for `_uid`.

Then the other question is whether we should rather use sorted or binary doc values, the former being better for sorting (useful for the consistent-sorting use case) and the latter being better for value lookups (useful for random sorting).
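The sorted-vs-binary trade-off can be made concrete with a toy contrast of the two encodings (not Lucene's actual layout; the sample values are made up for illustration):

```python
# The same column of per-document values, encoded both ways.
values = [b"charlie", b"alice", b"bob", b"alice"]

# BINARY doc values: the raw bytes, one entry per document.
# A value lookup (e.g. for random sorting) is one direct read.
binary_dv = list(values)
assert binary_dv[2] == b"bob"

# SORTED doc values: a sorted, deduplicated dictionary plus one
# small ordinal per document.
dictionary = sorted(set(values))                  # [b"alice", b"bob", b"charlie"]
ordinals = [dictionary.index(v) for v in values]  # [2, 0, 1, 0]

# Sorting documents compares small ints and never touches the bytes,
# which is what makes SORTED attractive for consistent sort orders.
docs_sorted = sorted(range(len(values)), key=lambda d: ordinals[d])
print(docs_sorted)  # [1, 3, 2, 0]
```

For a field where every value is unique, like `_uid`, the dictionary holds every value anyway, so the choice hinges on sort-by-ordinal vs direct value reads rather than on space savings.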