elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support in-place update for single-valued non-indexed non-stored numeric doc value based fields. #30433

Open ebernhardson opened 6 years ago

ebernhardson commented 6 years ago

I have a specific use case: pushing a weekly update of a floating-point value to 1B documents. The value represents the popularity of an item and is used at query time as part of the scoring calculation. Our current solution is to push bulk updates along with a script that no-ops any update smaller than some threshold; the threshold is a trade-off between accuracy and the percentage of the index that is deleted and reindexed. Pushing the current value on regular document updates also helps the no-op be more effective.

Updatable doc values in Lucene seem like a potential solution, and Solr exposes them (with a variety of caveats). Could a very focused implementation offer the same ability to push single-valued floating-point numbers into a document without a reindex operation?
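The workaround described in this comment could look roughly like the sketch below, which builds the newline-delimited body of a bulk request where each update carries a script that skips changes below a threshold. The field name `popularity`, the index name, the helper `build_bulk_body`, and the Painless source are all assumptions made for illustration; the original script is not shown in the thread. The no-op value `'none'` is the one documented for update scripts around Elasticsearch 6.x (newer documentation shows `'noop'`), and clusters of that era may also expect a `_type` in the action line.

```python
import json

# Hypothetical Painless script: skip the write when the stored value is
# already within min_delta of the new one, otherwise overwrite it.
NOOP_SCRIPT = (
    "if (ctx._source.popularity != null && "
    "Math.abs(ctx._source.popularity - params.value) < params.min_delta) "
    "{ ctx.op = 'none' } else { ctx._source.popularity = params.value }"
)


def build_bulk_body(index, scores, min_delta):
    """Return an NDJSON bulk body with one scripted update per document.

    scores maps document id -> new popularity value.
    """
    lines = []
    for doc_id, value in scores.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: the script and its parameters.
        lines.append(json.dumps({
            "script": {
                "lang": "painless",
                "source": NOOP_SCRIPT,
                "params": {"value": value, "min_delta": min_delta},
            }
        }))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

A larger `min_delta` means fewer documents are deleted and reindexed per weekly push, at the cost of staler popularity values.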

elasticmachine commented 6 years ago

Pinging @elastic/es-core-infra

jpountz commented 6 years ago

We haven't exposed updateable doc values in Elasticsearch because they provide a trade-off that is hard to reason about. For instance, say you update a single value of a single document: the next refresh will need to rewrite doc values for the entire segment that contains this document. If this were exposed, chances are that such updateable fields would be used for things like view counters, and I wouldn't be surprised if, for some users, doc values for all segments needed to be rewritten on every refresh, which would certainly cause write performance and scalability issues. I'm not saying we shouldn't do it at all, but it would at least require careful documentation that it isn't like an in-place update. There are other options that could be considered as well, like storing this data in some side-car data structure so that it doesn't necessarily have to live by the rules of the Lucene index, such as the need to provide point-in-time snapshots.
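To make the cost argument concrete, here is a toy model (not Lucene code) that counts how many doc values entries would be rewritten if every segment containing at least one updated document had its values for that field rewritten in full. The function name, the flat global doc id numbering, and the segment sizes are assumptions for illustration only.

```python
from bisect import bisect_right
from itertools import accumulate


def rewritten_values(segment_sizes, updated_doc_ids):
    """Count entries rewritten when each touched segment is rewritten whole.

    segment_sizes: number of documents per segment, in order.
    updated_doc_ids: global doc ids in [0, sum(segment_sizes)).
    """
    # Cumulative upper bounds: segment i holds ids in [bounds[i-1], bounds[i]).
    bounds = list(accumulate(segment_sizes))
    # Map each updated doc to the index of the segment that contains it.
    touched = {bisect_right(bounds, doc_id) for doc_id in updated_doc_ids}
    return sum(segment_sizes[i] for i in touched)
```

In this model, a single update landing in a 5,000-document segment costs 5,000 rewritten entries, and a handful of updates spread across all segments costs as much as rewriting the whole field, which is the view-counter scenario described above.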

nik9000 commented 6 years ago

Thanks @jpountz for explaining so clearly why we don't expose updateable doc values! I imagine a side car that doesn't follow the Lucene visibility rules might actually work for this but might be pretty confusing as well. I'd love for us to have something here because it is an important feature for folks that are concerned with optimizing search relevance based on frequently changing signals, which feels like a thing we should support.

jpountz commented 6 years ago

Yeah I'm unhappy that some users resort to using things like parent/child to solve this problem, by storing the frequently-changing values in small documents that they later join at search time. It introduces other issues. I wish we provided something better. Let me try to summarize the options that I am aware of:

Use the _update API

Use updateable doc values

Make doc values support stacked updates, i.e., writes would only write a delta and things would be resolved on read

Side-car data
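As a rough illustration of the stacked-updates option listed above, the sketch below keeps an immutable base column plus a stack of small delta layers; a read checks the newest layer first and falls back to the base. This is a toy in-memory model under the assumption that a "delta" means the set of changed entries; the class and method names are hypothetical and it says nothing about how Lucene would implement this.

```python
class StackedColumn:
    """Toy per-document numeric column with stacked delta layers."""

    def __init__(self, base):
        self.base = list(base)  # base values, indexed by doc id
        self.deltas = []        # delta layers (dicts), oldest first

    def write(self, updates):
        # A write only appends a small layer; the base is left untouched.
        self.deltas.append(dict(updates))

    def read(self, doc_id):
        # Resolve on read: the newest layer containing the doc wins.
        for layer in reversed(self.deltas):
            if doc_id in layer:
                return layer[doc_id]
        return self.base[doc_id]

    def compact(self):
        # Fold layers into the base, oldest first so newer values win.
        for layer in self.deltas:
            for doc_id, value in layer.items():
                self.base[doc_id] = value
        self.deltas.clear()
```

Writes stay cheap because they never touch the base, while reads get slower as layers pile up until a compaction (analogous to a merge) folds them back in.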

droberts195 commented 6 years ago

X-Pack Machine Learning currently does something very similar to the workaround in the original issue description when renormalizing anomaly scores. After renormalization we bulk index the result documents where the anomaly score has changed significantly, but leave existing results untouched where the change to the anomaly score is small.

So any changes that are made as a result of this issue could benefit ML too.

jpountz commented 6 years ago

We had a discussion about this feature, here are some notes:

Use-case:

Implementation:

Open questions:

elasticmachine commented 6 years ago

Pinging @elastic/es-distributed

owenericsson commented 5 years ago

Is there any update on this issue?

jpountz commented 5 years ago

No. We are keeping this idea in the back of our minds, but it is very complex to do it right, and it is not obvious whether it would actually help significantly. For instance we believe it still wouldn't be good enough to index counters.

nirmalc commented 1 year ago

Sorry for bumping an old thread. Adding three low-volume use cases for which I had to either use parent/child or do bulk updates.