elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

"block until refresh" indexing option #1063

Closed caravone closed 8 years ago

caravone commented 13 years ago

Feature request: Provide an option to the index operation that will wait until the next scheduled refresh occurs before returning a response. After the response is returned, all documents indexed in that operation should be visible for search.

ghost commented 13 years ago

Thanks caravone for entering the issue.

This is discussed here: http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/4b882c79671c6e2c/e74e3fc1718b6bf6?lnk=gst&q=visible#e74e3fc1718b6bf6

Btw, we can work around the issue by waiting > 1000ms after an index op before doing a query. But having this at the API level will be great.

ghost commented 12 years ago

Has this issue been prioritized? This would be really great and would allow ES to be used in many more use cases. Currently, ES cannot be used to fetch 'screen' data because of the NRT aspects. As you mentioned, Shane, a flag that blocks the index operation until the changes have been made visible by the next refresh would do what we need, and you indicated that the complexity of implementing this would be manageable.

Thanks again for this great product!

NateHark commented 12 years ago

We could definitely benefit from this feature as well. NRT works great for 95% of our use cases, but we still have a couple of use cases that would be simpler if this feature was available. Thanks!

ghost commented 12 years ago

We are simulating this behavior by calling BulkRequestBuilder.setRefresh(true). That's fine while we are still in development... but this won't scale in production.

BulkRequestBuilder.setBlockUntilRefresh(true) would be a great addition to the request API.

Is there an easy way to have it block until the replicas have also been updated? The next query could be routed to a replica instead of the master and needs to behave the same if blocking has been requested.

Thanks again Shay!

ghost commented 12 years ago

Hello Shay!

We are getting closer to a beta release and this feature would really make a big difference... and I believe that ES would gain in its ability to become more and more the primary datastore for many applications if we could optionally block on indexing until refresh.

We understand that this behaviour will be used only where needed, as it will introduce a mean latency of 500ms for the call. But this would be really helpful in many use cases.

Thanks for the great product. It's been really living up to our expectations.

Remy

ghost commented 12 years ago

Me again! ;-)

While we are backed by a relational database for security and for its transactional support, we rely almost entirely on ES for all of our queries. Not just for searching things... all our relations between items, all our ACLs and security are done through queries to ES.

For some use cases, it would be just great to have index-then-block-until-data-is-visible-for-search... instead of forcing a refresh.

BTW, when doing BulkRequestBuilder.setRefresh(true), does this also force a refresh on the replicas?

Thanks!

missinglink commented 11 years ago

+1

richardwalsh commented 10 years ago

+1

fresheneesz commented 10 years ago

+1

teuneboon commented 10 years ago

+1

onnomarsman commented 10 years ago

+1

stabenfeldt commented 10 years ago

+1

nessup commented 10 years ago

+1

pentium10 commented 10 years ago

+1

mikend commented 10 years ago

+1

kevin-montrose commented 10 years ago

+1

LiquidMark commented 9 years ago

Hoping this shows up soon. We use the 'refresh' option on all of our index operations. Fine for development and testing, but I'm concerned about the performance impact in production. Blocking until the next scheduled refresh has finished would solve our problem!

NTCoding commented 9 years ago

This is killing us in production. We need to calculate facets on the indexed documents, but our servers cannot handle many refreshes per second.

I'm looking for some way to get a notification of when the items have been acknowledged so I can wait until then and calculate facets.

So another +1 :)

clintongormley commented 9 years ago

From https://github.com/elasticsearch/elasticsearch/issues/7354#issuecomment-52814558

Yeah ControlledRTReopenThread was designed for exactly this situation; it's basically the same as your 3rd option (Delay the update of the view in the UI...): it delays the rare requests that must see the latest changes while allowing the normal (hopefully vast majority) of requests to just use the last refreshed reader.

ghost commented 9 years ago

Unfortunately we have faced this problem too. So +1 :)

satazor commented 9 years ago

+1

clintongormley commented 9 years ago

A problem with this solution is that we could have potentially thousands of requests blocking until a refresh happens, all of which suddenly return in the same instant. All of those requests will use up RAM, and the sudden rush of responses might saturate network bandwidth.

I'm not convinced that the suggested solution here is practical. Better to take an "eventually consistent" approach, eg:

fresheneesz commented 9 years ago

@clintongormley Your concern sounds rather far-fetched. All the request has to do is close. You're really saying that thousands of TCP closes would "saturate network bandwidth"? I don't think that would even happen if your server's primary connection to the db is over 56k. And how much RAM and resources do you think it takes for something to continuously poll? WAY more.

clintongormley commented 9 years ago

@fresheneesz that's an interesting idea - I was thinking of only sending the responses once the refresh happened, as opposed to just holding back enough to stop the request from completing. It wouldn't be sufficient to just hold the connection open as it uses keepalive anyway.

joar commented 9 years ago

I'm +1 on this, as long as it implies that you may leave the request hanging until the index is refreshed (i.e. the document becomes available).

jannemann commented 9 years ago

+1

mrkamel commented 9 years ago

+1

bleskes commented 9 years ago

We discussed it and I think the concern about memory is valid, but it's one we already have. During master election we accumulate indexing requests as well (up to 1m). There was also the idea of adding a timeout to the refresh wait (but not the indexing).

javanna commented 9 years ago

I think people might be currently using the refresh option on the index api to achieve the same result (finding the document via _search straight away), which is scary as it's not feasible to refresh for each single index operation. Having the option to wait for refresh would be much better, I think, as it wouldn't cause a refresh but just wait for the next one, and it could potentially replace the ability to refresh on indexing completely, or just be its lightweight alternative.

spudbean commented 9 years ago

An alternate approach to "block until refresh" is "return a token that I can use to force consistency on my next query". On a PUT, ES could return some opaque consistency token. On subsequent queries, I can pass a "be consistent with" parameter and give it the token from my last PUT (or several such tokens). ES will block until the "refresh" for that token has occurred (it may have already occurred!) and then execute the query.

Note that this blocking can be done per-shard: some shards may already be up to date wrt the token when the request comes in, and only some may need to block. You could even imagine some future optimisation where requests are routed to shard-replicas that are more likely to be up-to-date wrt a particular token.

So we change from "block a PUT's return until refresh" to "block a QUERY's start until refresh". This allows for better pipelining of activities in the system as a whole.

(EDIT: it is worth noting that if you can implement "block a QUERY's start", then you could easily implement "block a PUT's return" on top of that within ES, or within client code.)
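
A minimal sketch of the token idea, as a toy in-memory model rather than anything Elasticsearch actually exposes. The `Shard` class, its sequence numbers, the token shape (a map of shard id to sequence number) and the `query` helper are all assumptions made for illustration:

```python
import threading
from typing import Dict


class Shard:
    """Toy shard: each write gets a sequence number; refresh makes writes searchable."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._written = 0  # sequence number of the latest write
        self._visible = 0  # highest sequence number covered by a refresh

    def index(self) -> int:
        """Record a write and return its sequence number (the per-shard token)."""
        with self._cond:
            self._written += 1
            return self._written

    def refresh(self) -> None:
        """Make every write so far visible and wake up any blocked queries."""
        with self._cond:
            self._visible = self._written
            self._cond.notify_all()

    def wait_for(self, seq: int, timeout_s: float) -> bool:
        """Block until `seq` is visible; False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._visible >= seq, timeout=timeout_s)


def query(shards: Dict[int, Shard], token: Dict[int, int], timeout_s: float = 1.0) -> bool:
    """Return True once every shard named in the token has caught up.

    Shards that are already up to date return immediately; only lagging
    shards block. A real query would run the search after this check.
    """
    return all(shards[sid].wait_for(seq, timeout_s) for sid, seq in token.items())
```

In this model a PUT never blocks; only a query that carries a token for a not-yet-refreshed shard waits, and it gives up after `timeout_s` so the caller can decide what to do.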

nik9000 commented 9 years ago

I think the right way to do this is to allow waiting on some change to become visible but advise people to use a very very short timeout. If you keep the timeout to a couple of hundred milliseconds then you're unlikely to consume a ton of RAM or bunch requests too tightly. And for folks that have stupid amounts of traffic they can just not use the feature or set the timeout even lower. Or to 0. And for folks that have 10 queries per second, well, they can set the timeout to 10 seconds no problem.

The big thing here is that everyone will have to be able to handle the timeout outside of Elasticsearch - in the client for most sites and at the user for the largest sites.

fresheneesz commented 9 years ago

Our use case had almost nothing to do with user-related queries - we just needed to do it once, every once in a while, for the entire application. It would also be indispensable for unit testing, where you need to make sure data's been saved and is accessible before re-accessing it.

roytmana commented 9 years ago

+1

nik9000 commented 9 years ago

> It would also be indispensable for unit testing

Just issue a refresh call then. It's how Elasticsearch's tests do it.

Funbit commented 9 years ago

+1

glammers1 commented 8 years ago

+1

sandermarechal commented 8 years ago

Is there still no solution to this after 4 years? I'm facing the exact same thing. User adds some item and is redirected to the list. The list comes from ES and should show the new item. The solution proposed in #7354 sounds perfect for this.

Funbit commented 8 years ago

+1

haroldo-ok commented 8 years ago

+1

agdevbridge commented 8 years ago

+1

apepper commented 8 years ago

+1

nik9000 commented 8 years ago

Replying to my own comment:

> I think the right way to do this is to allow waiting on some change to become visible but advise people to use a very very short timeout.

Right now, and as far as I can tell in any non-ancient version of Elasticsearch, this exists in the non-realtime version of the get api. The "timeout" is always 0, which isn't convenient, but it gets the job done and it's how I'd advise folks to implement this if they had lots of this kind of traffic, to prevent thundering herd issues anyway.

The way you do it is to store the version returned after you updated the document and the version on which you based your update. Then you poll Elasticsearch, say every 100ms, with the non-realtime flavor of the get api. At this point there are a few things that can happen:

  1. Elasticsearch returns an old version - it hasn't been refreshed yet so keep polling unless you've timed out.
  2. Elasticsearch returns the version you asked for. You are done - it is ready for search.
  3. Elasticsearch returns a newer version than you asked for. You are done but you might want to let the user know that someone else has edited the document in the meantime. Hopefully you are using optimistic concurrency control so the extra edit didn't stomp your user's edit.
  4. Elasticsearch returns a version earlier than the version on which you based your update. The document was probably deleted and recreated while you were waiting. This won't always be possible to detect in all cases so the timeout in number 1 is important.

The create case is just like the update case but without the old version.

The delete case requires that you poll, waiting for non-realtime get to find nothing or a version higher than the version that the delete api returned. If your polling really missed the mark it's possible for the version number to go backwards (you delete, delete is refreshed, all copies of the document are flushed (out of the translog, I think), and someone recreates). This should be pretty rare and it is a case you can handle by telling the user that someone recreated the document after their delete.

This isn't perfect, and it's quite a bit of work to do, but it gets the job done. I've put together a gist that demonstrates it. I think I've covered all the cases but I wouldn't be surprised if I missed something.
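
The update and create cases above can be sketched roughly as follows. This is not the gist mentioned above; `fetch_version` is a hypothetical callable standing in for a non-realtime get that returns the visible version (or `None` if the document is not found), and the `Outcome` names and default timings are made up for illustration. The delete case is left out.

```python
import time
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    VISIBLE = "visible"        # case 2: our version is ready for search
    SUPERSEDED = "superseded"  # case 3: someone else's newer version is visible
    RECREATED = "recreated"    # case 4: version went backwards past our base
    TIMED_OUT = "timed_out"    # case 1 kept happening until the deadline


def wait_until_visible(
    fetch_version: Callable[[], Optional[int]],  # hypothetical non-realtime get
    base_version: Optional[int],                 # None for the create case
    new_version: int,
    timeout_s: float = 2.0,
    interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Outcome:
    deadline = clock() + timeout_s
    while True:
        seen = fetch_version()
        if seen is not None:
            if seen == new_version:
                return Outcome.VISIBLE
            if seen > new_version:
                return Outcome.SUPERSEDED
            if base_version is not None and seen < base_version:
                return Outcome.RECREATED
        # Old version, or not found yet in the create case: keep polling
        # unless we have run out of time.
        if clock() >= deadline:
            return Outcome.TIMED_OUT
        sleep(interval_s)
```

The caller keeps `base_version` (the version the edit was based on) and `new_version` (the version the update returned), and decides what to tell the user for each outcome.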

clintongormley commented 8 years ago

@nik9000 ++

Thinking that we should add this to the docs.

nik9000 commented 8 years ago

> Thinking that we should add this to the docs.

Totally. Is it enough to close the issue, do you think? I get the sense we could make this easier, maybe even adding an es-side timeout.

nik9000 commented 8 years ago

I'll add this to docs, btw.

nharraud commented 8 years ago

@nik9000 I had proposed a similar solution some months ago. http://stackoverflow.com/questions/28111805/elasticsearch-block-until-refresh-wait-for-doc-to-be-searchable-alternatives I don't think you missed any point. I have been using this for some time and had no problem (apart from using more resources)

Funbit commented 8 years ago

@nik9000 How can the proposed solution be used with bulk updates? Obviously we can't check a single document version, because there are many documents, each with its own version, updated asynchronously..?

nik9000 commented 8 years ago

> @nik9000 I had proposed a similar solution some months ago. http://stackoverflow.com/questions/28111805/elasticsearch-block-until-refresh-wait-for-doc-to-be-searchable-alternatives I don't think you missed any point. I have been using this for some time and had no problem (apart from using more resources)

Nice! I don't know if the realtime=true is required in the update case because if the document isn't found at all then it's been deleted. In the create case that makes sense though.

I wonder how this works in the presence of replicas too - all this does is validate that the document is visible on the one replica you hit with the request. Replicas are refreshed asynchronously so I don't know that this is good enough - just better.

> @nik9000 How can the proposed solution be used with bulk updates? Obviously we can't check a single document version, because there are many documents, each with its own version, updated asynchronously..?

I suppose you could just run the process once per operation in the bulk request, checking with mget, but that is a lot of state to maintain for large bulk sizes. My understanding of this use case is that it is to make sure that users see their edits after they make them, so I wonder if it's important to be able to scale to thousands of edits at a time.
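
A rough sketch of what that per-operation tracking might look like for index and update operations only. `mget_versions` is a hypothetical callable standing in for a non-realtime mget that maps each requested id to its visible version (or `None`); deletes and the "version went backwards" case are ignored here:

```python
import time
from typing import Callable, Dict, List, Optional


def wait_for_bulk(
    expected: Dict[str, int],  # doc id -> version returned by the bulk request
    mget_versions: Callable[[List[str]], Dict[str, Optional[int]]],  # hypothetical
    timeout_s: float = 2.0,
    interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, int]:
    """Poll until every document is visible; return the ones that still are not."""
    pending = dict(expected)
    deadline = clock() + timeout_s
    while pending:
        seen = mget_versions(list(pending))
        # Keep only the documents whose visible version is missing or too old.
        pending = {
            doc_id: version
            for doc_id, version in pending.items()
            if seen.get(doc_id) is None or seen[doc_id] < version
        }
        if not pending or clock() >= deadline:
            break
        sleep(interval_s)
    return pending
```

The state kept is one entry per document in the bulk request, which is the cost mentioned above; each poll only asks about the documents still pending.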

@nharraud and @Funbit, can you describe what you are trying to do? Like, if it's not private or secret or anything. I'm not sure I'm keeping the right model of this use case in my head.

I wonder if we could do better somehow - send some kind of per shard consistency information that we could check at query time and bounce the query back to the client if the shard isn't ready.

I still think @nharraud's solution is worth adding to the documentation but I'm more and more sure that it isn't enough.

nharraud commented 8 years ago

@nik9000 Yes, it was for a creation scenario. You might be right about the replicas. My project is in the prototype stage with quite a good network and hardware, so it is always querying the same es node without any load balancing.

My use case is quite simple. I have a page where I display a filtered list of people applying or being invited to a project. I want the list of invitations to be updated when the project owner invites someone, so that he sees that his action succeeded. I just need to know that the document has been indexed and that I can update the list; I don't need to be informed if somebody else modified or deleted it in the meantime. The list must just reflect the new state of the index, not the one before the invitation was sent.

This limitation is the reason why, in the near future, I will use the main db for listing these invitations when no filter is set, instead of querying elasticsearch. It will be updated faster and won't poll es.

@clintongormley proposed a solution some comments above but I'm not a big fan of it. Complicates code without need. With this solution you need to either

I had also proposed to check shards state but the answer was no: #9395 I hope you will find better arguments than me ;)

nik9000 commented 8 years ago

OK! I think I have a nice solution. It's mostly @clintongormley's idea.

  1. Whenever we serve an index request we return the current elapsed time (not wall clock time, it can go backwards).
  2. We collect those times into a map on the response.
  3. We allow the client to throw the map back at us. If they do then we'll check that the last refresh for the shard included that time. If it doesn't we'll bounce the request back to the client.

I think this will work so long as we update the "last refreshed checked time" on the shard every time we check for a refresh, even if there was nothing to refresh. That will cover the cases when the last refresh was triggered at the same time as we record the time for the write operation and the refresh contained the operation.
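
A toy sketch of this proposal, not Elasticsearch code. The `Shard` class, the map shape (shard id to write time) and `search_allowed` are assumptions for illustration, and so is the strict comparison: a write recorded on the same clock tick as a refresh check is treated as not yet covered, so the request is bounced until the next check.

```python
import time
from typing import Callable, Dict


class Shard:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # elapsed (monotonic) time, never wall clock time
        self.last_refresh_check = clock()

    def index(self) -> float:
        """Apply a write (omitted here) and return the elapsed time recorded for it."""
        return self._clock()

    def check_refresh(self) -> None:
        """Run on every scheduled refresh check, even if there was nothing to refresh."""
        self.last_refresh_check = self._clock()

    def is_caught_up(self, write_time: float) -> bool:
        # Assumption: strict '>' so that a same-tick write waits for the next check.
        return self.last_refresh_check > write_time


def search_allowed(shards: Dict[int, Shard], write_times: Dict[int, float]) -> bool:
    """True if every shard in the client's map has refreshed past its write time.

    When this returns False the request would be bounced back to the client.
    """
    return all(shards[sid].is_caught_up(t) for sid, t in write_times.items())
```

Because `check_refresh` moves the timestamp forward even when nothing changed, a bounced request succeeds after the next check instead of waiting for another write to trigger a refresh.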