elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Changes API #1242

Open Vineeth-Mohan opened 13 years ago

Vineeth-Mohan commented 13 years ago

There should be an integration point between ES and external applications so that external applications can be notified of any document changes or updates that happen in ES.

CouchDB has a good implementation of this, and it would be great if ES could incorporate something similar or the same.

CouchDB change notification feature - http://guide.couchdb.org/draft/notifications.html

rufuspollock commented 13 years ago

Hi, I want to register a big +1 on this. With the versioning system now in place in ES, I imagine this should be possible and would make a lot of things much easier (from the simple, such as generating RSS/Atom feeds, to the more complex, such as syncing between distinct federated ES clusters).

Some questions for implementation:

kimchy commented 13 years ago

@rgrp: agreed on the need; versioning plays a part in this, but there is still a lot to be implemented to make this happen. A note on what you said regarding changes: I agree that there should be a _changes feed for an index, and one across the whole cluster. But what you noted was a _changes feed per type (/twitter/tweet - twitter is the index, and tweet is the type), and one per index (/twitter/).

Vineeth-Mohan commented 13 years ago

Dependent on issue #1077

derryx commented 13 years ago

I would prefer a solution where I can hook in and get informed by Elasticsearch about events rather than polling on a _changes URL.

Vineeth-Mohan commented 13 years ago

Hope this is similar to what you are looking for - http://guide.couchdb.org/draft/notifications.html#continuous

rufuspollock commented 13 years ago

@kimchy: thanks for the correction on terminology :-) and I appreciate this may not be straightforward (a big thank-you for all your great work so far).

@derryx (and @Vineeth-Mohan): agreed that one wants push rather than pull notifications, like continuous notifications in Couch. However, this may be harder to do with a Java-based backend than with an Erlang one, as in Erlang it's not really a problem to keep a permanent HTTP connection open with the client.

derryx commented 13 years ago

Tomcat has something similar for Ajax push to the browser; they call it "Comet" support because of the long "tail": http://tomcat.apache.org/tomcat-7.0-doc/aio.html#Comet_support

So it should be no problem to support this with Java.
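
(For illustration only: a Comet-style long poll can also be built on the standard Servlet 3.0 async API rather than Tomcat's proprietary Comet interface. The /_changes path and the publish() hook below are hypothetical; this is a sketch, not ES or Tomcat code.)

```java
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical servlet that parks client connections and pushes change events
// to them as they arrive (Comet-style long polling).
@WebServlet(urlPatterns = "/_changes", asyncSupported = true)
public class ChangesServlet extends HttpServlet {

    // Parked client connections waiting for the next change event.
    private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();   // keep the connection open
        ctx.setTimeout(30_000);                // client re-polls after 30s of silence
        waiting.add(ctx);
    }

    // Called by whatever component observes a change (hypothetical hook).
    public void publish(String changeJson) throws IOException {
        AsyncContext ctx;
        while ((ctx = waiting.poll()) != null) {
            ctx.getResponse().setContentType("application/json");
            ctx.getResponse().getWriter().write(changeJson);
            ctx.complete();                    // finish this long-poll cycle
        }
    }
}
```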

derryx commented 12 years ago

I have coded a plugin that provides change information. It is a first start and will be extended in the future. You can find it here: https://github.com/derryx/elasticsearch-changes-plugin

Vineeth-Mohan commented 12 years ago

@derryx - thanks a ton, man. This looks cool.

jprante commented 12 years ago

If you consider client connections to a _changes API for notifications, a performant, scalable alternative to Comet is WebSocket. It is already implemented in Netty, and Elasticsearch uses Netty :)
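
(For illustration, a minimal standalone WebSocket endpoint on Netty might look like the sketch below. It uses the Netty 4 API for readability; ES at the time bundled Netty 3.x, so treat this purely as a sketch, not as ES code.)

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.handler.codec.http.websocketx.TextWebSocketFrame;
import io.netty.handler.codec.http.websocketx.WebSocketServerProtocolHandler;

// Minimal standalone WebSocket server: clients connect to ws://host:9400/_changes
// and the server can push change events to them as text frames.
public class ChangesWebSocketServer {

    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup worker = new NioEventLoopGroup();
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, worker)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(
                            new HttpServerCodec(),                           // plain HTTP for the upgrade handshake
                            new HttpObjectAggregator(65536),
                            new WebSocketServerProtocolHandler("/_changes"), // handles the WebSocket upgrade
                            new SimpleChannelInboundHandler<TextWebSocketFrame>() {
                                @Override
                                protected void channelRead0(ChannelHandlerContext ctx, TextWebSocketFrame frame) {
                                    // A real changes feed would register ctx.channel() as a subscriber
                                    // and push change events to it; here we just acknowledge frames.
                                    ctx.writeAndFlush(new TextWebSocketFrame("ack: " + frame.text()));
                                }
                            });
                    }
                });
            bootstrap.bind(9400).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}
```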

derryx commented 12 years ago

The cool thing about WebSockets is that they are bidirectional. This is not needed here; a persistent HTTP connection is good enough. The current problems are more that the HTTP transport of ES does not support persistent connections, and how to get all the changes out of ES in the first place.

kimchy commented 12 years ago

@jprante the WebSockets part is cool, and could definitely be used as a way to stream changes, but the harder part is building the whole changes infrastructure...

jprante commented 12 years ago

One more thought. WebSocket can also be used with XMPP, and XMPP is a robust solution for a distributed notification infrastructure. So how about including a simple, lightweight WebSocket client in each ES node for sending notifications via XMPP? Maybe with the help of Atmosphere https://github.com/Atmosphere/atmosphere ? API docs for an example WebSocket pub/sub can be found here http://atmosphere.github.com/atmosphere/apidocs/org/atmosphere/samples/pubsub/WebSocketPubSub.html

augustine-tran commented 12 years ago

+1

slorber commented 12 years ago

+1

JohnnyMarnell commented 12 years ago

+1

adorr commented 11 years ago

+1

mbbx6spp commented 11 years ago

:+1:

otisg commented 11 years ago

+1 for @jprante's websocket idea: https://github.com/elasticsearch/elasticsearch/issues/1242#issuecomment-4974916

Spredzy commented 11 years ago

+1

slorber commented 11 years ago

Btw, just to understand: what are the benefits of using WebSockets? Isn't a "normal socket" enough?

Do you need to receive the notifications in the browser? Does this mean that your Elasticsearch HTTP port is open to anyone?

jprante commented 11 years ago

@slorber WebSocket is a transparent protocol extension of HTTP that upgrades HTTP into a "normal socket" over which you can communicate in an async/realtime, push-style mode instead of polling. You can serve both HTTP and WebSocket on one port, because clients send upgrade requests to switch the communication from HTTP to WebSocket.

Note, Websocket is part of HTML5 http://www.w3.org/TR/websockets/

In the browser you can use WebSocket from JavaScript very easily with something like `var socket = new WebSocket("ws://host:port/path");`, and you receive notifications via the onopen, onmessage, etc. callbacks.

Because WebSocket uses the same port as HTTP, your Elasticsearch HTTP port would behave no differently than it does today.

slorber commented 11 years ago

I understand that, but do you really want to receive change notifications in your JS stack? Does this mean the HTTP port of Elasticsearch has to be opened to the outside world? Or should one implement it server-side with Node.js? OK, I remember having seen a Java WebSocket client some time ago.

What I mean is: if the standard use case is to receive change updates on the server side, why do we need to use WebSocket instead of a non-HTML event-transmission technology?

jprante commented 11 years ago

ES has a transport protocol layer (a Java binary format), so change notifications could be implemented in Java fairly straightforwardly, for example by using a pub/sub technology (where WebSocket with Netty is also an option).

HTTP is meant for easily consuming ES requests and responses over REST, for languages/technologies that do not use the internal Java transport protocol. It is enabled by default, but is optional for ES. Upgrading HTTP to WebSocket would be a very easy way to help implement a change notification service that is also consumable from Ruby, Python, Perl, JavaScript, etc., just like with the native Java transport protocol. I think the ES API should follow this polyglot approach.

In most situations, production ES is placed in a private network / behind a firewall / reverse proxy / load balancer, so delivering services to the Web is out of scope for ES. This is also true for change notifications, but the communication mode becomes bidirectional. There should be external application logic that can process the raw ES change events in the requests and responses and disseminate them to the web. But, if you prefer, you can also pass external Web requests and responses transparently through to ES.

Can you be more specific about "non-HTML event transmission technology"? WebSocket is not an HTML technology; it's just a raw TCP/IP socket usable by web applications in bidirectional mode, and it was embraced by the W3C.

slorber commented 11 years ago

I think ES should follow the polyglot approach too.

Since ES is placed on a private network, I guess browsers won't consume that change stream, and I wonder if there isn't another polyglot event-transmission technology that could be more appropriate than WebSocket.

I don't know this stuff very well, but aren't AMQP, Thrift, Protobuf, and similar polyglot technologies also eligible for implementing this feature? Isn't there a non-HTML technology that solved this problem efficiently before WebSockets?

brusic commented 11 years ago

Thrift and Protobuf are more for message serialization than for app communication. There actually is a Thrift plugin for Elasticsearch. Most queuing systems rely on an additional application being installed and maintained.

The challenge in finding a solution is crafting one that supports every client (language) platform. Raw sockets are tough. WebSockets might be non-HTML, but I haven't seen any uses outside of browser communication. Then again, I haven't looked into it much.

jprante commented 11 years ago

@slorber It is very desirable to receive ES change notifications in the browser. Many ES programmers are active in web development; they live inside the browser, and that is very good. I love the Chrome Sense Elasticsearch plugin, for example. Think of dynamic updates with jQuery, AngularJS, and the like. You can easily set up transparent WebSocket proxies for routing change notification requests and responses.

AMQP is a message queue protocol. You may have noticed that ES already offers a RabbitMQ river. I can't see how extra message queues could be a base technology for ramping up ES change notification streams. It depends on the implementation, but I do not see how an extra message-queue system could keep up the performance when hundreds or thousands of ES nodes send notifications. Even the events of a single node may overwhelm external message queue systems. I think that, just to create and receive change notifications from ES, an extra message queue implementation is simply overhead. For consolidation, you already have the ES cluster model with the client node that waits for the responses to the requests it sent. The client should decide, via a parameter, whether changes should be received from the local node, from the nodes of a specific index, or from the nodes of the whole cluster.

There is already an ES Thrift plugin to replace the HTTP transport. Thrift is a data-type language for cross-language RPC services, like Protobuf and Avro. For language support, you would have to specify an RPC service for change notifications, and this would more or less substitute for the JSON and REST on the wire. In summary, HTTP, WebSocket, Thrift, Protobuf, and Avro are just transport technologies. They are interchangeable, so they should not dictate how ES change notifications are implemented. My point was that Netty HTTP is already in ES, and that's why Netty WebSocket is an interesting option. I already implemented WebSocket as an ES transport some months ago :)

yannnis commented 11 years ago

+1

ghost commented 11 years ago

Hello all!

I'm currently working on replacing Netty 3.6 in Elasticsearch with Netty 4.0. The goal is to first recreate all original functionality in ES's current HTTP implementation, then follow up by adding WebSockets (which is made far easier in 4.0).

A few things: 4.0 is not yet fully released. I don't intend on merging until:

A. everything I do has been tested
B. Netty 4.0 reaches stable

I may make a fork before merging.

I've fixed the following packages:

org.elasticsearch.bulk.udp
org.elasticsearch.common.bytes
org.elasticsearch.common.compress
org.elasticsearch.common.compress.lzf

To do:

org.elasticsearch.common.netty
org.elasticsearch.http.netty
org.elasticsearch.transport.netty

The packages completed so far needed only minor modifications. The ones up next are more difficult, so they will take me more time.

-Cris

ghost commented 11 years ago

Hello again!

I found a way to accomplish streaming, although it really only fits my use case. It requires that your data is being streamed by a river, and you're using a pub/sub system:

When you make an IndexRequest, you can specify a percolate field, and it will return the matching percolators. Exploiting this, I altered the CouchDB river (where my data is coming from) to include a percolate field in the IndexRequest, take the percolation results, and forward them to a pub/sub system (in my case, Redis). From there, it pipes into my Node.js instance that's using ES, which distributes it via WebSockets to the appropriate clients.

The advantage of this is that I didn't have to alter ES in any way to support it. I just made a very specialized adapter. The disadvantage is that the publishing system has to remove the percolator when nobody is subscribed to it, so I had to alter Redis to send a DELETE to the appropriate percolator when a channel has no more subscribers.
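
(A rough sketch of that adapter idea: it assumes the pre-1.0 index-time percolation API, where setPercolate/getMatches are my reading of that era's Java client and may differ by version, with Jedis standing in for the Redis publisher.)

```java
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;
import redis.clients.jedis.Jedis;

// Sketch of the river-side adapter described above: index a doc with index-time
// percolation enabled, then publish each matching percolator name to Redis so a
// downstream Node.js process can fan the event out over WebSockets.
public class PercolateToRedis {

    private final Client client;                           // ES client, assumed injected
    private final Jedis jedis = new Jedis("localhost", 6379);

    public PercolateToRedis(Client client) {
        this.client = client;
    }

    public void indexAndNotify(String index, String type, String id, String sourceJson) {
        IndexResponse response = client.prepareIndex(index, type, id)
            .setSource(sourceJson)
            .setPercolate("*")                 // pre-1.0 API: percolate against all registered queries
            .execute()
            .actionGet();

        // Each match is the name of a registered percolator (treated here as a subscription channel).
        for (String percolator : response.getMatches()) {
            jedis.publish("changes:" + percolator, sourceJson);
        }
    }
}
```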

This also doesn't solve the issue for most use cases, it just happens to fit perfectly into my infrastructure.

I have been trying to replace the Netty library on ES with a newer implementation of it, but I'm still not entirely sure how it will handle WebSockets. What is clear is that a lot of changes would have to be made to support it.

-Cris

matiwinnetou commented 10 years ago

Has anything progressed in this ticket since then?

Mpdreamz commented 10 years ago

An alternative to XMPP/pubsub/websockets could be to simply register webhooks

https://webhooks.pbworks.com/w/page/13385124/FrontPage

These would essentially be fire-and-forget HTTP POSTs from Elasticsearch's side back into the application.
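
(For illustration, a fire-and-forget webhook delivery is just an HTTP POST; a minimal sketch in plain Java, with a hypothetical payload and no ES integration implied.)

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Fire-and-forget webhook delivery: POST a change event to a registered URL
// and ignore the response body. Failures are swallowed here for brevity;
// a real implementation would retry or log.
public class WebhookNotifier {

    public static void notify(String webhookUrl, String changeJson) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(webhookUrl).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(changeJson.getBytes(StandardCharsets.UTF_8));
            }
            conn.getResponseCode();   // force the request out; ignore the result
            conn.disconnect();
        } catch (Exception e) {
            // fire and forget: drop the event (or queue it for retry)
        }
    }

    public static void main(String[] args) {
        notify("http://example.com/es-changes", "{\"_index\":\"twitter\",\"_type\":\"tweet\",\"_id\":\"1\"}");
    }
}
```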

azubizarreta commented 10 years ago

This feature would put Elasticsearch over the top. It would make it possible to create really scalable applications. Consider the following use case: a property portal where users can register search filters and be notified via email when new properties match their filters. Elasticsearch could notify via a webhook (http://dev.iron.io/worker/webhooks/), and a worker could fetch the document and send it via http://www.mailgun.com/.

Any plans for this feature?

jprante commented 10 years ago

A "web hook" is nothing but a HTTP POST endpoint, you can use web hooks with your favorite web client right away, just read the state of your ES cluster via API, and post it somewhere. There is no magic (and that is how Marvel plugin works for example).

Websockets are persistent, bidirectional channels that are "always on" - once connected, they should run forever. This is useful for streaming (bulk indexing, binary uploads) and monitoring apps to show "pushed" events. They use only one socket for request and response, even asynchronously. So they scale much better than request/response HTTP which requires two sockets.

A websocket transport plugin for experimentation is available at https://github.com/jprante/elasticsearch-transport-websocket

otisg commented 10 years ago

just read the state of your ES cluster via API, and post it somewhere. There is no magic (and that is how Marvel plugin works for example).

I thought Marvel listened for changes instead of polling an API... no?

kimchy commented 10 years ago

There is a lot of confusion here. @otisg Marvel doesn't listen to / poll for changes in data; it's about the stats API, which is different (pretty basic stuff).

Again, the question of a changes API is a deep one, and something that just listens to changes is not a good solution because it's not persisted.

azubizarreta commented 10 years ago

Reading is a polling architecture; I meant push notifications via HTTP POST. This would allow integration with third-party job schedulers like IronWorker or queue services like IronMQ.

otisg commented 10 years ago

@kimchy if Marvel polls the Stats API for events, doesn't it mean it can miss some events?

kimchy commented 10 years ago

@otisg no, because the stats APIs are aggregative, and again, the stats APIs are different from data events.

otisg commented 10 years ago

@kimchy right, I was referring to data events, not the stats API. A data events API doesn't exist, as far as I know, and if it did, it couldn't be aggregative because it would accumulate more and more events and blow up. On the other hand, if the API just shows the events that happen to be in flight when it is called, then Marvel will miss events that start and end in between Marvel's calls to this API. So how would one "listen" for data/cluster/node/etc. events in order not to miss any of them?

jasonkuhrt commented 10 years ago

Hey guys, I'm a senior developer at littleBits who spends a lot of his time in the browser as well as on the server. We're working on, and very excited about, deploying Elasticsearch for many projects, starting with our Cloud team.

I would like to point out that this is very true:

Many ES programmers are active in web development, they live inside the browser, and that is very good. I love the Chrome Sense Elasticsearch plugin for example. Think of dynamic updates with jQuery, AngularJS, and the like.

Except replace the mentioned technologies with ReactJS, Meteor, RxJS, BaconJS, etc.

When we see Elasticsearch advertised as real-time, we _assume_ WebSockets. The web community has been heavily exposed to https://www.meteor.com/, http://socket.io/, etc. If anything, WebSockets are just assumed for even basic apps now; real-time is the new default.

One way or another people are just going to expect this. We did, we couldn't, and we're a tad sad, but we were very glad that the issue is open and everyone agrees on the need.

It seems like direct WebSocket connections to Elasticsearch would be the most performant, but @cravergara, what does the non-trivial overhead you added feel like? There are also projects such as this now: https://github.com/jprante/elasticsearch-transport-websocket

We want to build apps that push analytics for a user's device in realtime to their chosen apps when they connect (so not persistent), and we want to monitor dozens of systems in realtime and debug using browser clients, which, we think, should be as fast as SSH'ing into a server and tailing a log. You might think it's crazy to expect log tailing and data visualization in an app to compare in performance, but my point is that with solid advances like ReactJS for state-of-the-art DOM performance and WebSockets for low-overhead push-based data, it's actually doable (thanks to the fact that Elasticsearch is an amazing real-time system to begin with, of course!).

Cheers.

jasonkuhrt commented 10 years ago

Also, perhaps relevant, rethinkDB has shipped what they call "changefeeds": http://rethinkdb.com/docs/changefeeds/

They view this feature as playing in the same spaces I just described above: http://rethinkdb.com/blog/rethinkdb-firebase-meetup/

If CouchDB was a reference point, perhaps a newer one like RethinkDB can give you some ideas, assuming any are needed.

jasonkuhrt commented 10 years ago

RethinkDB changefeed events are helpful but verbose, e.g.:

{
  'old_val': { 'id': 1, 'name': 'Slava', 'age': 31 },
  'new_val': { 'id': 1, 'name': 'Slava Renamed' }
}

Rather than including the full before/after, it would be nice if Elasticsearch query subscribers (clients) could, after getting an initial query state, receive only diffs to that query:

{
  "+": { "name": "Slava Renamed" },
  "-": [ "age" ]
}

If there were a use case for the RethinkDB style of change information, maybe it could be a mode/option. The diff style would provide the kind of performance that people will be expecting, no?

jprante commented 10 years ago

@jasonkuhrt computing diffs is an extra challenge because Elasticsearch overwrites previous versions of a doc when indexing or reindexing. An extra read of the doc for diffing would add a significant burden; it would turn all index operations into upserts (two operations, read-then-write, plus retries in case of version conflicts). Technically, a first step would be to manage a list of subscribers and notify them asynchronously with just the doc id and (optionally) the new doc source once Elasticsearch has performed the indexing successfully.
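
(To illustrate that first step, here is a purely hypothetical subscriber registry; none of these types exist in ES, it just shows the shape of the idea.)

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the "first step": keep a list of subscribers and notify
// them asynchronously with the doc id (and optionally the new source) after an
// index operation has succeeded. No diffing, no extra reads.
public class ChangeSubscribers {

    public interface ChangeListener {
        void onChange(String index, String type, String id, String sourceOrNull);
    }

    private final Set<ChangeListener> listeners = new CopyOnWriteArraySet<>();
    private final ExecutorService notifier = Executors.newSingleThreadExecutor();

    public void subscribe(ChangeListener listener)   { listeners.add(listener); }
    public void unsubscribe(ChangeListener listener) { listeners.remove(listener); }

    /** Called after a successful index operation; never blocks the indexing thread. */
    public void publish(final String index, final String type, final String id, final String source) {
        notifier.execute(new Runnable() {
            @Override
            public void run() {
                for (ChangeListener l : listeners) {
                    l.onChange(index, type, id, source);
                }
            }
        });
    }
}
```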

brusic commented 10 years ago

The comment I was referring to has since been edited, rendering my comment obsolete. :)

jasonkuhrt commented 10 years ago

Jörg, I think Jason (please correct me if I am wrong) is referring to changes in the query and not changes at the document level.

@brusic : ) Actually I was referring to changes at the document level. I hadn't even thought about changes at the query level because personally I typically think of clients knowing their own queries (of course) and not caring about other clients' queries. I'm curious though, do you have use-case ideas about changefeeds for queries?

@jprante Ah... OK. So in reality there are no savings from reducing the packet size here. Maybe this would apply to databases like http://www.datomic.com, where documents really are immutable and so such information is already available.

Technically a first step would be to manage a list of subscribers and notify them asynchronously with just the doc id and (optionally) with the new doc source when Elasticsearch has performed indexing successfully.

I would like to point out that "just the doc id" is akin to what Redis does (http://redis.io/topics/notifications), and we have found that to be a wholly lacking solution for apps. Now, yes, you could build something like RethinkDB's changefeeds on the basis of keyspace notifications, but... why? Extra I/O (network requests) for what should have been handled by Elasticsearch with a single message over a WebSocket. Maybe "just the doc id" will be enough for certain use cases, but applications are really going to need this:

with the new doc source when Elasticsearch has performed indexing successfully.

jprante commented 10 years ago

@jasonkuhrt The doc source might be huge, and subscribers might only be interested in certain fields; I think they would prefer to catch up on a series of doc ids before processing them, e.g. with multi get. Processing docs one by one is often a bottleneck. An idea is to push multi-get-compatible JSON back to the subscribers (maybe augmented with a server timestamp for when the doc change was recognized): http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-multi-get.html
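
(A sketch of the subscriber side of that idea: batch the incoming index/type/id coordinates and resolve them with one multi get. Method names follow the Java client API of that era and should be treated as assumptions.)

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.get.MultiGetItemResponse;
import org.elasticsearch.action.get.MultiGetRequestBuilder;
import org.elasticsearch.action.get.MultiGetResponse;
import org.elasticsearch.client.Client;

// Subscriber-side sketch: collect change notifications (index/type/id coordinates)
// for a short window, then resolve them in one multi get instead of one get per doc.
public class BatchingChangeConsumer {

    private final Client client;
    private final BlockingQueue<String[]> pending = new LinkedBlockingQueue<String[]>();

    public BatchingChangeConsumer(Client client) {
        this.client = client;
    }

    /** Called for each pushed change event (index, type, id). */
    public void onChange(String index, String type, String id) {
        pending.offer(new String[]{index, type, id});
    }

    /** Drains whatever arrived in the last interval and fetches the sources in one round trip. */
    public void flush() throws InterruptedException {
        MultiGetRequestBuilder mget = client.prepareMultiGet();
        int count = 0;
        String[] coord;
        while ((coord = pending.poll(10, TimeUnit.MILLISECONDS)) != null) {
            mget.add(coord[0], coord[1], coord[2]);
            count++;
        }
        if (count == 0) {
            return;
        }
        MultiGetResponse response = mget.execute().actionGet();
        for (MultiGetItemResponse item : response.getResponses()) {
            if (!item.isFailed()) {
                // process item.getResponse().getSourceAsString() ...
            }
        }
    }
}
```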

jasonkuhrt commented 10 years ago

@jprante That sounds fine, but complementary, not exclusive, to streaming changes.

I see it like this:

jasonkuhrt commented 10 years ago

@jprante FWIW Meteor went through various problems akin to this one: https://www.meteor.com/blog/2013/12/17/meteor-070-scalable-database-queries-using-mongodb-oplog-instead-of-poll-and-diff

jprante commented 10 years ago

@jasonkuhrt there are no trusted clients in ES; the only philosophy is that features should run OOTB without having a severe impact on node resources, which are limited. The ES node should index and search first, and if it can afford it, it may do other things, e.g. publish events to subscribed clients. The problem with providing the full source is memory. Take a handful of subscribers and you can bring a node to a halt with just some tens of thousands of docs during bulk indexing, when the subscribers can't keep up with the speed. Diffing sources makes this even worse, because an ES node would need at least double the space to do the diff, and the diff is an additional operation. This would hurt index and search performance.

From what I read about RethinkDB, I understand it can write ~1000-2000 docs/sec, and MongoDB, as per http://blog.mongohq.com/better-bulking-for-mongodb-2-6-and-beyond/, is at ~4000-5000 docs/sec. Elasticsearch is much faster. I have observed 15000-20000 docs per sec on a single node with a sustained bulk feed rate (with tiny docs, it is possible to go up to >50k).

So there are two problems: 1) minimizing additional memory pressure on the ES server side, and 2) handling slow-consuming subscribers, either by reducing the message size (to a single index/type/id coordinate), or by dropping messages, or by switching to a persistent, logfile-like backup structure for events, where I see Chronicle (https://github.com/peter-lawrey/Java-Chronicle) as a candidate. Whether subscribers are trusted or not, they must not be allowed to bring the node or the whole cluster down just because they are slow. And JavaScript, for example, is slow, really slow. Meteor uses WebSockets, so this is not really news to me - ES uses Netty, and Netty comes with WebSocket support, which is what intrigued me to get it working.
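
(To illustrate the message-dropping option for slow consumers, a bounded per-subscriber buffer in plain Java; purely a sketch, not ES code.)

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Per-subscriber bounded buffer: the indexing path only does a non-blocking offer,
// so a slow consumer can never block indexing; when the buffer is full, events are
// dropped (and counted) instead of piling up in the node's heap.
public class SubscriberBuffer {

    private final BlockingQueue<String> events = new ArrayBlockingQueue<String>(10_000);
    private final AtomicLong dropped = new AtomicLong();

    /** Called on the hot indexing path; never blocks. */
    public void offer(String changeEvent) {
        if (!events.offer(changeEvent)) {
            dropped.incrementAndGet();   // slow consumer: shed load instead of holding memory
        }
    }

    /** Called by the subscriber's own consumer thread. */
    public String take() throws InterruptedException {
        return events.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}
```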