elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0
9 stars 990 forks

EsBolt for reading #296

Closed itaifrenkel closed 6 years ago

itaifrenkel commented 10 years ago

Reading from Elasticsearch is not only for spouts. One could use an ES bolt to enrich an input tuple with data stored in ES: for example, given an id, emit the document that matches that id. In its most generic form, the input tuple would be a valid ES JSON query, and the output would be the ES JSON result.
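
To make the "given an id, emit the document" case concrete, here is a minimal sketch of the HTTP request such a lookup might issue. It uses only the JDK's `java.net.http` API (Java 11+) and targets the Elasticsearch `GET /<index>/_doc/<id>` endpoint found in recent Elasticsearch versions. The class name `EsLookup`, the base URL, and the index name are assumptions for illustration; nothing here is part of elasticsearch-hadoop or Storm.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of elasticsearch-hadoop): builds the request
// an enrichment bolt could send to fetch one document by id.
final class EsLookup {

    // Builds GET {baseUrl}/{index}/_doc/{id}; baseUrl and index are assumed
    // to come from the topology configuration, id from the incoming tuple.
    static HttpRequest getById(String baseUrl, String index, String id) {
        String path = "/" + encode(index) + "/_doc/" + encode(id);
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Percent-encodes a single path segment; URLEncoder emits '+' for spaces,
    // which is only valid in query strings, so it is rewritten to %20.
    private static String encode(String segment) {
        return URLEncoder.encode(segment, StandardCharsets.UTF_8).replace("+", "%20");
    }
}
```

A bolt's `execute` method could pass this request to `HttpClient.send(request, HttpResponse.BodyHandlers.ofString())` and emit the response body as the output tuple, leaving JSON parsing to a downstream bolt.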

costin commented 10 years ago

Fair enough; however, that seems a specialized use case which, while sound, is not generic enough (in my opinion) to be included in the library. Parameterized / dynamic sources tend to be addressed (for performance and usability purposes) in a customized fashion as opposed to a declarative approach, which is why, out of the box, the library provides reading and writing and leaves ETL or other processing parts to the data pipeline/user.

itaifrenkel commented 10 years ago
  1. In most (if not all) cases Storm bolts are stateless and Storm spouts read from a queue. That means the bolts interact with various SQL and NoSQL products for session handling, data enrichment, and state management. We can start a thread on the Storm user mailing list if you would like more input. Our specific use case is two Storm topologies: one that runs and writes to ES (that's where the EsBolt write is needed), and another that queries the results of the first topology to enrich the data in real time. Both have spouts that read from various queues.
  2. The parameterization issue can be solved by specifying a hardcoded Elasticsearch template and populating its variables from a map in the tuple arriving in real time. Would that solve your other concern?
  3. Reading from EsSpout as if ES were a queue may be a tougher sell. The reason is that, unlike ETL (in which order doesn't matter), stream processing is sensitive to FIFO ordering and bulk reading from the queue, not to mention CAP theorem tradeoffs and more. I am not saying ES cannot be developed to beat Kafka and a Redis 3 cluster at being a queue with filters; I am saying it would be a tough sell.
  4. JSON parsing and generation take CPU cycles. Because the spout (or bolt) does both I/O and CPU-intensive work, you end up with a conflict: on one hand you need to match the number of bolts/spouts to the Elasticsearch deployment; on the other hand you need to increase it in order to perform high-throughput JSON parsing. That is why in Storm one would usually split those into separate bolts. Since I assume that you are using the HTTP transport underneath, I suggested exposing the JSON parsing to the next Storm bolt rather than always making it part of the spout.
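
As a rough illustration of point 2, the sketch below fills a fixed query template from a per-tuple parameter map. The class `QueryTemplate`, the `{{name}}` placeholder syntax, and the field name `user_id` are all assumptions made for this example. It also assumes every placeholder sits inside a JSON string literal in the template, so values are escaped as JSON string content. Elasticsearch's own server-side search templates are a separate mechanism and are not what is shown here.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a query template fixed at topology build time,
// rendered with values taken from each incoming tuple.
final class QueryTemplate {
    // Matches {{name}} where name consists of word characters.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    private final String template;

    QueryTemplate(String template) {
        this.template = template;
    }

    // Replaces each {{name}} with the JSON-escaped value of params.get(name).
    String render(Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("missing parameter: " + m.group(1));
            }
            // quoteReplacement keeps '\' and '$' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(escapeJson(value)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Escapes quotes, backslashes, and control characters so the value
    // cannot break out of the surrounding JSON string literal.
    private static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For instance, a template such as `{"query":{"term":{"user_id":"{{id}}"}}}` rendered with a map containing `id` would yield a complete query body for each tuple. Failing fast on a missing parameter is a design choice in this sketch, so that a malformed tuple does not silently produce a broader query.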
itaifrenkel commented 10 years ago

See also:
https://github.com/ptgoetz/storm-hbase/blob/master/src/main/java/org/apache/storm/hbase/bolt/HBaseLookupBolt.java
https://github.com/ptgoetz/storm-cassandra/blob/master/src/main/java/com/hmsonline/storm/cassandra/bolt/CassandraLookupBolt.java

itaifrenkel commented 9 years ago

See also the newly proposed JdbcLookupBolt: https://github.com/apache/storm/pull/374/files

jbaiera commented 6 years ago

With the addition of the high- and low-level Java REST clients, I am not seeing the benefit of building out a potentially complicated enrichment bolt for Storm/other integrations. These sorts of solutions are never quite generic enough, and are often used by a limited set of users. I am closing this for now, but would be fine with reopening it if there is enough community support. +1's on the initial issue post, please.