Heliosearch / heliosearch

The next generation of open source search
http://heliosearch.org

Add new UpdateStream to the Streaming API #37

Open joel-bernstein opened 9 years ago

joel-bernstein commented 9 years ago

The UpdateStream will send updates to a SolrCloud collection. It will wrap a TupleStream and, as it iterates over that stream, send each Tuple to be indexed as a document in the collection. This will allow developers to build new data sets by combining and transforming TupleStreams.

Documents will be routed directly to the correct SolrCloud leader using techniques similar to those in CloudSolrServer. The documents themselves will be sent using ConcurrentUpdateSolrServer, so updates can be streamed rather than batched.

The UpdateStream can wrap any TupleStream, including custom TupleStreams that pull data from other sources such as RDBMSs or NoSQL engines. This provides a generalized streaming ETL framework.
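
To make the wrapping idea concrete, here is a minimal sketch in plain Java. It does not use the real Streaming API or SolrJ classes: `SimpleTupleStream`, `DocumentSink`, `ListTupleStream` and `UpdateStreamSketch` are hypothetical names invented for illustration, tuples are modeled as plain maps, and the end of the stream is signaled with `null`, which is an assumption of this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a TupleStream; the real interface may differ.
interface SimpleTupleStream {
    void open();
    Map<String, Object> read(); // returns null at end of stream (sketch-only convention)
    void close();
}

// Hypothetical stand-in for the indexing client (the role CloudSolrServer plays).
interface DocumentSink {
    void add(Map<String, Object> doc);
}

// In-memory source; a custom stream could pull rows from an RDBMS or NoSQL engine instead.
class ListTupleStream implements SimpleTupleStream {
    private final List<Map<String, Object>> tuples;
    private Iterator<Map<String, Object>> it = Collections.emptyIterator();

    ListTupleStream(List<Map<String, Object>> tuples) { this.tuples = tuples; }

    public void open() { it = tuples.iterator(); }
    public Map<String, Object> read() { return it.hasNext() ? it.next() : null; }
    public void close() { it = Collections.emptyIterator(); }
}

// Wraps any stream and forwards each tuple to the sink as a document.
class UpdateStreamSketch implements SimpleTupleStream {
    private final SimpleTupleStream source;
    private final DocumentSink sink;

    UpdateStreamSketch(SimpleTupleStream source, DocumentSink sink) {
        this.source = source;
        this.sink = sink;
    }

    public void open() { source.open(); }

    public Map<String, Object> read() {
        Map<String, Object> tuple = source.read();
        if (tuple != null) {
            sink.add(tuple); // index the tuple as a document
        }
        return tuple;
    }

    public void close() { source.close(); }
}

class UpdateStreamDemo {
    public static void main(String[] args) {
        List<Map<String, Object>> rows = Arrays.asList(
            Collections.<String, Object>singletonMap("id", "1"),
            Collections.<String, Object>singletonMap("id", "2"));
        List<Map<String, Object>> indexed = new ArrayList<>();

        SimpleTupleStream stream = new UpdateStreamSketch(new ListTupleStream(rows), indexed::add);
        stream.open();
        while (stream.read() != null) { /* drain the stream */ }
        stream.close();

        System.out.println("indexed " + indexed.size() + " documents");
    }
}
```

Because the wrapper implements the same interface as the stream it wraps, it can sit on top of any source stream, which is the property the ETL use case relies on. Leader routing and concurrent sending are left out of this sketch.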

joel-bernstein commented 9 years ago

Added an initial implementation to the helio_ustream branch: https://github.com/Heliosearch/heliosearch/commit/fdf85a0eb8ce399de39f6df5c99d839752dec5e1 It is not working yet, but it gives the basic idea. The initial code uses CloudSolrServer as the indexer.

The next step is to work on the Tuples returned from the read() method after each batch. These Tuples will report on the progress of the indexing. I think it makes sense to report the number of batches indexed, the number of batches in the queue, and the error count.
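
One way the per-batch status Tuples could look is sketched below in plain Java. Everything here is an assumption for illustration: the class name `BatchingUpdateSketch`, the field names `batchesIndexed` and `errorCount`, the use of maps as tuples, and `null` as the end-of-stream signal are not taken from the branch. The sketch sends each batch synchronously, so there is never a queue to report on; a version built on a concurrent client would add a queued-batch count.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: each read() indexes one batch and returns a status tuple.
class BatchingUpdateSketch {
    private final Iterator<Map<String, Object>> source;
    private final Consumer<List<Map<String, Object>>> indexer; // stands in for the Solr client
    private final int batchSize;
    private long batchesIndexed = 0;
    private long errorCount = 0;

    BatchingUpdateSketch(Iterator<Map<String, Object>> source,
                         Consumer<List<Map<String, Object>>> indexer,
                         int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.source = source;
        this.indexer = indexer;
        this.batchSize = batchSize;
    }

    // Returns null once the source is exhausted, otherwise the running totals so far.
    Map<String, Object> read() {
        if (!source.hasNext()) {
            return null;
        }
        // Collect up to batchSize documents from the wrapped source.
        List<Map<String, Object>> batch = new ArrayList<>();
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        try {
            indexer.accept(batch);
            batchesIndexed++;
        } catch (RuntimeException e) {
            errorCount++; // count the failed batch and keep going
        }
        Map<String, Object> status = new LinkedHashMap<>();
        status.put("batchesIndexed", batchesIndexed);
        status.put("errorCount", errorCount);
        return status;
    }
}
```

A caller would loop on `read()` until it returns `null` and could log or aggregate each status tuple along the way. Whether a failed batch should be counted and skipped, as here, or should abort the stream is a design choice this sketch does not settle.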