Performance issue with count and filter (match)

Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra

Apache License 2.0

@leesangboo already asked for an efficient way to handle counts on large data sets with a Lucene filter without burdening the Cassandra coordinator. However, that issue is still open in the old, discontinued repo (see "how i can find count of 'lucene index' ?").

I'm facing the same problem and was wondering if the Stratio Lucene index could handle something similar to what Solr (DataStax Enterprise) does:

SELECT count(*) FROM test WHERE solr_query = '{"q":"*:*", "fq":"typename:XYZ"}' ;

The request is really fast as "DSE intercepts the query and brings back numDocs from DSE Search instead of actually performing the count in cassandra" (see Solr CQL Count).

Would it be possible with the lucene index to use something like:

SELECT count(*) FROM test WHERE expr(test_index, '{
   filter:  [{
      type: "count?!?!"
   },{
      type: "match",
      field: "typename",
      value: "XYZ"
   }]
}');

Thanks for your time!

Hi @janusd:

We know this could be a great feature: executing aggregation functions on the index. As you mentioned, this is possible with DSE Solr, but it is not as easy here, even though both use the same library, Apache Lucene core.

Solr in DSE is a distributed cluster of Lucene indexes. Data distribution among Solr nodes is chosen by Solr (one mapper -> one node), so when you execute a query against Solr, the cluster knows where to ask (like Cassandra with partition-directed queries).

Cassandra-lucene-index is one local index per Cassandra node, covering both that node's primary token range and the replicated data it holds. Cassandra-lucene-index does not determine the data distribution; Cassandra does. So, potentially, any query could match rows in any node's subset of the data, and every node has to be asked. Apart from that, data consistency and replication in Cassandra must be considered: the coordinator has to reconcile/repair the data itself, not only the aggregation results.
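To illustrate the replication point, here is a minimal sketch in Python. It is a toy model, not the plugin's or Cassandra's actual code: the node names, the row placement rule, and the replication factor of 2 are all assumptions made for illustration. It shows why a coordinator could not simply add up per-node index counts.

```python
# Toy model (hypothetical): 3 nodes, replication factor 2.
# Each row is stored on two nodes, so every node's local index
# also contains rows that are replicas of another node's data.
rows = {f"row{i}": ("XYZ" if i % 2 == 0 else "ABC") for i in range(10)}
nodes = ["node_a", "node_b", "node_c"]
replication_factor = 2

# Simplified placement: put each row on `replication_factor` consecutive nodes.
local_index = {node: {} for node in nodes}
for i, (key, typename) in enumerate(rows.items()):
    for r in range(replication_factor):
        local_index[nodes[(i + r) % len(nodes)]][key] = typename


def local_count(node, typename):
    """Count matching rows in a single node's local index."""
    return sum(1 for value in local_index[node].values() if value == typename)


# Naive approach: sum the counts reported by every node.
naive_total = sum(local_count(node, "XYZ") for node in nodes)

# Correct approach in this model: merge the matching keys, then count.
matching_keys = set()
for node in nodes:
    matching_keys.update(
        key for key, value in local_index[node].items() if value == "XYZ"
    )
true_total = len(matching_keys)

print(naive_total, true_total)
```

In this model the naive sum counts every matching row once per replica, so it reports twice the real number. Getting the right answer requires comparing the rows themselves across nodes, which is the reconciliation work the coordinator does today.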

Apart from that, cassandra-lucene-index does not distribute the queries; Cassandra does. So supporting this would require changes to the Cassandra project itself, in the communication logic between the coordinator and the nodes.

Hope this explains why it is not so easy.

Performance issue with count and filter (match) #283