Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0
600 stars 171 forks source link

Use of cassandra-lucene-index on high cardinality columns #320

Closed mohigup closed 7 years ago

mohigup commented 7 years ago

Issue Type: Question

Cassandra 3.10 cassandra-lucene-index 3.10

I have use case which requires choosing either cassandra-lucene-index or Elastic Search on Cassandra. I am more aligned towards the Stratio implementaiton as it doesn't require creating/managing ES clusters. However, the below scenario puts the solution implementation in difficult situation.

Cassandra lucene Index is created on high cardinality column e.g. colA in a key space Test (Keyspace has more than 10,000 records) and following query is executed using Cassandra-lucene-index. select * from Test where colA like “1%”

Now, the above query will require hitting all the nodes in the Cassandra cluster as the Cassandra Lucene indexes are local indexes.

So, How does query performance in such a scenario scale well with Cassandra-lucene-index compared to results returned from ElasticSearch Cluster with data feed from Cassandra as data in the Keyspace increases.

ealonsodb commented 7 years ago

Hi @mohigup:

For your use case you need to use a wildcard query.

You are totally right, cassandra-lucene-index is a local index. Data partitioning is performed by cassandra and depends just only on the partition key. So, for non partition-directed queries (i.e select * from table where gender='male'; being 'gender' not a primary key part), no one knows which nodes have the correct data. This forces you to ask all nodes.

This is the worst use case using cassandra-lucene-index, but it is improvable if you partition your table.

I don't know your exact use case but if you can partition the table the query will be much better(ask only the related nodes and better cluster usage).

Hope this helps

mohigup commented 7 years ago

Thanks @ealonsodb My Table Structure is similar to the example provided in the documentation on Tweets and has similar table structure and primary keys ( id INT PRIMARY KEY). The Records size increases every day. Once, this index is created, I will like to perform wild card queries on the index(use case).

So, If I understand correctly, performing wild card queries on such a index(see below) will not have good performance with cassandra-lucene-index if the table partitioning (using partition and clustering keys) cannot be changed further.

CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         id: {type: "integer"},
         user: {type: "string"},
         body: {type: "text", analyzer: "english"},
         time: {type: "date", pattern: "yyyy/MM/dd"},
         place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
      }
   }'
};

Also, does the current implementation support the following features ?:

ealonsodb commented 7 years ago

Hi @mohigup

You will have this "problem" with elastic also, elastic is also a distributed cluster where data resides on nodes. Data partition(shards) in elastic is configurable via shards. You choose your shards to fit your query but if you have two or three different queries, convenient shards for all of them could be incompatibles.

1.- You create one index per table with several fields, you can combine several queries to different fields in the same index.

2.- Cassandra collections and UDTs are fully supported.

We always encourage users to test every potential solution performance.

Hope this helps

mohigup commented 7 years ago

Hi @ealonsodb

Thank Your Reply. I wanted to have a single query across multiple index's to get the results in the Stratio solution. I see its not implemented yet in cassandra-lucene-index. Let me know if I missed anything.

ealonsodb commented 7 years ago

Escuse me @mohigup. Do you mean you want to query against more than one table in one query?

mohigup commented 7 years ago

Yes, we wanted to query against multiple index's (these index's are on diff tables). I think querying all in single request will be faster than sending separate request and then merging the results. Let me know.

ealonsodb commented 7 years ago

Hi @ mohigup:

When querying, combining data from multiple index(and indeed, multiple tables) is called a join function. The join function belong to relational databases world.

Cassandra is not a relational database, it is not SQL. Indeed, it is NoSQL. Cassandra avoids any form of relation from those databases in favour of scalability. Instead, it is based in denormalization.

Its architecture is based on top of this idea. This is not going to change. So, IMHO, performing join operations from different tables is cassandra is not going to happen.

If you show us your complete use case(keyspace, table, index creation queries and every select), maybe, if posible, we can help you to fit it to just use one table.

Hope this helps

mohigup commented 7 years ago

Hi @ealonsodb

Thank You for helping out. We have decided to spend more time researching all options before deciding a solution for our use case.