Technical Details of this Project

juv commented 8 years ago

Hi,

I am interested in some technical details of this project. Are there blogs, papers or the like that describe this project in more depth than the documentation? I feel the current documentation is already great, but it could cover more of the technical aspects.

Specifically, I am interested in the following points:

1. Will the index be held entirely in memory? What if it gets too big? Is there a rule of thumb for the volume of data at which the index gets too big (obviously index size varies with the number of indexed fields on a table, but still...)?
2. The index data is held locally on the node where it was indexed. Why does the index, or something like an index summary, not get distributed? IIRC, this was a limitation of the secondary index API. Am I wrong in thinking that hitting all nodes for a query on an index could decrease performance a lot on a big cluster?
3. Which Cassandra parameters have the biggest impact on the performance of the index? Probably the heap space?
4. How many Lucene documents get created on each node? One per partition? One per table?
5. Does the lookup in the Lucene document make use of bloom filters or similar strategies to quickly determine whether a queried value is contained within the indexed data? This might be more related to Lucene itself; any blogs or papers about this are appreciated if you have them at hand. I have to admit that I am unfamiliar with Lucene itself, so this might be a kinda stupid question :)
6. How are range queries resolved internally? For example, when I query a timestamp range of 5 minutes within one partition, will the index return two concrete pointers within the partition that will then be read and evaluated by Cassandra?
7. Is there a way to visualize the index documents to see their structure?

Thanks!

adelapena commented 8 years ago

Hi @juvlarN,

You can get some additional information in these talks:

About your specific questions:

1. The Lucene index is stored on disk, and it does not need to fit in memory. You can store up to 2^31-1 documents per node per index. Some data structures and caches are stored on the Java heap, but most in-memory caches are off-heap.
2. Cassandra secondary indexes are local indexes by design. This design is fairly common in search engines, and you can find the same approach in Solr and ElasticSearch. As a common alternative to hitting all nodes, you can direct searches to specific Cassandra partitions, which is very useful with well-partitioned data. Also, index locality is an advantage when using MapReduce frameworks such as Spark. Most deployments on large clusters should use one of these two approaches.
3. Of course heap is relevant. Virtual nodes are also important, and you will get better performance by disabling them in analytics applications.
4. One document per CQL logical row.
5. Lucene search strategies are quite complicated. You could start by reading the Lucene documentation. The ElasticSearch documentation also often contains relevant information about Lucene.
6. Our index uses Lucene to index the primary keys of the Cassandra rows. User queries are translated to Lucene queries that retrieve an iterator (cursor) over these primary keys, and these keys are then used to retrieve the row data from the Cassandra storage engine. Specifically, range queries are translated to a Lucene NumericRangeQuery, DocValuesRangeQuery, or TermRangeQuery, depending on the data type and query arguments.
7. You can use Luke, the Lucene index browser.
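
The points about local indexes and range queries can be illustrated with a small toy model in Python. This is only a sketch under loud assumptions, not the plugin's code: `ToyNode`, `ToyCluster`, the sorted-list index and the modulo placement in `owner` are all hypothetical stand-ins (for a Lucene index and for Cassandra's token-based placement), and replication, updates and deletes are ignored. It shows each node indexing only its own rows, a range search yielding primary keys that are then used to read rows from storage, and the difference between searching every node and searching a single partition.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple

PrimaryKey = Tuple[str, int]  # (partition key, clustering key)


class ToyNode:
    """Hypothetical node: row storage plus a local index on one numeric column."""

    def __init__(self) -> None:
        self.rows: Dict[PrimaryKey, dict] = {}
        # Sorted (value, primary key) pairs; a stand-in for a Lucene index.
        self.index: List[Tuple[int, PrimaryKey]] = []

    def write(self, key: PrimaryKey, ts: int, payload: str) -> None:
        # One index entry per logical row. Updates and deletes are not handled.
        self.rows[key] = {"ts": ts, "payload": payload}
        bisect.insort(self.index, (ts, key))

    def search_range(self, lo: int, hi: int) -> Iterator[PrimaryKey]:
        """Yield primary keys whose indexed value lies in [lo, hi]."""
        start = bisect.bisect_left(self.index, (lo,))
        for ts, key in self.index[start:]:
            if ts > hi:
                break
            yield key

    def query(self, lo: int, hi: int, partition: Optional[str] = None) -> List[dict]:
        # The index only returns keys; row data is read from storage afterwards.
        keys = self.search_range(lo, hi)
        if partition is not None:
            keys = (k for k in keys if k[0] == partition)
        return [self.rows[k] for k in keys]


class ToyCluster:
    """Hypothetical cluster without replication."""

    def __init__(self, size: int) -> None:
        self.nodes = [ToyNode() for _ in range(size)]

    def owner(self, partition: str) -> ToyNode:
        # Assumed placement rule for the sketch, not Cassandra's partitioner.
        return self.nodes[sum(partition.encode()) % len(self.nodes)]

    def write(self, partition: str, clustering: int, ts: int, payload: str) -> None:
        self.owner(partition).write((partition, clustering), ts, payload)

    def search_all(self, lo: int, hi: int) -> List[dict]:
        # Scatter-gather: every node consults its own local index.
        return [row for node in self.nodes for row in node.query(lo, hi)]

    def search_partition(self, partition: str, lo: int, hi: int) -> List[dict]:
        # Partition-directed: only the owning node is contacted.
        return self.owner(partition).query(lo, hi, partition)


cluster = ToyCluster(3)
for clustering, ts in enumerate([100, 200, 300], start=1):
    cluster.write("a", clustering, ts, f"a{clustering}")
for clustering, ts in enumerate([150, 250], start=1):
    cluster.write("b", clustering, ts, f"b{clustering}")

print([r["payload"] for r in cluster.search_partition("a", 150, 300)])  # ['a2', 'a3']
print(sorted(r["payload"] for r in cluster.search_all(100, 200)))  # ['a1', 'a2', 'b1']
```

In this model a range query does not return two boundary pointers into a partition. It returns a stream of matching primary keys, and the cost of `search_all` grows with the number of nodes while `search_partition` touches one node only.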

I hope this helps.

juv commented 7 years ago

Hi @adelapena,

This answer is great. Thanks! As all of my questions have been answered, I'm closing this.

Stratio / cassandra-lucene-index

Technical Details of this Project #191