Table Insertion / Update : Index Performance

karankap commented 7 years ago

While going through the documentation, I came across the following statement: "Indexing is done in a synchronous fashion at the storage layer, so each row upsert implies a document upsert".

Does this mean that any insert / update to the Cassandra table will only return success if the index is also updated? Is this an atomic operation?

Also, there is an attribute "refresh_seconds" that can be provided during index creation. The documentation explains this as "number of seconds before auto-refreshing the index reader. It is the max time taken for writes to be searchable without forcing an index refresh. Defaults to '60'."

Does this mean that an update to an indexed field will not be searchable until the index reader has refreshed the index, which with the default value can take up to 60 seconds?

Don't these two statements contradict each other?

Kindly confirm the behaviour of the index.

jpgilaberte commented 7 years ago

Hi @karankap,

Thank you for your interest in the project. I'll try to answer you.

Yes, any insert / update operation on a Cassandra table will only return successfully if the index is updated. Yes, this is an atomic operation.

In cassandra-lucene-index we implemented a mechanism (provided by Lucene) to control changes made to the index. For the reading side (IndexReader) to become aware of the changes made by the writing side (IndexWriter), the index has to be reopened. 'refresh_seconds' is the maximum time that elapses between reopening operations, that is, the longest a write can stay unsearchable unless an index refresh is forced.

As you can see, these two concepts are not contradictory: one refers to the update / persistence of the index, and the other refers to when those updates become readable, which is a consequence of how Lucene works internally.
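To make the distinction concrete, here is a toy model in Python. It is only a sketch, not the plugin's code or Lucene's API: `ToyIndex`, its methods and the injected `clock` are hypothetical names, and the refresh is checked lazily at search time rather than by a background task. It shows a write that is accepted synchronously but only becomes searchable after the reader snapshot is reopened, either because `refresh_seconds` has elapsed or because a refresh is forced.

```python
import time


class ToyIndex:
    """Toy model: writes are applied synchronously, reads use a snapshot."""

    def __init__(self, refresh_seconds=60.0, clock=time.monotonic):
        self.refresh_seconds = refresh_seconds
        self._clock = clock
        self._written = {}    # what the writer has accepted
        self._snapshot = {}   # what the reader currently sees
        self._last_refresh = clock()

    def upsert(self, key, doc):
        # Returns only after the writer has accepted the document.
        self._written[key] = doc

    def refresh(self):
        # "Reopen" the reader so it sees everything written so far.
        self._snapshot = dict(self._written)
        self._last_refresh = self._clock()

    def search(self, key, force_refresh=False):
        # Refresh if forced, or if refresh_seconds have passed since the last one.
        due = self._clock() - self._last_refresh >= self.refresh_seconds
        if force_refresh or due:
            self.refresh()
        return self._snapshot.get(key)


# Usage with a fake clock so no real waiting is needed.
now = [0.0]
index = ToyIndex(refresh_seconds=60.0, clock=lambda: now[0])
index.upsert("row-1", "first version")
stale = index.search("row-1")                      # reader not reopened yet
fresh = index.search("row-1", force_refresh=True)  # forced reopen
```

In this model the write has already succeeded when `upsert` returns; only its visibility to searches is delayed.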

Hope this helps. Regards

karankap commented 7 years ago

Thanks @jpgilaberte-stratio. I have a few follow-up questions, if you could please confirm:

1. Is it possible to configure the index so that updates are reflected immediately? I know there is a parameter "refresh_seconds" which can be set as low as "1"; however, is there a way for the IndexReader to get updates in real time?
2. What is the impact of setting "refresh_seconds" to 1 second on the performance of read / write queries?
3. What is the significance of the "indexing_threads" attribute? The documentation says that it is the number of asynchronous indexing threads, by default the same as the number of processors available to the JVM (which will be at least 1).

Regards

jpgilaberte commented 7 years ago

These are very good questions @karankap.

1. You can force a refresh of the index, as explained here.
2. There is a small impact, but it is not a very heavy operation.
3. We use a parallel processing queue for writing / indexing requests. indexing_threads is the number of executor threads.
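As a rough picture of what "number of executor threads" means, the following Python sketch runs indexing tasks on a fixed-size pool. It is an illustration under assumptions, not the plugin's implementation: `index_rows` and `index_one` are hypothetical names, and the default simply mirrors the documented behaviour quoted above (processor count, at least 1).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def index_rows(rows, index_one, indexing_threads=None):
    """Run index_one(row) for every row on a fixed pool of worker threads."""
    # Default mirrors the quoted documentation: processor count, at least 1.
    workers = indexing_threads or max(1, os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every task and re-raises the first failure, if any.
        return list(pool.map(index_one, rows))


# Usage: a stand-in indexing function that just upper-cases the row.
indexed = index_rows(["a", "b", "c"], str.upper, indexing_threads=2)
```

A larger pool lets more rows be indexed at once, at the cost of more threads competing for CPU; the right value depends on the workload.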

Hope this helps. Regards

karankap commented 7 years ago

Thanks @jpgilaberte-stratio for the responses. These really help me understand the plugin better.

Further on the write use case, you mentioned that the operation is atomic. So when the write to the index fails, what happens underneath? Does Cassandra go into retry mode until the write to the index succeeds, or is the error thrown back to the caller / application? What happens to the commit log entry?

Could you please describe the write path / order that is followed? For example, on a document upsert the write path may be something like:

a) Update commit log
b) Update memtable
c) Update Lucene write buffer (sync)
d) Update Lucene index (async)

Additionally, what should the size of the write buffer for the Lucene index be? I was thinking of keeping it similar to the commit log size, but it would help if you could please confirm.

We are working on an application where we want to introduce this plugin, so we want to be aware of these details. If there is a detailed documentation link that has all this information, please point me there and I will have a look. The GitHub documentation, although quite elaborate, doesn't provide these details.

I really appreciate your help so far.

Thanks.

karankap commented 7 years ago

@jpgilaberte-stratio - could you please confirm the above queries?

jpgilaberte commented 7 years ago

@karankap - Index writing is a step of Cassandra's write flow (local to the node) and shares the mechanisms that guarantee the consistency of the data and the index (as if it were a native index). If an error occurs in the write flow, it is propagated to the client.

a) Update commit log
b) Update memtable
c) Update secondary index (native or not)
d) Update cache (if applicable)
e) Update Lucene IndexWriter
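The ordering and the error propagation described above can be sketched in Python as follows. This is a simplified illustration, not Cassandra's or the plugin's code: `apply_write`, `WriteError` and the step functions are hypothetical, and the sketch says nothing about rollback or commit log replay, which this thread does not cover.

```python
class WriteError(Exception):
    """Raised to the caller when any step of the local write path fails."""


def apply_write(row, steps):
    """Run each (name, step) in order; stop at and report the first failure."""
    completed = []
    for name, step in steps:
        try:
            step(row)
        except Exception as exc:
            # The failure is not swallowed: it is surfaced to the client.
            raise WriteError(f"write failed at step '{name}'") from exc
        completed.append(name)
    return completed


# Usage: placeholder steps named after the write path listed above.
def noop(row):
    pass


write_path = [
    ("commit log", noop),
    ("memtable", noop),
    ("secondary index", noop),
    ("cache", noop),
    ("Lucene IndexWriter", noop),
]
done = apply_write({"id": 1}, write_path)
```

Because every step runs inside the same call, a failure in the index step reaches the caller in the same way as a failure in any earlier step.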

The commit log and the IndexWriter buffer are two different things, and the sizes of the entries they store are not the same. Depending on the number of columns to index, the entries in the IndexWriter buffer will be larger or smaller. The configuration is not an exact science; it depends a lot on the use case and on experience. The default settings are a good starting point.

The documentation is meant to explain the capabilities of the plugin and to abstract away the low-level implementation details. Right now, if you want details of this type, I can only point you to the source code and to the Cassandra documentation. Anyway, we think it is a good idea to have more advanced documentation covering low-level details.

Hope this helps. Thanks

karankap commented 7 years ago

@jpgilaberte-stratio - these gave really good insights into the plugin.

Many thanks for your help.

Stratio / cassandra-lucene-index #357