dotCMS / core

Headless/Hybrid Content Management System for Enterprises
http://dotcms.com
Other
865 stars 467 forks source link

Refactor Reindex Queue for single thread management of queue db table and queue state #26806

Open spbolton opened 11 months ago

spbolton commented 11 months ago

The ReindexQueueFactory class handles most of the DB queries connecting to the persistent reindex queue. This factory class is called through ReindexQueueAPIImpl the main implementation of ReindexQueueAPI.

The problem is that although we have a single ReindexThread class and thread that attempts to read and process the items in the dist_reindex_journal table we are not consistently updating the table with the same thread. Each separate thread will act in its own transaction and connection. Reads of this table by the reindex thread will be consistent as far as other transactions having completed before those updates can be processed, but we cannot currently guarantee what thread and transaction is currently updating the table which can cause conflicts.

Some of the api methods are called synchronously and will use whatever connection is in the calling method. Some are called in a post transaction trigger, that will ensure that changes in the transaction have at least completed and are visible.

One big issue is that the ReindexThread creates requests to the ElasticSearchAPI using an asynchronous method that passes the BulkProcessorListner class this defines actions to run asynchronously before and after the class. In the handleSuccess method it calls APILocator.getReindexQueueAPI().deleteReindexEntry(successful); that directly tries to update the dist_reindex_journal table to remove the entry. The problem is that due to the asynchronous nature this delete is being done on a separate thread for the ElasticSearch bulkProcessor call. The removal and update of this row can conflict with other threads.

Making the BulkProcessor called from the reindexThread synchronous may solve some of these issues ensuring that the removal is done by the main ReindexThread thread. There are still opportunities in calling the ReindexQueueFactory from multiple threads/connections simultaniously causing conflict.

One of the ways of solving this is to

  1. make ReindexQueueFactory to the ReindexQueueAPIImpl and ensure that no other code outside of the scope of the reindex service modifies or reads the dist_reindex_journal
  2. Instead of allowing calls to ReindexQueueAPIImpl methods directly make DB updates we use an in memory thread safe queue like a BlockingQueue to simply accept the request.
  3. I single thread is used to process the requests and update the state in the dist_reindex_journal.
  4. An im memory list of items that are yet to be persisted including wating response from elasticsearch can be maintained to enable de-duplication or conflict. We should not attempt to store all items awaiting reindex in memory as the list can be large, the db should be the source of truth, but the in memory list should handle the items changes that are yet to be persisted to the db. For handling large amounts of updates the thread writing to the dist_reindex_journal can write in batches from the in memory queue rather than individually.

    If an item version 1 is currently being reindexed and in quick succession we create versions 2 and 3 before version 1 is up to date we can't stop the indexing of version 1, but we should not just remove the index entry after which would then prevent version 3 from being indexed. We do not need to index both version 2 and 3 though, as soon as version 1 has finished indexing we can proceed to index the latest version 3.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

fmontes commented 2 months ago

@spbolton do we still need this? is any work done here?