AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing
Other
7 stars 24 forks source link

Performance degradation because a memory leak in async operations #358

Closed vjrj closed 4 years ago

vjrj commented 4 years ago

Async updates in Cassandra3PersistenceManager has a memory leak in MoreExecutors call that create new objects unnecessary.

In detail, these MoreExecutors.getExistingExecutorService calls despite of its name that subjects that your a getting some singleton creates new MoreExecutor.Application objects:

  public static ExecutorService getExitingExecutorService(ThreadPoolExecutor executor) {
    return (new MoreExecutors.Application()).getExitingExecutorService(executor);
  }

that is causing the memory leak and the degradation.

The symptoms of this performance degradation is in some long task like process-local-node. After few millions of occurrences, the records/sec start to degrade until the task start to throw exceptions and timeouts trying to update Cassandra. The logs are like this:

biocache-store-1 2019-12-24 15:38:51,039 INFO : [ProcessLocalRecords] - Record/sec:113,  updated:3720000, read:3720001, updateFail:0  Last rowkey: eef5f253-ba82-4ddf-8899-a5f831fac75d  Last 1000 in 88.704
biocache-store-1 2019-12-24 15:40:04,687 INFO : [ProcessLocalRecords] - Record/sec:136,  updated:3730000, read:3730000, updateFail:0  Last rowkey: 0c78644c-9fca-48cb-a7be-8cb37d69a070  Last 1000 in 73.648
biocache-store-1 2019-12-24 15:41:20,234 INFO : [ProcessLocalRecords] - Record/sec:132,  updated:3740000, read:3740001, updateFail:0  Last rowkey: eb235f37-9014-4d0e-b8b9-de1948e8fd4f  Last 1000 in 75.547
biocache-store-1 2019-12-24 15:42:50,302 INFO : [ProcessLocalRecords] - Record/sec:111,  updated:3750000, read:3750001, updateFail:0  Last rowkey: 116bd636-d090-4743-8606-e6b98ee18d10  Last 1000 in 90.068
biocache-store-1 2019-12-24 15:44:24,234 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:24,236 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:44:37,392 INFO : [ProcessLocalRecords] - Record/sec:93,  updated:3760000, read:3760001, updateFail:0  Last rowkey: 2dee63c7-0d8b-4e68-86cf-b1769dbec953  Last 1000 in 107.09
biocache-store-1 2019-12-24 15:46:54,044 INFO : [ProcessLocalRecords] - Record/sec:73,  updated:3770000, read:3770001, updateFail:0  Last rowkey: 95d4c87c-1209-4dd3-b5fc-4963cec07c73  Last 1000 in 136.652
biocache-store-1 2019-12-24 15:49:42,961 INFO : [ProcessLocalRecords] - Record/sec:59,  updated:3780000, read:3780001, updateFail:0  Last rowkey: e1fef9f5-93be-47d0-b609-1ed9d3c13745  Last 1000 in 168.916
biocache-store-1 2019-12-24 15:54:29,627 ERROR: [Cassandra3PersistenceManager] - Exception thrown during paging. Retry count 0 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:54:29,628 ERROR: [Cassandra3PersistenceManager] - Backing off for 5 minutes. Retry count 1 - All host(s) tried for query failed (tried: biocache-store-1.gbif.es/127.0.0.1:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response))
biocache-store-1 2019-12-24 15:55:08,262 INFO : [ProcessLocalRecords] - Record/sec:31,  updated:3790000, read:3790001, updateFail:0  Last rowkey: aa1f460c-7af6-4293-86cb-d299fafcb1f0  Last 1000 in 325.302
biocache-store-1 2019-12-24 16:06:54,645 INFO : [ProcessLocalRecords] - Record/sec:14,  updated:3800000, read:3800000, updateFail:0  Last rowkey: c4508b46-6a47-48e2-82f1-c4a1b72f6059  Last 1000 in 706.382
biocache-store-1 2019-12-24 16:14:47,653 ERROR: [ClassificationProcessor] - Exception during classification match for record 094ea831-af1d-435d-b141-daa21badb17f
com.datastax.driver.core.exceptions.OperationTimedOutException: [biocache-store-1.gbif.es/127.0.0.1:9042] Timed out waiting for server response
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:43)
        at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:25)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:293)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)

This was reported by @shahmanash in this slack thread but we were also suffering in our platform.

Some screenshots of visualvm headdumps of biocache command when this happens: Captura de pantalla de 2019-12-27 11-43-32

I've tested that just moving this getExsitingExecutorService out of the putAsync function solve the issue, and long tasks ends without this performance degradation (and only one MoreExecutor.Application object is created. I'll send a PR in short. After patching:

Captura de pantalla de 2019-12-30 18-27-27

This only happens when cassandra.async.updates.threads > 1.

Other uses of MoreExecutor should be verified to detect some similar problems.

vjrj commented 4 years ago

Related https://atlaslivingaustralia.slack.com/archives/CCTFGEU1G/p1584961155087800

peggynewman commented 4 years ago

Merged, and running in production seemingly without error.