akka / akka-persistence-cassandra

A replicated Akka Persistence journal backed by Apache Cassandra
https://doc.akka.io/docs/akka-persistence-cassandra/
Other
329 stars 135 forks source link

Akka Persistence Cassandra retry after DB exceptions #847

Open mckeeh3 opened 3 years ago

mckeeh3 commented 3 years ago

Running tests with Akka Micorservices in Amazon EKS and Keyspaces for Cassandra. When exceptions are thrown due to DB operation timeouts the below queries are retried every few seconds. With version 1.0.3 under the same conditions when these same exceptions occur circuit breakers opened and the retry rate slowed down. Testing indicated that the 1.0.3 behavior recovered more quickly than 1.0.4.

Versions used

Akka version: akka persistence Cassandra 1.0.4

Expected Behavior

When exceptions like this occur expect some circuit breaker reaction with backoff before retry.

Actual Behavior

The below query is performed with no apparent backoff. Queries are retried every few seconds.

Relevant logs

woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-4] - [s0|1919515490][Node(endPoint=3.234.248.222/3.234.248.222:9142, hostId=fd0e98e7-ad76-310d-9094-065cea862868, hashCode=18dbebab)] Error (2 ms) [4 values] woe-twin-749df6f759-q9kpn woe-twin SELECT persistence_id, tag_pid_sequence_nr, timestamp woe-twin-749df6f759-q9kpn woe-twin FROM woe_twin.tag_views WHERE woe-twin-749df6f759-q9kpn woe-twin tag_name = ? AND woe-twin-749df6f759-q9kpn woe-twin timebucket = ? AND woe-twin-749df6f759-q9kpn woe-twin timestamp > ? AND woe-twin-749df6f759-q9kpn woe-twin timestamp <= ? [tag_name='0', timebucket=1606838400000, timestamp=2008b020-33f0-11eb-8080-808080808080, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b] MDC: {} woe-twin-749df6f759-q9kpn woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded) woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [WARN] [akka.persistence.cassandra.query.EventsByTagStage] [] [woe-twin-akka.actor.default-dispatcher-5] - Backtrack failed, this will retried. com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded) MDC: {akkaAddress=akka://woe-twin@192.168.255.109:25520, sourceThread=woe-twin-akka.actor.default-dispatcher-5, akkaSource=EventsByTagStage(akka://woe-twin), sourceActorSystem=woe-twin, akkaTimestamp=16:28:40.103UTC} woe-twin-749df6f759-77xqb woe-twin [2020-12-01 16:28:40,143] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-3] - [s0|1245995343][Node(endPoint=3.234.248.223/3.234.248.223:9142, hostId=8f3b894b-760a-38c2-9db9-bcdc6cf66ef2, hashCode=5192550a)] Error (3 ms) [4 values] woe-twin-749df6f759-77xqb woe-twin SELECT * FROM woe_twin.tag_views WHERE woe-twin-749df6f759-77xqb woe-twin tag_name = ? AND woe-twin-749df6f759-77xqb woe-twin timebucket = ? AND woe-twin-749df6f759-77xqb woe-twin timestamp > ? AND woe-twin-749df6f759-77xqb woe-twin timestamp < ? woe-twin-749df6f759-77xqb woe-twin ORDER BY timestamp ASC woe-twin-749df6f759-77xqb woe-twin [tag_name='0', timebucket=1606838400000, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b, timestamp=4194b8df-33f2-11eb-7f7f-7f7f7f7f7f7f] MDC: {} woe-twin-749df6f759-77xqb woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)

Reproducible Test Case

The problem occurs when Cassandra is pushed to the point that some operations are timing out.

mckeeh3 commented 3 years ago

From the Amazon Keyspaces contact, Michael Raney. "If you made a request that was rejected for a hotkey that’s not counted towards the limit itself. It’s sort of blocked at the front door.

There are two thresholds. 1 for the table and 1 for the individual partitions that make up the partition. If access is too hot on a given partition then it may result in hotkey.

Maybe increase the wait between polling slightly? Few mills."

This appears to be what is happening. The DB is throwing exceptions and the queries retry often, which keeps the DB in an overloaded state.

patriknw commented 3 years ago

Thanks for reporting. We'll look into this shortly.

chbatey commented 3 years ago

The reties are configurable: https://github.com/akka/akka-persistence-cassandra/blob/master/core/src/main/resources/reference.conf#L357 but maybe there is something broken there.

Can you provide all the query logs? To verify the queries are not backing off to every 30s? Each time there is a successful result the backoff is reset

patriknw commented 3 years ago

The retries in eventsByTag works as expected. Exponential backoff according to the configuration, so I don't think that is causing the db overload. The logging of the failed attempts were missing, though, so added in https://github.com/akka/akka-persistence-cassandra/pull/848

Above log mentions "Backtrack failed, this will retried". I will look into if we retry those queries without delays.

chbatey commented 3 years ago

Shall we leave this open for @mckeeh3 to try? Can you try publishing locally Hugh?