During the last FailureFriday we simulated high network latency by suspending processes for two out of five Cassandra nodes in our cluster, which resulted in an unexpected service degradation.
The problem was traced back to SmaLatencyScoreStrategyImpl not receiving score updates when requests time out with SocketTimeoutException. Because the scores are not updated, unreachable nodes keep the best possible score of 0.0 and remain in the available pool. In turn, the round-robin host selector gets stuck on unresponsive hosts, causing the driver to operate in a severely degraded state.
This patch addresses the issue by fixing exception handling for SocketTimeoutExceptions.
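
To illustrate the failure mode, here is a minimal, self-contained sketch of the idea. It is not the actual driver code: `LatencyScoreSketch`, `Operation`, `execute`, and the `timeoutPenaltyMicros` parameter are hypothetical names introduced only for this example. It assumes a simple moving average over a fixed window where a lower score is better and an empty window scores 0.0, so a host that only ever times out must have a sample recorded on timeout or it keeps looking ideal.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a moving-average latency score; not the driver's API. */
public class LatencyScoreSketch {
    private final Deque<Long> samples = new ArrayDeque<>();
    private final int windowSize;
    private final long timeoutPenaltyMicros; // assumed penalty recorded for a timed-out request

    public LatencyScoreSketch(int windowSize, long timeoutPenaltyMicros) {
        this.windowSize = windowSize;
        this.timeoutPenaltyMicros = timeoutPenaltyMicros;
    }

    /** Adds one latency sample, evicting the oldest once the window is full. */
    public synchronized void addSample(long latencyMicros) {
        if (samples.size() == windowSize) {
            samples.removeFirst();
        }
        samples.addLast(latencyMicros);
    }

    /** Returns the mean of the window; 0.0 when no samples were ever recorded. */
    public synchronized double getScore() {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return (double) sum / samples.size();
    }

    /** Hypothetical stand-in for a request sent to one host. */
    public interface Operation {
        void run() throws Exception;
    }

    /** Runs the operation and records a sample on success and on socket timeout. */
    public void execute(Operation op) throws Exception {
        long startNanos = System.nanoTime();
        try {
            op.run();
            addSample((System.nanoTime() - startNanos) / 1000);
        } catch (SocketTimeoutException e) {
            // Without this sample the score would stay at 0.0 for a host that never answers.
            addSample(timeoutPenaltyMicros);
            throw e;
        }
    }
}
```

In this sketch, a host whose requests all time out ends up with a score equal to the penalty rather than 0.0, which gives a score-based filter something to act on. Whether to record a fixed penalty or the elapsed time is a design choice the sketch does not settle.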