DSBulk's rate limiter is not compatible with speculative executions

adutra commented 2 years ago

By default, both rate limiting and speculative executions are disabled in DSBulk.

If both are enabled, we observed that, for writes, the rate limit is not honored from the server's perspective.

This is because rate limit permits are acquired per row written. More specifically, each call to session.executeAsync() will need to acquire permits.

If the request is retried internally by the driver, that's fine, because a retry request is only sent when the initial request has finished, so the invariant acquired <= available is respected and the server never sees more than available concurrent requests.

However, if we enable speculative executions, the driver may trigger a speculative request while the initial request is still in-flight. This will be done without acquiring more permits, since only one call to executeAsync was made. From the client's perspective, the invariant acquired <= available looks respected (we are writing only one row, but with 2 requests), but from the server's perspective, 2 requests were received and the server may find itself processing more than available concurrent requests.
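To make the mismatch concrete, here is a toy model of the accounting described above. It is only a sketch: the class, its counters and its executeAsync method are hypothetical stand-ins, not DSBulk or driver code, and it assumes exactly one permit per row.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model: permits are taken once per executeAsync call, not once per request on the wire. */
public class PermitAccountingSketch {

    /** Hypothetical stand-in for the client-side limiter: permits acquired so far. */
    public final AtomicInteger permitsAcquired = new AtomicInteger();

    /** Hypothetical stand-in for the server: requests actually received. */
    public final AtomicInteger requestsSeenByServer = new AtomicInteger();

    /** Models one single-row write that triggers the given number of speculative executions. */
    public void executeAsync(int speculativeExecutions) {
        permitsAcquired.incrementAndGet();          // one permit per row, acquired exactly once
        requestsSeenByServer.incrementAndGet();     // the initial request
        for (int i = 0; i < speculativeExecutions; i++) {
            // Sent while the initial request is still in flight; no extra permit is acquired.
            requestsSeenByServer.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        PermitAccountingSketch sketch = new PermitAccountingSketch();
        for (int row = 0; row < 100; row++) {
            sketch.executeAsync(1);                 // every write speculates once
        }
        System.out.println("permits acquired: " + sketch.permitsAcquired.get());
        System.out.println("requests seen by server: " + sketch.requestsSeenByServer.get());
    }
}
```

In this model the client accounts for 100 permits while the server receives 200 requests, which is the gap described above.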

The immediate consequence of such a setup is that Astra starts returning OverloadedExceptions even after #435 was implemented.

I don't have any immediate solution for that. I think we would need to change how permits are acquired for writes: each internal request that the driver sends needs to acquire permits, not only the initial one.

We can achieve this in a few ways, but all of them involve extending driver classes:

1. Move Guava's RateLimiter to CqlRequestHandler, and acquire the permits each time a message is written to the Netty channel.
2. Use the driver's built-in RateLimitingRequestThrottler. But we'll need to improve this mechanism:
   - The RequestThrottler interface will need to access the statement being executed, in order to compute the number of permits;
   - The throttler is currently not invoked for speculative executions anyways. This is probably a bug btw.

I will open driver Jiras for improving the throttling mechanism. But even so I'm reluctant to try the above changes:

1. Moving Guava's RateLimiter inside CqlRequestHandler means that we are calling a blocking operation in a driver IO thread. This is considered bad practice and could have undesired consequences.
2. Using the driver's built-in RateLimitingRequestThrottler would avoid blocking operations, but it uses instead an internal queue to park requests. When the queue is full, it throws an error. This might also be undesirable for DSBulk.
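For illustration, the park-or-reject behavior could look roughly like the sketch below. This is not the driver's implementation: the class and method names are hypothetical, and it limits concurrency rather than rate just to keep the example short. The point is only that the caller is never blocked, and that a full queue turns into an error.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical non-blocking throttler: start the request, park it, or reject it. */
public class QueueingThrottlerSketch {

    private final int maxConcurrent;   // requests allowed to run at the same time
    private final int maxQueueSize;    // requests allowed to wait for a slot
    private final Queue<Runnable> parked = new ArrayDeque<>();
    private int inFlight;

    public QueueingThrottlerSketch(int maxConcurrent, int maxQueueSize) {
        this.maxConcurrent = maxConcurrent;
        this.maxQueueSize = maxQueueSize;
    }

    /** Never blocks: runs the request now, parks it for later, or throws if the queue is full. */
    public synchronized void register(Runnable request) {
        if (inFlight < maxConcurrent) {
            inFlight++;
            request.run();
        } else if (parked.size() < maxQueueSize) {
            parked.add(request);
        } else {
            throw new IllegalStateException("throttler queue is full");
        }
    }

    /** Called when a request finishes: hands the freed slot to a parked request, if any. */
    public synchronized void signalDone() {
        Runnable next = parked.poll();
        if (next != null) {
            next.run();   // the slot is reused, so inFlight stays the same
        } else {
            inFlight--;
        }
    }
}
```

With a bulk loader pushing rows as fast as it can, the rejection branch would likely be hit often, which is why this might be undesirable for DSBulk.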

Note: I think that reads are not a problem. For reads, permits are acquired per row emitted, after the results page has been received. So speculative executions won't pose any problem here.

Note2: it might be simpler to just give up on rate limit + speculative execs and document this limitation.

Note3: to mitigate this, we could look into something simpler: implementing application-level retries when a write request ends with a DriverTimeoutException. I will create a separate issue for that.
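A rough sketch of what such application-level retries could look like. Assumptions: the class and method names are hypothetical, java.util.concurrent.TimeoutException stands in for the driver's DriverTimeoutException so the example has no driver dependency, and the supplier represents whatever code acquires permits and calls executeAsync. Since every attempt goes through the supplier again, each retry would acquire its own permits, and it only starts after the previous attempt has ended.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical application-level retry: re-run a write only after it ended with a timeout. */
public class WriteRetrySketch {

    /** Runs the write, retrying up to maxRetries times when an attempt ends with a timeout. */
    public static <T> CompletionStage<T> executeWithRetry(
            Supplier<CompletionStage<T>> write, int maxRetries) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(write, maxRetries, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletionStage<T>> write, int retriesLeft, CompletableFuture<T> result) {
        // Each attempt calls the supplier again, so permits would be re-acquired per attempt.
        write.get().whenComplete((value, error) -> {
            Throwable cause =
                (error instanceof CompletionException && error.getCause() != null)
                    ? error.getCause()
                    : error;
            if (error == null) {
                result.complete(value);
            } else if (cause instanceof TimeoutException && retriesLeft > 0) {
                // TimeoutException is a stand-in for DriverTimeoutException (assumption).
                attempt(write, retriesLeft - 1, result);
            } else {
                result.completeExceptionally(cause);
            }
        });
    }
}
```

Unlike speculative executions, this never puts two requests for the same row in flight at once, so the client-side accounting and what the server sees stay in sync.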


adutra commented 2 years ago

Issue for application-level retries: #448.

adutra commented 2 years ago

Java driver issues for improving built-in throttling:

- JAVA-3036 Expose request in Throttled API
- JAVA-3037 Apply throttling to retries and speculative executions

adutra commented 2 years ago

> This is considered bad practice and could have undesired consequences.

I'd note that Guava's RateLimiter is currently invoked inside Reactor operators and/or (for reads only) also in driver I/O threads: both are bad practices. So all in all, we are already drowning in muddy waters wrt using blocking code in non-blocking contexts.

datastax / dsbulk

DSBulk's rate limiter is not compatible with speculative executions #447