
Java Client: Migrate off of deprecated newSession/refreshSessionToken to Flight Auth #3285

Closed nbauernfeind closed 1 year ago

nbauernfeind commented 1 year ago

DHE, as a client-side user of DHC, needs to use Flight Auth. The current version of io.deephaven.client.impl.SessionImpl still uses SessionService::NewSession and SessionService::RefreshSessionToken, which are deprecated because they fit neither the Flight Auth v1 nor the Flight Auth v2 mold.

There are tests in FlightMessageRoundTripTest that demonstrate how to use Flight Auth v1 and Flight Auth v2. However, they skirt around the need to worry about a session timing out and do not automatically refresh tokens (these are unit tests that aim to catch edge cases and are not concerned with true passage of time).

Fixing up SessionImpl to use the appropriate authentication implementation will make client behavior, under the reign of Flight Auth, more obvious, clearer, and, frankly, usable by DHE.

devinrsmith commented 1 year ago

I'm not savvy to the minutiae of Flight Auth v1 / v2 (happy to look into it if necessary).

My thoughts are we should be plumbing with specific targets in mind:

jcferretti commented 1 year ago

In deciding how to migrate, I think we need to solve an issue that is left open by the high-level nature of the Flight spec.

jcferretti commented 1 year ago

I also understand we should settle on Flight Auth v2, as v1 is discouraged in Flight by now. Meaning we should rely on headers and avoid using the Handshake call for initiating authentication. For the reasons explained in my previous comment, I think we will still need to complement that with:

nbauernfeind commented 1 year ago

The no-op call can and should be an empty Flight::Handshake. If we are using Flight Auth v2, this is actually quite easy, as the response headers at the end of the call should include the refreshed token.

The expiration should likely be an additional header that communicates the time of expiry. This satisfies @kosak's concern from when we were communicating the periodicity as opposed to a "when to refresh" time -- which is what each client needs to determine anyway. As long as the "refresh timestamp" has enough round-trip-delay buffer/overhead built in, things are relatively easy for clients. We could also include a "duration remaining" to address clock skew concerns.
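
To make that concrete, here is a minimal sketch of the client-side arithmetic for both variants, the absolute "time of expiry" header and the "duration remaining" alternative; the class, the method names, and the buffer size are illustrative, not part of any existing proto or API:

    import java.time.Duration;
    import java.time.Instant;

    // Illustration only: when should the client schedule its next refresh,
    // given a server-advertised expiry? Names and values are hypothetical.
    public final class RefreshSchedule {
        // Padding to absorb round-trip delay between client and server.
        private static final Duration RTT_BUFFER = Duration.ofSeconds(10);

        // Variant 1: the server sends an absolute "time of expiry".
        public static Instant fromAbsoluteExpiry(Instant expiresAt) {
            return expiresAt.minus(RTT_BUFFER);
        }

        // Variant 2: the server sends a "duration remaining", which sidesteps
        // clock skew: the client anchors the countdown to its own receive time.
        public static Instant fromDurationRemaining(Instant receivedAt, Duration remaining) {
            return receivedAt.plus(remaining).minus(RTT_BUFFER);
        }
    }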

The action items for me are:

jcferretti commented 1 year ago

Not to be too pedantic about it, but the server should communicate to the client an absolute time of expiration of the token, in the sense of "if I don't hear from you by /then/, I will forget you". It is up to the client to decide when to try to refresh, which obviously needs to happen before the absolute expiration time, and should also account for the possibility of transient failure and the need to retry. A typical strategy here is to refresh at the point halfway between when the server told you and when the token expires (or immediately, if that point has already passed). So, for example, the DHE auth server has a time to live of 10 minutes for a cookie, and clients try to refresh 5 minutes before expiration, which renews the cookie in the server for another 10 minutes, rinse and repeat. So a Time To Live of X for the server results in a regular refresh from the client that, under normal circumstances, happens every X/2, and under abnormal circumstances/network transient errors, gives the client X/2 to get back to the server before being forgotten.
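
A minimal sketch of that scheduling rule, assuming the client has just been told an expiration time (the class and method are illustrative only):

    import java.time.Duration;
    import java.time.Instant;

    // Sketch of the X/2 refresh strategy described above: refresh halfway to
    // expiration, leaving roughly half the TTL to ride out transient failures
    // before the server forgets the session.
    public final class HalfLifeRefresh {
        // When to attempt the next refresh: the midpoint between the moment
        // the expiry was communicated and the expiry itself, or "now" if that
        // midpoint has already passed.
        public static Instant nextRefresh(Instant toldAt, Instant expiresAt, Instant now) {
            final Instant halfway = toldAt.plus(Duration.between(toldAt, expiresAt).dividedBy(2));
            return halfway.isBefore(now) ? now : halfway;
        }
    }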

kosak commented 1 year ago

I don't remember caring about this, but I'm likely forgetting some conversation.

I think this is in your capable hands, but I would add two thoughts:

  1. I don't think traditional notions of "clock skew" should be a concern at these gross timescales (on the order of minutes). I believe the right answer is for the server to fib and say "your token expires at 3:10pm" but secretly the server will not expire the token until 3:11pm, and that gives the protocol an inherent 1 minute of slop, which papers over any reasonable amount of clock skew that might exist. And we don't document the slop except to have a disclaimer somewhere in our documentation that we tolerate up to 30 seconds (or whatever) of clock skew.
  2. However, despite the above, regarding absolute vs relative, one thing I'm triggering on right now is that the "absolute" version requires every endpoint to have a good notion of the current actual time, whereas the "relative" version only requires the endpoint to be able to run a half-decent countdown timer. It seems plausible to me that there are some endpoints (some customer's poorly maintained Windows box, or one where they've deliberately set the clock back for whatever reason) that don't know what time it is. We wouldn't want such a system to work unreliably with Deephaven for what is ultimately a rather fussy reason. You may want to weigh in on whether that is a real concern.

I think my vote falls more on the side of "relative expiration times with fibbing for extra slop", but to be clear, I mean that the lease does not expire on a fixed 10 minute interval, but rather that the lease gets extended whenever the client feels like renewing (T/2 is certainly a reasonable time for the client to wake up and send a refresh). I am somewhat neutral on the design choice of whether the lease should be extended upon every interaction, or only on an explicit Handshake message.
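
A sketch of the "fibbing for extra slop" idea under those assumptions (the one-minute slop is the example figure from above; the class is illustrative):

    import java.time.Duration;
    import java.time.Instant;

    // Sketch: the server enforces one deadline internally but advertises an
    // earlier one, so any reasonable clock skew is papered over.
    final class FibbingLease {
        private static final Duration SLOP = Duration.ofMinutes(1);
        private Instant trueExpiry; // what the server actually enforces

        void renew(Instant now, Duration ttl) {
            trueExpiry = now.plus(ttl);
        }

        // What the server tells the client: "your token expires at 3:10pm",
        // while secretly keeping it alive until 3:11pm.
        Instant advertisedExpiry() {
            return trueExpiry.minus(SLOP);
        }

        boolean isExpired(Instant now) {
            return now.isAfter(trueExpiry);
        }
    }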

jcferretti commented 1 year ago

To follow up on Corey's comment above, I think there is a particular failure mode that we need to consider here, which in my mind argues strongly in favor of absolute time versus relative.

Network transient errors are common (eg, you lose wifi, a switch reboots in the datacenter...). Whatever the root cause, for the same reason that we code retries (and gRPC allows you to automate them and make them transparent), we need to consider the case where the server is about to send a message back to the client with some notion of time (say, the message says "in 120 seconds"), and at that point the network fails and gRPC is retrying to send the message for 90 seconds.

By the time it succeeds 90 seconds later and the client receives the message, how is the client going to interpret the "120 seconds" figure? Bottom line, it is unknowable for peers over a network when a message was actually sent, so it is not possible to interpret relative notions of time reliably. Clock skew is a real problem, and network transient errors are another real problem. Of the two problems, I think network transient errors trump, because (1) clock skew on the order of seconds is already considered very bad, and I claim (maybe this is a bullshit claim, buyer beware) that a machine with bad clock skew means a lot of services on that machine will already not work, and (2) network transient errors can be on the order of scores of seconds / low minutes.

I have another bullshit claim: "the right way to do" these kinds of "lease renewal network protocols" is by using absolute time. I think I need to back that claim by looking / reading a bit. I think I can sustain it by looking at the implementation of jetcd and by either reading the NFS v4 spec or looking at an implementation of it (NFS v4 is, I think, what "invented" the idea of client-side leases for their to-you-client-for-now-exclusive-cached-write-access idea).

When I have some time I'll research that a bit and come back. If I forget somebody can throw something at me when this ticket activates.

jcferretti commented 1 year ago

I looked both in the Chubby paper and in the etcd gRPC API documentation.

The Chubby paper reference is interesting. The description of the behavior is very detailed, and while the use case here is somewhat different (lease renewal for a fault-tolerant critical system), the details probably capture a lot of know-how, so I think it makes sense to pay some attention (not saying we should duplicate anything verbatim; thinking about it may suggest ideas and shed light).

From https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf

This is the quote I found relevant:

A client requests a new session on first contacting the master of a Chubby cell. It ends the session explicitly either when it terminates, or if the session has been idle (with no open handles and no calls for a minute).

Each session has an associated lease—an interval of time extending into the future during which the master guarantees not to terminate the session unilaterally. The end of this interval is called the session lease timeout. The master is free to advance this timeout further into the future, but may not move it backwards in time.

The master advances the lease timeout in three circumstances: on creation of the session, when a master fail-over occurs (see below), and when it responds to a KeepAlive RPC from the client. On receiving a KeepAlive, the master typically blocks the RPC (does not allow it to return) until the client’s previous lease interval is close to expiring. The master later allows the RPC to return to the client, and thus informs the client of the new lease timeout. The master may extend the timeout by any amount. The default extension is 12s, but an overloaded master may use higher values to reduce the number of KeepAlive calls it must process. The client initiates a new KeepAlive immediately after receiving the previous reply. Thus, the client ensures that there is almost always a KeepAlive call blocked at the master.

Note it doesn't say explicitly if the timeout is given to the client in absolute or relative terms, but since the call is initiated immediately by the client, and held at the server and not returned until right before the lease is extended, the issue may be avoided?

The etcd protocol for leases (which are not the same thing, in the sense that a lease is not a session; it is a server-held object maintained on behalf of a particular client for the purposes of protocol operations) is simple-minded and uses relative, not absolute, time:

https://etcd.io/docs/v3.4/learning/api/#lease-api

I am inclined to mumble something like "this seems just wrong". But maybe I am just a sore loser; this protocol is based on relative time, everybody uses it, and it's no big deal, so maybe this proves that, at big enough intervals, it doesn't matter?

jcferretti commented 1 year ago

The NFS v4 spec is also pretty careful about wording:

https://www.ietf.org/rfc/rfc3530.txt

The text below seems to imply that the client estimates a time range within which expiration could have occurred. I talked a bit over Google Meet with Corey, and I have to say, the implied protocol below sounds pretty similar to an idea he had to make relative time "work" in the face of network failures.

   The second lock revocation event is the inability to renew the lease
   before expiration.  While this is considered a rare or unusual event,
   the client must be prepared to recover.  Both the server and client
   will be able to detect the failure to renew the lease and are capable
   of recovering without data corruption.  For the server, it tracks the
   last renewal event serviced for the client and knows when the lease
   will expire.  Similarly, the client must track operations which will
   renew the lease period.  Using the time that each such request was
   sent and the time that the corresponding reply was received, the
   client should bound the time that the corresponding renewal could
   have occurred on the server and thus determine if it is possible that
   a lease period expiration could have occurred.

Actually another part of the spec is explicit about network delays and the fact that the protocol uses relative time:

8.13.  Clocks, Propagation Delay, and Calculating Lease Expiration

   To avoid the need for synchronized clocks, lease times are granted by
   the server as a time delta.  However, there is a requirement that the
   client and server clocks do not drift excessively over the duration
   of the lock.  There is also the issue of propagation delay across the
   network which could easily be several hundred milliseconds as well as
   the possibility that requests will be lost and need to be
   retransmitted.

   To take propagation delay into account, the client should subtract it
   from lease times (e.g., if the client estimates the one-way
   propagation delay as 200 msec, then it can assume that the lease is
   already 200 msec old when it gets it).  In addition, it will take
   another 200 msec to get a response back to the server.  So the client
   must send a lock renewal or write data back to the server 400 msec
   before the lease would expire.

   The server's lease period configuration should take into account the
   network distance of the clients that will be accessing the server's
   resources.  It is expected that the lease period will take into
   account the network propagation delays and other network delay
   factors for the client population.  Since the protocol does not allow
   for an automatic method to determine an appropriate lease period, the
   server's administrator may have to tune the lease period.
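
For concreteness, the RFC's arithmetic as a small sketch (the names are illustrative; the 200 msec one-way delay is the RFC's own example):

    import java.time.Duration;
    import java.time.Instant;

    // RFC 3530 arithmetic from above: the lease is granted as a delta, and the
    // client discounts one estimated one-way delay for the grant reaching it
    // plus another for the renewal reaching the server.
    final class NfsStyleRenewal {
        // e.g. a lease delta of 90s with an estimated one-way delay of 200ms
        // means: renew no later than receivedAt + 90s - 400ms.
        static Instant renewBy(Instant receivedAt, Duration leaseDelta, Duration oneWayDelay) {
            return receivedAt.plus(leaseDelta).minus(oneWayDelay.multipliedBy(2));
        }
    }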

Another source (who is this?) from the "understandingpaxos" site:

Timing Considerations for Master Leases

One practical point of note is that absolute time in distributed systems generally isn’t trustworthy. Even when time synchronization protocols are used, there are many ways in which the clocks on systems can differ significantly from one another. Human errors such as misconfigured synchronization daemons, timezone settings, and firewall rules are uncommon but not unheard of in most environments and there is always the possibility for those rare but real hardware failures that reset the clock of a system to years in the past, years in the future, or anything in between. A good rule of thumb in designing distributed systems is to avoid relying on synchronized clocks for correct operation whenever possible.

To that end, one approach for managing the time associated with master leases is for each peer to use relative durations based exclusively off of their own local clocks. Each peer defines the lease expiration time to be the local time at which they learned of the lease’s existence plus the lease duration. Due to network latencies and variable load on the individual peers, few, if any, peers will precisely agree on the exact lease expiry time in an absolute sense but the critical aspect of the leases, which is preventing a peer from assuming master status before the lease truly expires, is protected.

When attempting to first gain a master lease and also when renewing it, the lease timer is started before the first message to establish the lease is sent. Because there is at least some minimal delay required in achieving consensus on the new lease, the master’s local lease timer is guaranteed to expire before the rest of the peers, thus preventing any window of opportunity in which one of the other peers may gain master status while the current master believes it still holds the lease. Faulty clock hardware on a peer could cause it to think that a lease had elapsed early but this problem is mitigated by the requirement that all peers drop Permission Request and Suggestion messages originating from all non-master peers while they believe the lease is held. As long as a majority of peers do not have similarly faulty hardware, the master’s status is protected.

jcferretti commented 1 year ago

Another reference. Martin Fowler.

https://martinfowler.com/articles/patterns-of-distributed-systems/time-bound-lease.html https://martinfowler.com/articles/patterns-of-distributed-systems/heartbeat.html

What Martin Fowler seems to be saying above is that the client has a notion of expiration time and will try to send heartbeats at a certain 'rate' to ensure the expiration is not hit.

I think from all these references what shapes up in my brain is that they are all trying to devise a system that /avoids/ having to explicitly talk about absolute time; these systems all seem to try to avoid depending on clock synchronization (although the NFS v4 spec is explicit that things do not work if clock drift is excessive).

The Corey idea about client time estimation (the client takes a timestamp, sends a message to the server, receives a message back, and can compare the sent timestamp, the receive timestamp, and the message's relative time data to establish a "bound" on when to renew) feels sound to me in this light.

One way the client could avoid having to "remember" the sent time is to include its send time in the RefreshRequest message payload. The server could just echo that back to the client in the RefreshResponse, include its own notion of server time (second field), and add the (relative) expiration time (third field).

Note "RefreshRequest" and "RefreshResponse" above are just a way to talk about this / a way to think in the context of these protocols; for what we need Nate already specified that we are going to be using an empty Handshake request and any "payloads" are going to be in headers.

When we actually have control of a unary RPC (which in this case we don't), the Chubby idea sounds to me like a great approach: hold the RPC on the server and release it only shortly before the expiration time, which coincides with the renewal; at that point the RPC returns to the client, which in turn immediately invokes it again. It relies on the gRPC implementation (with properly configured retries, etc.), and it is completely independent of the actual intervals being enforced by the server (e.g., in this protocol the client doesn't even need to know them; it only needs to make sense of potential failures of the renew RPC and decide how to react).
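
If we did control such an RPC, the held-KeepAlive loop might look like the sketch below; the blocking keepAlive() stub is hypothetical, not an existing Deephaven or Flight call:

    // Chubby-style loop: the server parks the RPC until just before the lease
    // is extended; the client re-issues immediately, so a KeepAlive is almost
    // always blocked at the server. Hypothetical stub, for illustration only.
    final class KeepAliveLoop implements Runnable {
        interface LeaseStub {
            // Blocks server-side until shortly before the previous lease
            // interval expires, then returns the new lease timeout.
            long keepAlive() throws Exception;
        }

        private final LeaseStub stub;
        private volatile boolean running = true;

        KeepAliveLoop(LeaseStub stub) {
            this.stub = stub;
        }

        @Override
        public void run() {
            while (running) {
                try {
                    stub.keepAlive(); // parked at the server until near expiry
                    // Immediately loop around and re-issue; no client timer.
                } catch (Exception e) {
                    // Transient failure: rely on (properly configured) gRPC
                    // retries, or back off and decide how to react here.
                    running = false;
                }
            }
        }
    }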

jcferretti commented 1 year ago

Related: this suggests to me the problem of clock drift is alive and kicking.

https://www.redhat.com/en/blog/avoiding-clock-drift-vms

jcferretti commented 1 year ago

It looks like the refresh retry logic in SessionImpl does retries in a tight loop:

2023-01-11 19:17:40.576 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] WARN io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, trying again
2023-01-11 19:17:40.578 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] WARN io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, trying again
2023-01-11 19:17:40.579 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] WARN io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, trying again
2023-01-11 19:17:40.580 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] WARN io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, trying again
2023-01-11 19:17:40.582 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] WARN io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, trying again
2023-01-11 19:17:40.584 cfs-230111-dndauth STDERR worker_3 f89fd3a5-bc5b-4e75-9a32-43cc1cb2199f iris iris [grpc-default-executor-38] ERROR io.deephaven.client.impl.SessionImpl$Retrying - Error refreshing token, giving up

This is a class that depends on deprecated functionality, but as DHC stands now it is still necessary. Those retries should have, at a minimum, some linear wait, and ideally an initial exponential backoff that settles into a more widely spaced linear schedule. There is a time budget to retry within; I don't see the point of stopping after a fixed number of attempts either; it seems natural to keep trying until the budget runs out.
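
Something along those lines, as a sketch (constants and names are illustrative, not a proposed API):

    import java.time.Duration;
    import java.time.Instant;

    // Sketch of the suggested retry shape: exponential backoff that caps out
    // into an evenly spaced linear schedule, retrying until a time budget
    // (rather than a fixed attempt count) runs out.
    final class BackoffWithBudget {
        private static final Duration INITIAL = Duration.ofMillis(100);
        private static final Duration CAP = Duration.ofSeconds(5);

        static boolean retry(Runnable refresh, Duration budget) throws InterruptedException {
            final Instant deadline = Instant.now().plus(budget);
            Duration wait = INITIAL;
            while (Instant.now().isBefore(deadline)) {
                try {
                    refresh.run();
                    return true; // refreshed successfully
                } catch (RuntimeException e) {
                    Thread.sleep(wait.toMillis());
                    // Double the wait until it reaches CAP, then stay linear.
                    final Duration doubled = wait.multipliedBy(2);
                    wait = doubled.compareTo(CAP) > 0 ? CAP : doubled;
                }
            }
            return false; // budget exhausted; give up
        }
    }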

Also, there is a Throwable in that message that is being swallowed by the logging:

        public void onRefreshTokenError(Throwable t, Runnable invokeForRetry) {
            if (remainingRefreshes > 0) {
                remainingRefreshes--;
                log.warn("Error refreshing token, trying again", t);
                invokeForRetry.run();
                return;
            }
            log.error("Error refreshing token, giving up", t);
        }
nbauernfeind commented 1 year ago

The tight loop is probably caused by call options that should be configured differently from the default gRPC behavior. Thanks for letting me know; I'll address this as well.

nbauernfeind commented 1 year ago

Updating this for my own reference (or Devin's, if he ends up updating the Java client instead of me). The configuration service advertises the session duration, so we do not need to add an additional header in auth-related responses.

devinrsmith commented 1 year ago

I'd like to take a stab at this to better understand how the mechanism works, how it might implicitly (or explicitly) affect DH URIs, and how it relates to https://github.com/deephaven/deephaven-core/pull/3410.

The no-op call can and should be an empty Flight::Handshake

The lowest layer, java-client-session, does not depend on flight, and I'm hoping to keep it that way. Maybe it does make sense for a DH equivalent. Or maybe, we can just shim our own flight service proto just for that method. Or maybe it's okay to depend on just the raw proto & stubs, but keep all other dependencies at bay.