deephaven / deephaven-core

Deephaven Community Core
Other
246 stars 79 forks source link

DH protocols mismatches (client/server) should be detected and error out with a meaningful message. #1945

Open jcferretti opened 2 years ago

jcferretti commented 2 years ago

We have had two bencher breaks resulting from java client API changes, that did not update bencher. That in itself is not a big problem, except that the actual failure you experience after a protocol breaking change is unscrutable: bencher would fail to parse some server answers and fail with a low level grpc exception that was hard to relate to the real root cause.

Execution of "load tables" failed: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: FAILED_PRECONDITION: Details Logged w/ID '812fbd47-e6b2-4224-84a0-1669bc8d0afb'

(this later break was root caused to change https://github.com/deephaven/deephaven-core/commit/9853bda66ccbd033304a28f65f2611bd4e5f83bb)

Also, Amanda reported today:

image (2)

Which according to slack discussion, has happened before when there is a protocol mismatch triggered by web and server containers version mismatch (note here the protocol is the one used by the web UI, in the first bencher example is the java client API).

So in general, it seems that we need a mechanism to avoid failing in an opaque way when there is a protocol mismatch.

jcferretti commented 2 years ago

The commit for the last bencher breaking change included enough changes in the proto definitions that one would think gRPC would be able to detect such a mismatch:

git diff a716fc8de3fcd2531524ebf9fe15eb22095a9ba4 9853bda66ccbd033304a28f65f2611bd4e5f83bb proto/proto-backplane-grpc/src/main/proto/deephaven/proto/application.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/console.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/inputtable.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/object.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/session.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/table.proto proto/proto-backplane-grpc/src/main/proto/deephaven/proto/ticket.proto

Which includes:

@@ -24,7 +24,7 @@ service ApplicationService {
    * - Subsequent messages modify the existing state. Fields are identified by
    *   their ticket and may be replaced or removed.
    */
-  rpc listFields(ListFieldsRequest) returns (stream FieldsChangeUpdate) {}
+  rpc ListFields(ListFieldsRequest) returns (stream FieldsChangeUpdate) {}

 }

@@ -45,60 +45,9 @@ message FieldsChangeUpdate {
  * A lightweight object describing the exposed field.
  */
 message FieldInfo {
-  Ticket ticket = 1;
+  TypedTicket typed_ticket = 1;
   string field_name = 2;
-  FieldType field_type = 3;
-  string field_description = 4;
-  string application_name = 5; // display-friendly identification
-  string application_id = 6; // computer-friendly identification
-
-  message FieldType {
-    oneof field {
-      CustomInfo custom = 1;
-      TableInfo table = 2;
-      // reserved = 3; // for TreeTable
-      // reserved = 4; // for TableMap
-      FigureInfo figure = 5;
-      // reserved = 6; // for PANDAS
-
-      // use values above 4096 for custom client behavior
-    }
-  }
-}
[...] snip snip

I /would have thought/ that any change that modifies a service method signature (name change) or that changes the types AND number of arguments would be detected by gRPC. Colin just explained to me that I am wrong (which is consistent with above and gRPC just pushing forward obliviously happy).

Maybe that is why Microsoft seems to advice that you go ballistic and identify the protocol version in the package name definition in the proto file, eg, like package greet.v1; https://docs.microsoft.com/en-us/aspnet/core/grpc/versioning?view=aspnetcore-6.0

jcferretti commented 2 years ago

The way arrow flight seems to do the check is via requiring an initial Handshake message: https://github.com/apache/arrow/blob/9910b6ec0d22f23680651c26d2ddb8620c641e6c/format/Flight.proto#L41, https://github.com/apache/arrow/blob/9910b6ec0d22f23680651c26d2ddb8620c641e6c/format/Flight.proto#L129

devinrsmith commented 2 years ago

GRPC doesn't have a way to natively tell that messages are incompatible - it's a feature that ostensibly allows for protocols to move forward while maintaining backwards-compatibility assuming the developers did things "correctly".

In this case, we explicitly changed things in a backwards-incompatible way - it relieves us of having to maintain a "legacy" API that is of little value moving forward. It's something we are "afforded" by semantic versioning rules for "0.x" releases - once we release version "1.0.0", we'll need to have a strategy around "not breaking clients". This may involve versioning the proto messages as v1 explicitly, or adopting a server-side RPCs whereby the client can pass its language and version information so the server can send RPC by RPC appropriate messages that things have changed in a backwards-incompatible way.

devinrsmith commented 2 years ago

Of course, we may also want to have better errors messages before 1.0.0; it's just that we may break things between 0.x.0 releases.

jcferretti commented 2 years ago

As I commented in slack, I think breaking things at the beginning is fine/necessary, I think we need good error messages when that happens to avoid people wasting time trying to figure out why things broke. So the means for quick diagnostic is all I think we need.