MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Transport Agnostic RPC #249

Closed CMCDragonkai closed 1 year ago

CMCDragonkai commented 2 years ago

We have reviewed the gRPC API, and worked with it extensively. It's time to work out a better RPC layer for PK. There are several problems to consider here:

We have 2 main proto files:

  1. proto/schemas/Client.proto
  2. proto/schemas/Agent.proto

There is also a Test.proto; these will need to be used to generate the marshaling code.

Additional context

Tasks

  1. [ ] - Review the proto3 guide, naming standard with reference to the proto files, look for ways to abstract or clean up the data
    • 279

    • [x] - Version the Client.proto and Agent.proto with version names and test out multiple-version services, and find out what the client does when the version wanted is not available
    • [ ] - Test out grpc reflection for versioning if necessary
    • [ ] - Update all the primitive types, and any usage of 64 bit int should have the string hint
    • [ ] - Add some benchmarks to the grpc, and check if MAX_CONCURRENT_CONNECTIONS is used
    • [ ] - Clean up types #200 or gRPC abstractions based on new grpc libraries
    • [ ] - Consider how to shutdown the grpc server and terminate all client connections as well
  2. [ ] - Figure out CQRS, pagination and streaming
    • [ ] - Clean up js-pagination client and server side utilities
    • [ ] - Pagination on unary calls, and streaming calls that can be initialised with pagination parameters (which would enable some sort of CQRS and event sourcing concepts)
  3. [ ] - Review in reference to authentication and sessions
  4. [ ] - Check the HTTP API and web gateway compatibility
  5. [ ] - Deal with the fact that the new grpc-js 1.4.1 version has some incompatibility: TypeError: http2Server.on is not a function.
CMCDragonkai commented 2 years ago

Note that all fields in proto3 are optional, and this is facilitated by default values. https://stackoverflow.com/questions/31801257/why-required-and-optional-is-removed-in-protocol-buffers-3

The default values are based on the types of the fields. So for primitives like numbers it's 0, while for message-typed fields it is null in the generated JS.

Optional fields have since returned to proto3. What this means is that the compiler generates a hazzer to check whether the field was set at all, so you can differentiate between a value not being set by the client versus the client explicitly setting it to the default value. This can be important when you want to change behaviour if the client really didn't set the value.

With optional you get the ability to explicitly check if a field is set. Let's say you have a field int32 confidence. Currently, when receiving a message with such a field, you cannot know the difference between confidence = 0 and confidence not set, because default values are optimized away in the serialization. If you mark the field as optional then some extra bits are set in the serialization and a has_confidence() method will be generated, so that you on the receiving end can disambiguate the two.
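
As a minimal TypeScript sketch of that receiving side, assuming a proto3 optional int32 confidence field and the usual getter/hazzer codegen (the message shape here is a hypothetical structural type, not generated code):

// Structural stand-in for the generated message class
interface ConfidenceMessage {
  hasConfidence(): boolean;
  getConfidence(): number;
}

function readConfidence(msg: ConfidenceMessage): number | undefined {
  if (!msg.hasConfidence()) {
    return undefined; // the sender never set the field
  }
  return msg.getConfidence(); // may legitimately be 0
}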

CMCDragonkai commented 2 years ago

Regarding all the number types in proto3, these are the scalar numeric types (several of which are 64-bit):

int32
int64
uint32
uint64
sint32
sint64
double
float

Due to https://github.com/protocolbuffers/protobuf/issues/3666 there's no support for BigInt yet, and the default JS number representation loses precision for 64-bit integers.

The workaround is:

message dummy {
  uint64 bigInt = 1 [jstype = JS_STRING];
}

This turns it into a string.
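
On the TypeScript side this means the generated getter and setter for such a field work with decimal strings rather than numbers. A minimal sketch of converting to and from BigInt in application code (the field and accessor names are hypothetical):

// Convert the string produced by a [jstype = JS_STRING] getter
function bigIntFromField(value: string): bigint {
  return BigInt(value);
}

// Convert back to a string before passing it to the setter
function bigIntToField(value: bigint): string {
  return value.toString();
}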

CMCDragonkai commented 2 years ago

This shows the mapping from the proto3 code to the generated JS code: https://developers.google.com/protocol-buffers/docs/reference/javascript-generated. Of interest are map, bytes, oneof, and enums.

CMCDragonkai commented 2 years ago

The guide says that if we want to stop using a certain field in a message, it's important to reserve both the field number and the field name so that they cannot be used again. This means that older clients won't get confused.

The field numbers don't actually have to be in sequence; they represent unique positions in the message.

message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}

enum Foo {
  reserved 2, 15, 9 to 11, 40 to max;
  reserved "FOO", "BAR";
}
CMCDragonkai commented 2 years ago

Regarding pagination I like this usage of repeated:

message SearchResponse {
  repeated Result results = 1;
}

message Result {
  string url = 1;
  string title = 2;
  repeated string snippets = 3;
}

It seems that this makes it easier to define a single message type and then lists of messages. I had done this previously in OpenAPI when I had situations where you have GET /resource/1 vs GET /resources where the former acquires a single resource, and the latter fetches a page of resources.

I imagine that responses may have further metadata beyond just the core data. So I could imagine types being represented like:

// the actual domain structure
message XDomainStructure {
  // ...
}

// a single resource /resources/1
message YResponse {
  XDomainStructure x_domain = 1;
}

// multiple resources /resources
message ZResponse {
  repeated XDomainStructure x_domains = 1;
}

That way "response" messages are differentiated from domain structures that we want to work with.

So in our proto definitions we can identify the relevant domain structures we want to work with; these should be derived manually by the programmer by investigating each domain. For example, in our notifications domain we have a domain structure representing a notification message, but this is not the same type as the request & response messages being sent back and forth over the gRPC API about notifications.

CMCDragonkai commented 2 years ago

It's possible to also import proto files. This can make it easier to manage our proto definitions if we separate our type definitions from our message definitions from our service definitions. It could allow us to form subdirectories in src/proto/schemas. We would have to check if our CLI command scripts/proto-generate.sh can actually do this though.

Importing is just a matter of:

import "myproject/other_protos.proto";

To re-export, just use import public "other.proto";.

I believe the name should start from where the --proto_path is set to. This should be src/proto/schemas.

CMCDragonkai commented 2 years ago

I forgot if we were ever able to use any of the google protos. Like import "google/protobuf/any.proto";. The any type allows us to nest any kind of message type. It will be serialized as bytes. We are likely not going to use this particular type google.protobuf.Any though.

However there are a number of other types that are quite useful like google/protobuf/timestamp.proto. That's all located here: https://developers.google.com/protocol-buffers/docs/reference/google.protobuf

CMCDragonkai commented 2 years ago

A oneof can contain fields that use different field numbers, but no matter what, setting any one of those fields clears the others.

The map only works with fixed scalar types for keys and values. So for arbitrary values, we should use JSON encoding and return it as a string.

The map also doesn't nest: the value of a map can't be another map.

Although you should be able to put another map inside a message type, and that would be sufficient to nest maps.

CMCDragonkai commented 2 years ago

Need to point out that re-reading https://medium.com/expedia-group-tech/the-weird-world-of-grpc-tooling-for-node-js-part-3-d994de02bedc suggests that with our current usage of static tooling, importing packages may not work. Not sure at this point.

CMCDragonkai commented 2 years ago

This issue seems to indicate that all we need to do is copy the proto source code verbatim into our src/proto/schemas and then it should work: https://github.com/agreatfool/grpc_tools_node_protoc_ts/issues/72

CMCDragonkai commented 2 years ago

Note that a gRPC channel tends to have a limit on the number of concurrent requests (from that point onwards RPC requests are queued). I believe we have separate channels to each different node. We aren't likely to hit this limit in production right now, but the chattiness may increase when we integrate gestalt synchronisation #190. This issue has more details: https://github.com/grpc/grpc/issues/21386. The current workaround is to create new channels and load balance between channels, however this is considered to be a temporary solution.

Use a pool of gRPC channels to distribute RPCs over multiple connections (channels must have different channel args to prevent re-use so define a use-specific channel arg such as channel number).

This will impact any benchmarking we do that involves launching multiple concurrent gRPC requests.

This limit is mentioned here: https://github.com/grpc/grpc-node/blob/master/PACKAGE-COMPARISON.md and also compares the @grpc/grpc-js vs grpc library (that is now deprecated).

CMCDragonkai commented 2 years ago

Regarding API versioning: gRPC is naturally backwards compatible, so as long as there are no breaking changes to the API it's possible to keep using the same version and clients continue to work as normal. However, when there is a breaking change, the MS guide recommends using package names.

So right now we have agentInterface, but we can instead call it agentInterface.v1, and then increment this version specifier whenever we make backwards-incompatible changes.

This basically means one can create different service interfaces, and it is possible to run multiple versions of the same API. So right now we may have the agent service, but we may also run 2 agent services on the same port. This would be similar to having api/v1 and api/v2 in a RESTful URL where both still work. The only challenge here is maintaining handlers for the old API as well.

However this doesn't telegraph to the client side which versions are available. It's just a matter of the client selecting which version it wants to use and hoping that the agent side still has that version. Maybe the absence of that versioned service is enough to signal to the client that its source code version is too old and needs to be updated. This will need to be prototyped in the test gRPC service to see what exceptions/errors occur when the relevant versioned package is no longer being offered as a service.
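
A hedged sketch of what the client-side check might look like, assuming grpc-js surfaces a missing versioned service/method as an UNIMPLEMENTED status (the helper name is hypothetical and the prototype would need to confirm this behaviour):

import type { ServiceError } from '@grpc/grpc-js';
import { status } from '@grpc/grpc-js';

// If the versioned package is no longer served, the call is expected to fail
// with UNIMPLEMENTED; treat that as "client version too old"
function isVersionUnavailable(err: unknown): boolean {
  return (err as ServiceError)?.code === status.UNIMPLEMENTED;
}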

This would benefit from being able to break up the proto files into subdirectories so that way common messages and type declarations can be shared.


There was also an idea to use server reflection so that the client can query the protobuf descriptions: https://stackoverflow.com/a/41646864/582917

Server reflection is not available directly in grpc-js (https://github.com/grpc/grpc-node/issues/79). However there are libraries built on top of it that add reflection support.

The examples given show that grpcurl can then be used directly without having access to the proto files, since it is possible to request the proto files from the service.

It has some relationship to the descriptor proto: https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/descriptor.proto

Also there's the option declaration. Not sure how that works yet.

CMCDragonkai commented 2 years ago

Found an interesting library that builds on top of @grpc/grpc-js: https://github.com/deeplay-io/nice-grpc. They seem to have done a lot of work on making gRPC js more user friendly. It's too complicated to integrate right now, but it would be worth checking out later.

CMCDragonkai commented 2 years ago

Relevant errors that we should have more tests for: https://www.grpc.io/docs/guides/error/

CMCDragonkai commented 2 years ago

Recommend reading over this @tegefaulkes when you're finished with your checklist in vaults refactoring.

tegefaulkes commented 2 years ago

Playing around with getting the message definitions to exist within their own .proto files and packages.

It seems doable. I can, for example, define the vault message inside domains/Vaults.proto and import it into Client.proto with import public "domains/Vaults.proto";. From there I can use the messages inside a service by doing

//OLD
rpc VaultsRename(VaultRenameMessage) returns (VaultMessage) {};

//NEW
rpc VaultsRename(Vault.Rename) returns (Vault.Vault);

When constructing the messages we can now do VaultMessage = new clientPB.Vaults.Rename(). I'm likely going to rename clientPB to messages at some point.

CMCDragonkai commented 2 years ago

Are package/file names meant to be lowercase? It seems that way from other examples. Have a look at how one is supposed to import google proto library.


tegefaulkes commented 2 years ago

I didn't check what it should be but I can change it if needed.

CMCDragonkai commented 2 years ago

I think we should have in our src/config.ts:

The sourceVersion is ideally fetched from the package.json. However we don't want to use import packageJson from '../package.json'; because that screws up the dist build. Instead we can make config.ts read one directory up: when src/config.ts is distributed it ends up under dist/, and it can then access the ../package.json that will be in the NPM distribution.
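
A minimal sketch of that approach in src/config.ts, assuming the built file still sits one directory below package.json (the sourceVersion field name follows the comment above):

import * as fs from 'fs';
import * as path from 'path';

// Read package.json one directory up from wherever this file ends up
// (src/ during development, dist/ after the build)
const packageJson = JSON.parse(
  fs.readFileSync(path.join(__dirname, '..', 'package.json'), 'utf-8'),
);

const config = {
  sourceVersion: packageJson.version as string,
};

export default config;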

However I'm not sure if this will work under vercel/pkg. @tegefaulkes can you check this out as well.

CMCDragonkai commented 2 years ago

@tegefaulkes remember to tick off things as you're making progress. And if you have a new PR/branch, create a PR for it and link it here too. The task list should be copied there.

tegefaulkes commented 2 years ago

I'm still working on creating common domain types and common message types. Almost done with that.

I've created a new branch off of master for this under API_Review

tegefaulkes commented 2 years ago

Consider how to shutdown the grpc server and terminate all client connections as well

Looks like GRPCServer.ts already has a method for doing this.

public closeServerForce(): void {
   this.server.forceShutdown();
}
tegefaulkes commented 2 years ago

When it comes to sessions and authentication, I feel there is a lot of duplicated code that could be made into a utility function. On the client side we have code that updates the session token; this is in every CLI command.

const pCall = grpcClient.nodesClaim(nodeClaimMessage);
const { p, resolveP } = utils.promise();
pCall.call.on('metadata', async (meta) => {
  await clientUtils.refreshSession(meta, client.session);
  resolveP(null);
});
const response = await pCall;
await p;

Likewise, on the server side we have code that checks the session token and sends an updated token to the client. This is at the beginning of each RPC method.

await sessionManager.verifyToken(utils.getToken(call.metadata));
const responseMeta = utils.createMetaTokenResponse(
  await sessionManager.generateToken(),
);
call.sendMetadata(responseMeta);

If we had to update either of these it would be annoying.
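
A hedged sketch of what a shared client-side helper could look like, assuming the promisified unary call shape from the snippet above (the helper and parameter names are hypothetical):

import type { ClientUnaryCall, Metadata } from '@grpc/grpc-js';

// The promisified call exposes the underlying call object, as in the snippet above
type PromisifiedUnaryCall<T> = Promise<T> & { call: ClientUnaryCall };

async function withSessionRefresh<T>(
  pCall: PromisifiedUnaryCall<T>,
  refreshSession: (meta: Metadata) => Promise<void>,
): Promise<T> {
  const metadataHandled = new Promise<void>((resolve) => {
    pCall.call.on('metadata', async (meta: Metadata) => {
      await refreshSession(meta);
      resolve();
    });
  });
  const response = await pCall;
  await metadataHandled;
  return response;
}

// Usage (hypothetical):
// await withSessionRefresh(grpcClient.nodesClaim(nodeClaimMessage), (meta) =>
//   clientUtils.refreshSession(meta, client.session),
// );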

CMCDragonkai commented 2 years ago

Prototype by abstracting into subdirectories under src/proto/schemas, and integrate google's protobuf library by first referencing it and then just copying verbatim

This can be split off and merged in earlier as we focus on getting the core work done in #194.

CMCDragonkai commented 2 years ago

New proto schemas structure:

[nix-shell:~/Projects/js-polykey/src/proto/schemas]$ tree .
.
├── google
│   └── protobuf
│       ├── any.proto
│       ├── descriptor.proto
│       ├── duration.proto
│       ├── empty.proto
│       ├── field_mask.proto
│       ├── struct.proto
│       ├── timestamp.proto
│       └── wrappers.proto
└── polykey
    └── v1
        ├── agent_service.proto
        ├── client_service.proto
        ├── gestalts
        │   └── gestalts.proto
        ├── identities
        │   └── identities.proto
        ├── keys
        │   └── keys.proto
        ├── nodes
        │   └── nodes.proto
        ├── notifications
        │   └── notifications.proto
        ├── permissions
        │   └── permissions.proto
        ├── secrets
        │   └── secrets.proto
        ├── sessions
        │   └── sessions.proto
        ├── test_service.proto
        ├── utils
        │   └── utils.proto
        └── vaults
            └── vaults.proto

14 directories, 21 files

New versions should go into polykey/v2, but only the definitions that have changed, except for the services, which must always be copied.

Next iteration of API review should specify the usage of google/protobuf.

CMCDragonkai commented 2 years ago

The above should go into the developer documentation.

CMCDragonkai commented 2 years ago

The current separated types are still quite messy. The shared types being used in each domain should match the actual domain types that are used inside the TypeScript source code, at least for the types that we want to transfer over the wire.

After this, we need specific types for XRequest and YResponse message wrappers. These are distinct from, and unique to, the domain protobuf message structs. Following this standard: https://docs.buf.build/lint/rules#rpc_request_standard_name-rpc_response_standard_name-rpc_request_response_unique

Other struct types should not have the Request or Response suffix; they are used as properties of the XRequest and YResponse types.

If dealing with streams, they would just be streams of XRequest and YResponse types.

The name of each request/response message should follow the RPC it is used for. This means these types will need to be put into src/proto/schemas/client_service.proto and src/proto/schemas/agent_service.proto. Unfortunately this will impact imports across the codebase, so some find and replace will be needed.

I used Ctrl + Shift + H in VSCode to do it.

RFC @tegefaulkes @emmacasolin @scottmmorris @joshuakarp

Doing this will reduce the number of imports like this:

import * as utilsPB from '../../proto/js/polykey/v1/utils/utils_pb';
import * as nodesPB from '../../proto/js/polykey/v1/nodes/nodes_pb';
import * as gestaltsPB from '../../proto/js/polykey/v1/gestalts/gestalts_pb';
import * as permissionsPB from '../../proto/js/polykey/v1/permissions/permissions_pb';

And recover back what we used to do:

import * as clientServicePB  from '../../proto/js/polykey/v1/client_service_pb';
// clientServicePB is basically clientPB
joshuakarp commented 2 years ago

So we abstract the lower-level message types to higher-level? i.e. instead of:

// agent_service.proto
rpc NodesChainDataGet (polykey.v1.utils.EmptyMessage) returns (polykey.v1.nodes.ChainData);

we'd have

// agent_service.proto
rpc NodesChainDataGet (polykey.v1.agent_service.NodesChainDataGetRequest) returns (polykey.v1.nodes.NodesChainDataGetResponse);

The only issue I see with this is that you'd lose the ability to clearly compare different service functions and see whether their request/response types are the same. It also makes it a bit more convoluted to actually construct (you need to create an extra message type before being able to do anything).

Having said all of this, interestingly my opinions on this go against the standard you linked:

One of the single most important rules to enforce in modern Protobuf development is to have a unique request and response message for every RPC. Separate RPCs should not have their request and response parameters controlled by the same Protobuf message, and if you share a Protobuf message between multiple RPCs, this results in multiple RPCs being affected when fields on this Protobuf message change. Even in simple cases, best practice is to always have a wrapper message for your RPC request and response types.

If this is the case, then probably best to do it.


In terms of your other comment:

The shared types being used in each domain should match the actual domain types that are used inside the TypeScript source code. At least the types that we want to transfer over the wire.

I completely agree with this though.

CMCDragonkai commented 2 years ago

I think most wrappers may be "thin" wrappers around common types. But the existence of unique wrappers allows extensibility for the future.

CMCDragonkai commented 2 years ago

Our current error handling over GRPC should be documented and investigated to see how it would work if mapped to an HTTP API.

CMCDragonkai commented 2 years ago

The src/errors.ts currently has an ErrorUndefinedBehaviour that should be moved somewhere else. It is currently used when the exception returned from the GRPC server is unknown. It doesn't make sense to have this at the top level src/errors.ts.

CMCDragonkai commented 2 years ago

While reviewing the CLI Authorization Retry Loop MR, we discovered the need to redo our integration of session management into the grpc domain. Discussion is here: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213#note_728972375

This led us to explore the usage of client side and server side interceptors. Atm, grpc-js doesn't support server side interceptors, but does support client side interceptors.

This library https://github.com/deeplay-io/nice-grpc came up again as they have already built on top of grpc-js and implemented client side and server side interceptor/middleware.

It also, however, does all of the conversion of gRPC calls into promises and async generators that we did ourselves in grpc/utils.ts.

While a temporary solution will be used for session management for now, this can be a valid direction for moving ahead with GRPC and thus replacing our own hacks on top of GRPC.

The only issue is our hack on grpc-js which monkey patches the http2 server in GRPCAgent in order to replace its TLS logic with our certificate verification that uses our root keys. This only affects the client service as the agent service relies on the networking domain to do the TLS logic. We would have to investigate how easy it is to monkey patch the HTTP2 server if using nice-grpc. Otherwise we can lift some code from them and port it over to our system.

CMCDragonkai commented 2 years ago

Drafting out the ideal session management architecture:

[diagram: grpc session management]

CMCDragonkai commented 2 years ago

I noticed the usage of:

/**
 * Generic Message error
 */
class ErrorGRPCInvalidMessage extends ErrorGRPC {
  exitCode: number = 70;
}

Which is used in client/rpcGestalts:

» ~/Projects/js-polykey/src
 ♖ ag 'ErrorGRPCInvalidMessage'  ⚡(gprcsessions) pts/5 19:05:14
grpc/errors.ts
32:class ErrorGRPCInvalidMessage extends ErrorGRPC {
47:  ErrorGRPCInvalidMessage,

client/rpcGestalts.ts
259:            throw new errors.ErrorGRPCInvalidMessage(
297:            throw new errors.ErrorGRPCInvalidMessage(
340:            throw new errors.ErrorGRPCInvalidMessage(
378:            throw new errors.ErrorGRPCInvalidMessage(

This is a domain specific error and should come from src/gestalts/errors.ts.

GRPC errors should only be for GRPC-specific errors, and these errors are not meant to be thrown by RPC handlers.

CMCDragonkai commented 2 years ago

I've resolved ErrorUndefinedBehaviour; it is actually legitimate. It is now:

/**
 * This is a special error that is only used for absurd situations
 * Intended to placate typescript so that unreachable code type checks
 * If this is thrown, this means there is a bug in the code
 */
class ErrorPolykeyUndefinedBehaviour extends ErrorPolykey {
  description = 'You should never see this error';
  exitCode = 70;
}

Will be in the CARL MR https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213

CMCDragonkai commented 2 years ago

Error Handling and Trailing Metadata and Status & Message

@tegefaulkes @joshuakarp very important!

Further investigation into gRPC internals has been done in the realm of metadata, error handling, our use of serialising exceptions via trailing metadata, and the StatusObject. Progress log here: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213#note_733965832

The situation is that there is leading and trailing metadata.

The client side can only send leading metadata; we are using this to send the session token.

The server side receives the leading metadata and makes use of it to check the session token.

The server side can then respond with leading metadata or trailing metadata. In this case, we use leading metadata to respond with the new session token.

Trailing metadata is currently being used to serialise our exceptions. However trailing metadata may need to be used for other things in the future. To solve this, the error key-values that are currently spread across the trailing metadata should be placed as encoded JSON under a single error key in the metadata. Additionally we should be making use of the status code and details properties in StatusObject.

For example, the current code is:

/**
 * Serializes ErrorPolykey instances into GRPC errors
 * Use this on the sending side to send exceptions
 * Do not send exceptions to clients you do not trust
 */
function fromError(error: errors.ErrorPolykey): ServerStatusResponse {
  const metadata = new grpc.Metadata();
  metadata.set('name', error.name);
  metadata.set('message', error.message);
  metadata.set('data', JSON.stringify(error.data));
  return {
    metadata,
  };
}

This should be changed to something like (pseudo code):

metadata.set('error', error.toJSON());

Note that ErrorPolykey instances all have a toJSON() method.
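
A hedged sketch of the revised serialisation (the pseudo code made concrete; the decoding helper is hypothetical, the return type is simplified, and ErrorPolykey is reduced to the one method used here):

import * as grpc from '@grpc/grpc-js';

// Send side: a single 'error' key carrying the whole exception as a JSON string
function fromError(error: { toJSON(): string }): { metadata: grpc.Metadata } {
  const metadata = new grpc.Metadata();
  metadata.set('error', error.toJSON());
  return { metadata };
}

// Receive side: decode the 'error' key back into a POJO for toError to consume
function errorFromMetadata(metadata: grpc.Metadata): object | undefined {
  const value = metadata.get('error')[0];
  if (value == null) return undefined;
  return JSON.parse(value.toString());
}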

This however can only be sent on the client service.

In our agent service, we must not send exceptions over because they may contain sensitive data. This means we should also be making use of the code and details.

Note:

export declare type ServerStatusResponse = Partial<StatusObject>;
export interface StatusObject {
    code: Status;
    details: string;
    metadata: Metadata;
}
export declare enum Status {
    OK = 0,
    CANCELLED = 1,
    UNKNOWN = 2,
    INVALID_ARGUMENT = 3,
    DEADLINE_EXCEEDED = 4,
    NOT_FOUND = 5,
    ALREADY_EXISTS = 6,
    PERMISSION_DENIED = 7,
    RESOURCE_EXHAUSTED = 8,
    FAILED_PRECONDITION = 9,
    ABORTED = 10,
    OUT_OF_RANGE = 11,
    UNIMPLEMENTED = 12,
    INTERNAL = 13,
    UNAVAILABLE = 14,
    DATA_LOSS = 15,
    UNAUTHENTICATED = 16
}

The default is UNKNOWN if it is not passed in. That's why toError checks for the UNKNOWN code.

In the case of the agent service, it should be sufficient to only set the status code and details, as we are returning an error to an unknown party. We would need to use one of the existing status codes above together with the details. Further details could be placed in the error property of the Metadata. However, this cannot just be error.toJSON() as that contains sensitive information like data and stack.

What we can do instead is have a "simplified" JSON output which only contains the name, description, message and exitCode.

  toJSON(): string {
    return JSON.stringify({
      name: this.name,
      description: this.description,
      message: this.message,
      exitCode: this.exitCode,
      data: this.data,
      stack: this.stack,
    });
  }
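
A hedged sketch of the simplified output described above, written as a standalone function (whether this lives on ErrorPolykey itself or inside the agent service's fromError is an open design choice):

// Serialise only the non-sensitive fields for the agent service
function toPublicJSON(error: {
  name: string;
  description: string;
  message: string;
  exitCode: number;
}): string {
  return JSON.stringify({
    name: error.name,
    description: error.description,
    message: error.message,
    exitCode: error.exitCode,
  });
}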

This would allow the caller on the agent service to get the name, description, message and exit code but nothing else. Would this be sufficient to return validation errors? It seems not, since we would at least require structure. In such a case, data is necessary, but these exceptions would have to be carefully constructed and verified not to output any sensitive information. Even the message must not. To ensure this is the case, we would need to carefully document the usage of message and data so that they are not allowed to contain sensitive data for exceptions.

Need to think about how to structure this so that no mistakes can be made and it is foolproof.

In order to achieve such a thing, some changes need to be made to grpc/utils.ts. Refer to these 2 quotes from the progress log.

However the usage of the generators that convert the streams currently has no provision for metadata; you would have to use call.end directly afterwards, and they are already calling stream.end, for example in generatorWritable(). This means we should expose the usage of metadata.

This might mean that gen.next(null), which currently ends the generator and stream, would have to be changed to gen.next(meta); thus if metadata is sent, the stream has to end with that metadata. We may therefore allow both to mean this: gen.next(null | Metadata). So this is one change that needs to be done.

The locations to change are generatorWritable in src/grpc/utils.ts, where we use stream.end() when vW === null, and generatorDuplex in src/grpc/utils.ts, where we use gW.next(null) when vW === null.

An ideal way of receiving metadata on client side can be:

const pCall = client.unary(m);
const leadingMeta = await pCall.metadata();
const respMsg = await pCall;

or

const genDuplex = duplexStream();
const leadingMeta = await genDuplex.metadata();

This avoids having to go into the internal property of call or stream and then attaching an event handler.

However, if we implement this so that it only resolves when metadata is received, there is a problem: if no metadata ever arrives, the resolution will never occur. Instead, such a promise must reject if the response has finished without any initial metadata.

It may also be that metadata is not a function, but a promise property, because the event handlers might need to be added immediately. I wonder what would happen if the stream is started and the event handler is added afterwards. It seems safer to use pCall.metadata or genDuplex.metadata as promise properties than as functions that have to attach the event handlers afterwards.
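
A hedged sketch of exposing the leading metadata as a promise; pCall.metadata as a property is the proposal above, not an existing API, and the rejection covers the case where the call finishes without any leading metadata:

import type { ClientUnaryCall, Metadata } from '@grpc/grpc-js';

function leadingMetadata(call: ClientUnaryCall): Promise<Metadata> {
  return new Promise((resolve, reject) => {
    call.on('metadata', (meta: Metadata) => resolve(meta));
    // 'status' fires when the call is done; if metadata already resolved,
    // this rejection is a no-op
    call.on('status', () => {
      reject(new Error('Call ended without leading metadata'));
    });
  });
}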

This only solves the problem for leading metadata but not trailing metadata.

If the previous post is correct in that trailing metadata is being used for our exceptions, then we may just have to avoid relying on trailing metadata. A better option would be to namespace our trailing metadata under an error key; note that nested data is not allowed in metadata, but you can provide an encoded string or buffer. So if we were to key error to an encoded POJO JSON and decode that in our toError and fromError, that would free up the trailing metadata for other uses in the future.

CMCDragonkai commented 2 years ago

All calls must have deadlines, otherwise we can hang forever.

All GRPC calls support CallOptions.

Which has:

export interface CallOptions {
    deadline?: Deadline;
    host?: string;
    parent?: ServerUnaryCall<any, any> | ServerReadableStream<any, any> | ServerWritableStream<any, any> | ServerDuplexStream<any, any>;
    propagate_flags?: number;
    credentials?: CallCredentials;
    interceptors?: Interceptor[];
    interceptor_providers?: InterceptorProvider[];
}

Where Deadline is:

export declare type Deadline = Date | number;

This is the same kind of deadline as we have with client timeouts when waiting for ready.

These deadlines are on the client side and ensure we have a deadline for when the call is expected to finish.
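
A minimal sketch of supplying a deadline through CallOptions (the helper name is hypothetical; generated client methods accept these options as an argument):

import type { CallOptions } from '@grpc/grpc-js';

// Build CallOptions with a deadline `ms` milliseconds from now
function withDeadline(ms: number, options: CallOptions = {}): CallOptions {
  return { ...options, deadline: Date.now() + ms };
}

// Usage (hypothetical): client.nodesClaim(request, withDeadline(5000), callback);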

This requires some testing as to how it works when applied to streams. Does the deadline apply to the entire stream lifetime, or per message (like a keep-alive timeout)?

Deadlines are important to prevent dangling calls taking up resources when the other side drops.

This only solves deadlines on the client side, but not deadlines on the server side. Especially with streaming we need to have deadlines on server side to prevent long-running useless call streams.

Read this as well: https://grpc.io/blog/deadlines/. It appears that the server side can see whether the client has exceeded the deadline and then decide to cancel the call.

CMCDragonkai commented 2 years ago

The server side can use call.getDeadline() to see what the deadline is that the client specified. However this doesn't "set" a deadline for the server side call.

Consider the problem of a client stream hanging around and not ending: one must eventually time out and close the connection. This may be something we should do in a server side interceptor (which currently doesn't exist).

joshuakarp commented 2 years ago

Work on deadlines will also relate to #243, notably with a bidirectional stream (note that for node claims, it's strictly between agents).

CMCDragonkai commented 2 years ago

Regarding deadlines, it's possible to apply them globally with an interceptor: you just add the deadline to the call options on every call. Of course this may impact long-running streams. It is also possible to filter calls when using the interceptor. The docs on client interceptors go into more detail (https://github.com/grpc/proposal/blob/master/L5-node-client-interceptors.md and https://grpc.github.io/grpc/node/module-src_client_interceptors.html). I used these resources to code up the session interceptor.
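
A hedged sketch of such a global interceptor, applying a default deadline only when the call doesn't already carry one (whether to exempt long-running streams, as discussed below, would be decided case by case):

import { InterceptingCall } from '@grpc/grpc-js';
import type { Interceptor } from '@grpc/grpc-js';

const defaultDeadlineInterceptor = (ms: number): Interceptor => {
  return (options, nextCall) => {
    // Only apply the default when the caller hasn't set its own deadline
    if (options.deadline == null) {
      options.deadline = Date.now() + ms;
    }
    return new InterceptingCall(nextCall(options));
  };
};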

For long-running streams, we may want to do something where we don't want to put a deadline on it unless the stream is not active. A sort of keep-alive timeout instead of an overall call deadline.

But this is going to require a case by case analysis. It's something we need to spec out in detail.

CMCDragonkai commented 2 years ago

Deadlines regarding node connection timeouts had some discussion here: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213#note_739641342. @joshuakarp please incorporate that into your nodes work too.

joshuakarp commented 2 years ago

Are deadlines necessary for NodeConnection timeouts here? I was under the impression that this was what the GRPCClientAgent.timeout was for. Doesn't a deadline just add another "timeout" on top of this?

CMCDragonkai commented 2 years ago

Timeouts from Node connection should propagate for the GRPCClientAgent. So any timeout specified in NodeConnection should directly be passed into the construction of GRPCClientAgent.

CMCDragonkai commented 2 years ago

In the future when IPv6 is available, usages of the 0.0.0.0 default should be changed to the dual-stack specifier :: as the default, since it binds to all hosts for both IPv4 and IPv6. This is primarily useful for public bindings.

For localhost, ::1 is used for IPv6 while 127.0.0.1 is still IPv4. I believe we should stick to 127.0.0.1 for localhost binding instead of ::1, mainly because if we're using localhost it's more common to use IPv4. Later this may become dual stack anyway.

CMCDragonkai commented 2 years ago

An issue came up in https://github.com/MatrixAI/js-polykey/pull/278#issuecomment-979730714 where we were throwing errors when the provider id was not one of the providers in our list.

There are many ways to do this:

  1. Firstly, many of our functions use an undefined return value to indicate that something doesn't exist; we don't always want to throw an exception
  2. Catching exceptions is verbose, and exceptions should be "real" errors, not the general course of operation

Because RPC is meant to be network calls, and in our normal functions we return undefined when something doesn't exist, it makes sense for our RPC calls to also return a type that can represent something not existing. The way that GRPC does this is through the oneof specification: https://developers.google.com/protocol-buffers/docs/proto3#oneof

This would be similar to how in TS, we are using x | undefined as the return type.
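
A hedged sketch of how that could look on the receiving side, using the oneof case accessor that google-protobuf typically generates (the message, field, and accessor names here are hypothetical structural stand-ins):

// Structural stand-ins for a generated response message with a oneof result
enum ResultCase {
  RESULT_NOT_SET = 0,
  PROVIDER = 1,
}

interface ProviderResponse {
  getResultCase(): ResultCase;
  getProvider(): string;
}

// Mirrors the TS convention of returning `x | undefined`
function providerFromResponse(response: ProviderResponse): string | undefined {
  return response.getResultCase() === ResultCase.PROVIDER
    ? response.getProvider()
    : undefined;
}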

There are a number of "sentinel values" that we can use to indicate that something doesn't exist:

Of course in other cases we should throw an exception when something that is expected to exist doesn't exist. The design of this requires a case by case analysis until we can form a general guideline.

CMCDragonkai commented 2 years ago

Due to our asynchronous GRPC client interceptor, adding deadlines can cause problems in our GRPC calls. Upstream GRPC is fixing this by making sure that the asynchronous ops are actually awaited, by awaiting the starting outbound interceptor. Details are here: https://github.com/MatrixAI/js-polykey/pull/296#issuecomment-989580750

This means that right now our deadlines have to be at least 1 second long, but even this is not a robust solution. The best case scenario is to get the upstream fix and upgrade our GRPC version. That involves lots of changes though, so it will require some robust testing to ensure that our existing GRPC usage still works.

CMCDragonkai commented 2 years ago

Based on my testing in #296, I found that when the deadline is exceeded, we end up with a local status error called GRPC_STATUS_DEADLINE_EXCEEDED. This is received by onReceiveStatus in the client interceptor.

This can occur before the server side even has a chance to respond with leading metadata. It results in an exception being thrown from the GRPC call, which is of course caught by our root exception handler. Note that this has not been encapsulated into our GRPCClient and therefore isn't part of our grpcErrors hierarchy yet. This is something we would want to incorporate into our application properly, as deadline failures may be caught by our nodes domain in order to initiate retries. It should be part of our design overhaul in #225.
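
A hedged sketch of what incorporating this could look like; ErrorGRPCClientTimeout is a hypothetical error class here, not an existing member of grpcErrors:

import { status } from '@grpc/grpc-js';

class ErrorGRPCClientTimeout extends Error {}

// Translate a local DEADLINE_EXCEEDED status into a domain error so that
// callers (e.g. the nodes domain) can catch it and decide whether to retry
function translateCallError(err: { code?: number; message?: string }): Error {
  if (err.code === status.DEADLINE_EXCEEDED) {
    return new ErrorGRPCClientTimeout(err.message);
  }
  return new Error(err.message);
}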

CMCDragonkai commented 2 years ago

Until the upstream fixes asynchronous interceptors, it's possible for concurrent GRPC calls to run without refreshing the token.

await pkClient.grpcClient.doX();
await pkClient.grpcClient.doY();
// at this point only doX and doY interceptors will finish
// unless there is more work that is scheduled ahead in the event loop

Not a huge issue right now due to the use of FlowCountInterceptor, but can be dealt with later.

Details here: https://github.com/MatrixAI/js-polykey/pull/296#issuecomment-989740159

CMCDragonkai commented 2 years ago

Coming from fixing the pk identities authenticate call: https://github.com/MatrixAI/js-polykey/pull/278#issuecomment-996403113

There are 3 possible "timeout" events that our GRPC protocol needs to handle:

  1. The client put a deadline (https://github.com/MatrixAI/js-polykey/issues/249#issuecomment-992001975) on the unary call or stream, and the deadline was exceeded.
    • The client side receives an exception either on the unary call handler, or on any stream error handler.
    • The server side can see that the deadline is exceeded. I don't remember how it sees it, whether by event or due to a boolean condition. But it should see this.
  2. The client explicitly cancels the call/stream like:
  3. Neither deadline nor cancellation occurs, but the server side is hanging on either:
    • sending a message and expecting the message to be received
    • waiting for a message (this is not possible with unary call or server stream, only client stream or duplex stream)
    • in this case, the server side should have its own deadline, and then throw an exception
    • note that only the server side can throw exceptions to the client, and only the client side can cancel; this means cancellation from the server side is about throwing exceptions

Right now none of these 3 situations are being handled. This means our GRPC server can be deadlocked or experience resource starvation. Without a way to abort asynchronous operations, we also have resource leaks.

Example of this is:

Issues that are relevant to this are:


I believe events 1 and 2 look the same to the server side: it sees both as cancellation events. This is due to: https://grpc.io/blog/deadlines/#checking-deadlines

Which should mean that when the call is cancelled, onReceiveStatus should also be called.

CMCDragonkai commented 2 years ago

As a side note about Protobuf Map.

https://github.com/MatrixAI/js-polykey/pull/278#issuecomment-996395664

The map used in protobuf is not an ES6 Map; it's the google-protobuf library's own Map type. The documentation is here:

To convert it to a POJO:

// this is the closest way to do it, the X is whatever the field name is
Object.fromEntries(message.getXMap().entries());