fullstorydev / grpchan

Channels for gRPC: custom transports
MIT License
204 stars 23 forks

Questions about expected performances #30

Closed IOOOTARobby closed 5 years ago

IOOOTARobby commented 5 years ago

Hi, Go newbie here. I've recently started to explore the gRPC world, and this is much more a question than an issue; I hope I'm using the proper channel.

I came across your project while looking for an "in-process" gRPC interaction, which turns out to be provided by your inprocgrpc.Channel transport.

Previously I had been experimenting with standard gRPC over Unix sockets, comparing the performance of a gRPC invocation against a plain direct method invocation, that is, calling

foo.GetInfo() // just a function of the foo struct

versus

grpcfoo.GetInfo(context.Background()) // will call foo.GetInfo() in turn

GetInfo() just returns an Info message, generated from my proto file.

I made the comparison over 10K iterations: it took ~1s with gRPC over a Unix socket vs ~0.2ms with the direct function call, i.e. 5000 times slower.

Then I did the same with your in-process transport:

grpchanfoo.GetInfo(context.Background())

The 10K iterations took ~80ms, i.e. 400 times slower than the direct function calls (and 12.5 times faster than standard gRPC over a Unix socket).
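For reference, a micro-benchmark of the kind described above can be sketched like this (the `fooService` and `Info` types are stand-ins for the generated code, not actual names from any proto file):

```go
package main

import (
	"fmt"
	"time"
)

// Info stands in for the message type generated from the proto file.
type Info struct{ Version string }

// fooService stands in for the concrete server implementation.
type fooService struct{}

func (s *fooService) GetInfo() *Info { return &Info{Version: "1.0"} }

func main() {
	foo := &fooService{}
	const iterations = 10000

	start := time.Now()
	for i := 0; i < iterations; i++ {
		_ = foo.GetInfo() // direct function call, no RPC machinery
	}
	fmt.Printf("direct: %v for %d calls\n", time.Since(start), iterations)
}
```

A more rigorous comparison would use Go's built-in `testing.Benchmark` facilities, which handle warm-up and iteration counts automatically.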

I would just like to know whether you find these outcomes reasonable and expected, and also whether you think a different kind of transport could push the performance even further, closer to that of a direct function call.

Thanks in advance,

Roberto

jhump commented 5 years ago

While it may be possible to push the envelope a little further with targeted optimization work in this library, it will necessarily still be an order of magnitude or so slower than a direct function call.

One of the aims of this library is that an in-process call is more-or-less indistinguishable from a network call, functionally speaking. That allows you to write client code that is agnostic to whether the channel is a network socket or an in-process channel.

To this end, requests must be cloned, so that servers do not share the same reference as clients. Otherwise, client code that re-uses the request message in a loop, where a goroutine is spawned for each loop iteration (e.g. forking off many parallel RPCs), will fail unpredictably due to incorrect memory sharing.
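The hazard described above can be illustrated with a small sketch (the `Request` type is invented for illustration): taking a copy of the message before the goroutine starts plays the role of the clone, so the "server" side never observes a later mutation by the client.

```go
package main

import (
	"fmt"
	"sync"
)

// Request stands in for a generated request message (illustrative only).
type Request struct{ ID int }

func main() {
	var wg sync.WaitGroup
	req := &Request{}
	results := make([]int, 5)

	for i := 0; i < 5; i++ {
		req.ID = i // client re-uses one request message across iterations
		wg.Add(1)
		// Cloning (here: a value copy taken before the goroutine runs)
		// ensures the "server" never sees a later mutation of req.
		go func(slot int, r Request) {
			defer wg.Done()
			results[slot] = r.ID
		}(i, *req)
	}
	wg.Wait()
	fmt.Println(results) // cloned: always [0 1 2 3 4]; shared, it would be racy
}
```

If the goroutines read `req.ID` through the shared pointer instead, the race detector would flag it and the results would be unpredictable, which is exactly why the channel clones requests.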

Also, timeouts and cancellations cannot be properly enforced with a direct function call: the operation will necessarily block until the server notices the context expiry and cooperatively unwinds the stack. So, to work around that, this library runs the server function call in a separate goroutine, so that the client can terminate early if/when the context ends. The act of running the function call in a separate goroutine and communicating via channels necessarily imposes much more overhead than just a function call. Also, streaming calls are non-blocking -- so even if you wanted to reduce the overhead and just do direct function calls, this wouldn't be possible for streaming methods.

So the main opportunities for improvement in this package are in how the messages are cloned and in the inter-thread communication. And the latter is very hard to optimize without resorting to decidedly non-idiomatic-Go code (which would make the library harder to maintain IMO... and the in-process channel is probably already complicated enough).

If you really need more performance and are willing to give up the channel's fidelity compared to network calls, you could create your own channel implementation that delegates to an in-process channel for streaming calls but just directly calls the relevant handler for unary ones. For the streaming calls, you can configure the channel to use a no-op cloner (i.e. avoid copying the request). Note that if you skip copying the request, you'll want to make sure the code is well-tested with the race detector enabled so you can catch accidental sharing.

IOOOTARobby commented 5 years ago

Thanks a lot for the quick, informative and exhaustive answer.

First, I want to clarify that I absolutely didn't intend to criticize the work you have done: I think grpchan is a great piece of software that the official gRPC team should consider including as part of the reference implementation. It's a smart component, chapeau.

Your feedback has clarified things for me and led me to dig deeper into gRPC's fundamentals (I'm taking my first steps, as I said): I now understand the complexity of emulating a network interaction (with asynchrony, timeouts, single-side cancellation, streams and so on) in a non-network environment; the performance impact is inevitable and is the price of an agnostic interface.

Nonetheless, I found your closing suggestion very interesting: I would benefit a lot from special handling of the sub-case where the desired behavior mostly resembles a plain function invocation. I'm thinking of a fork where a custom channel option (like "UnaryAsFuncCalls") can be passed during in-process channel creation, so that unary calls bypass all the "network emulation" behaviors (ignoring any inapplicable setting, like timeouts) and go straight to the handler, leaving all the other network-like interactions intact. Streams can be left untouched, since you generally don't create many of them in a loop that needs to be fast: it's typical to create the stream once and then wait for (or send) the data flow.
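The proposed option could be wired in with Go's usual functional-options pattern. To be clear, none of this exists in grpchan today; "UnaryAsFuncCalls" and the surrounding names are purely the proposal above, sketched:

```go
package main

import "fmt"

// channelConfig collects hypothetical in-process channel settings.
type channelConfig struct {
	unaryAsFuncCalls bool
}

// ChannelOption mutates the config at channel-creation time.
type ChannelOption func(*channelConfig)

// UnaryAsFuncCalls would make unary RPCs bypass the network-emulation
// path (goroutine, cloning, timeout enforcement) and call the handler
// directly, as proposed above.
func UnaryAsFuncCalls() ChannelOption {
	return func(c *channelConfig) { c.unaryAsFuncCalls = true }
}

func newConfig(opts ...ChannelOption) *channelConfig {
	c := &channelConfig{} // defaults: full network emulation
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	c := newConfig(UnaryAsFuncCalls())
	fmt.Println(c.unaryAsFuncCalls) // true
}
```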

Our business case (where my search started) is a UUID generator service: we want it to be ubiquitous over the network and language agnostic, hence the choice of gRPC. We then planned to use that service inside a DB server machine, providing UUIDs for storage purposes: previously those were generated through a library, with a simple "genUUID()" direct invocation, and the throughput obviously degraded noticeably when we switched from the library to the locally deployed service (exposed through a Unix socket and wrapping the aforementioned library in turn). As you can guess, when we call "genUUID()" we don't expect a possible timeout or any other special behavior: we just want the UUID asap. I understand that this breaks the desired abstraction, but it would be extremely useful in practice, as it would make the service truly adaptive on the performance side, eliminating the need to maintain a separate library with identical semantics.

I'd like to know your opinion on the matter and, possibly, to have your guidance on where to act in the code to implement the above.

Thanks in advance.

jhump commented 5 years ago

@IOOOTARobby, sorry it took me so long to get back to this question.

To be honest, if performance is that big of an issue, instead of using a gRPC stub and an in-process channel, you would be better off having this code directly invoke the implementation functions. The main value of using an in-process channel as the means of sharing the implementation is the flexibility it provides, mostly in a few scenarios that I can think of:

But if you don't require any of those options, you could skip the gRPC stub entirely and directly call the implementation function.

Does that make sense?

> I'd like to know your opinion on the matter and, possibly, to have your indications on where to act on the code to implement the above.

If you were to fork this code to make unary calls perform a direct function invocation, you'd change the Invoke method on the channel so that it calls the method description's handler directly, instead of in a goroutine. The goroutine is currently spawned here, and the handler invoked here. This would make the rest of the function simpler, since it would no longer need to use channels to communicate with the goroutine it spawned.