dialohq / ocaml-grpc

gRPC library for OCaml
https://dialohq.github.io/ocaml-grpc
BSD 3-Clause "New" or "Revised" License
58 stars 9 forks source link

Fix trailers race condition #26

Closed doctor-pi closed 1 year ago

doctor-pi commented 1 year ago

We recently noticed that we were getting Status = Unknown with the default message "Server did not return grpc-status", when a server was in fact responding properly with a non-default value.

We now reproduced the issue in both grpc-async and grpc-lwt, using a variation of the greeter client example, with repeated requests. It seems that we randomly fail to get the real status, and we are filling in the default.

This seems to be a race condition due to the way this is implemented, by checking if a future is filled when in fact it is simply not filled yet.

How to reproduce

See branch reproduce/unknown-status-issue for a reproduction of the issue. (https://github.com/dialohq/ocaml-grpc/tree/reproduce/unknown-status-issue.)

Run this in one terminal:

Then run either of the following and it should blow up at some random (numbered) message:

Check solution

See branch test/unknown-status-issue, which contains the same commits that are in this branch (the solution), on top of the code to reproduce the issue. (https://github.com/dialohq/ocaml-grpc/tree/test/unknown-status-issue.)

Run the same commands as above and after 100_000 iterations the programs should terminate without error. (Feel free to remove or increase the limit, I tested without it but it's convenient.)

Assumptions

We assume that trailers only responses are not an issue, and we should never hang when there is no body in the response. This is consistent with expectations of HTTP2, so any hangs would probably indicate an issue with the underlying HTTP2 library and should probably be addressed there.

Should be tested with the etcd example, see https://github.com/dialohq/ocaml-grpc/issues/1. Could it be an etcd / proxy issue? I think there might be a missing trailer handler, but the situation needs some confirmation.

Update (13-Apr-2023)

Refactored the code a bit.

quernd commented 1 year ago

Thanks, great catch! This definitely makes sense. I will test it for our main use case (streaming).

quernd commented 1 year ago

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

doctor-pi commented 1 year ago

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

You are right. I just tested in our codebase as well, and it hanged on a non-existing method.

I'll also have a look.

doctor-pi commented 1 year ago

It works well for our use case, however your intuition to check with the etcd example was right because this change does indeed seem to introduce a regression (see #1, e.g. when dialing a non-existing method it will hang again). That's because we still need to get the gRPC status from the headers when the body is empty. I will investigate some more tomorrow, thanks again for reporting and the reproduction!

I adapted the solution, and at least in my test it won't hang when the method is changed to a non-registered one. For example ~rpc:"SayHello" -> ~rpc:"SayHello1" will now result in an error with 404 status.

The server does not return gprc headers in that case so I may have to refine the solution a bit, let me know how it works for you.

doctor-pi commented 1 year ago

@quernd i suspect the etcd example might behave as expected now. (After the latest commit: fefc3d8)

doctor-pi commented 1 year ago

We should probably also probe for the presence of grpc-status in the response.status = Ok case, before waiting for the trailers ivar which blocks.

I added a commit that implements this solution.

quernd commented 1 year ago

Thanks again @doctor-pi, it all looks good in my tests now. I'll review and merge soon.

doctor-pi commented 1 year ago

It looks good to me. Thanks again for your contribution!

Nice. I'll merge.

doctor-pi commented 1 year ago

It looks good to me. Thanks again for your contribution!

I guess I should have squashed the commits?

quernd commented 1 year ago

I guess I should have squashed the commits?

Don't worry about that, we don't have any such policy (yet).