dialohq / ocaml-grpc

gRPC library for OCaml
https://dialohq.github.io/ocaml-grpc
BSD 3-Clause "New" or "Revised" License
62 stars 9 forks source link

server crashes when the client disconnects unexpectedly #34

Open joprice opened 1 year ago

joprice commented 1 year ago

I'm running the EIO server and client examples, and noticed that when I kill the client, the server crashes with the following error:

End_of_file Raised at Eio_linux__Low_level.readv.(fun) in file "lib_eio_linux/low_level.ml", line 158, characters 25-52
+Called from Eio_unix__Rcfd.use in file "lib_eio/unix/rcfd.ml", line 161, characters 10-14
+Re-raised at Eio_unix__Rcfd.use in file "lib_eio/unix/rcfd.ml", line 166, characters 6-41
+Called from Eio__Flow.single_read in file "lib_eio/flow.ml", line 16, characters 12-27
+Called from Gluten_eio.IO_loop.read_once.(fun) in file "eio/gluten_eio.ml", line 54, characters 10-45
+Called from Gluten_eio.IO_loop.read_once in file "eio/gluten_eio.ml", line 51, characters 4-192
+Called from Gluten_eio.IO_loop.read in file "eio/gluten_eio.ml", line 60, characters 10-31
+Called from Eio__core__Fiber.any.(fun).wrap in file "lib_eio/core/fiber.ml", line 97, characters 16-20
+Re-raised at Eio__core__Fiber.any in file "lib_eio/core/fiber.ml", line 138, characters 26-61
+Called from Gluten_eio.IO_loop.start.(fun).read_loop_step in file "eio/gluten_eio.ml", line 126, characters 21-44
+Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
+Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 150, characters 4-21
+Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 116, characters 8-12
+Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 118, characters 32-40
+Called from Dune__exe__Grpc_server.(fun).loop in file "grpc_test/grpc_server.ml", line 125, characters 12-28

I'm not sure if I'm missing somewhere I'm intended to handle this, or whether connection handling should be set up a different way to be more resilient, or if this is unexpected behavior.

quernd commented 1 year ago

Thanks for reporting this. I haven't yet been able to reproduce it; what are the exact steps to trigger this error? Just Ctrl-C the client during a streaming call?

Since this exception is not caught by Gluten: which specific version of Gluten are you using? There have been a couple of commits recently in the Gluten main branch that affect exception handling.

joprice commented 1 year ago

I'm using the latest. I tried downgrading to a version before those changes were introduced, but had trouble finding a set of commits across h2-eio and gluten and a few other deps in my project that were compatible. Which versions are you building with?

joprice commented 1 year ago

I was just able to get it to build. I had to downgrade Eio to 0.10, along with gluten=0.4.1 and h2=0.10.0. Just tested with those and no longer see the error on disconnection.

joprice commented 1 year ago

As far as the server itself shutting down, this seems to be my misunderstanding of the lifetime of the new switch argument provided to create_connection_handler. Introducing a new switch in the anonymous function returned from connection_handler in the example allows the server to survive the exception:


 fun socket addr ->
   Eio.Switch.run (fun sw ->
       H2_eio.Server.create_connection_handler ~sw ?config:None
         ~request_handler ~error_handler addr socket)

But I'm new to eio and not sure that's correct.

joprice commented 1 year ago

I hit another case which I'm not sure yet is related to the recent gluten changes. I have to try to reproduce it in the reduced version with the older versions. I am now trying to terminate the server while the client is querying in a loop. What I see is the error handler gets called with Malformed_response and then it hangs.

joprice commented 1 year ago

I added some traces and looks like the code hangs in get_response_and_bodieswaiting for the response body when an error is hit: let response = Eio.Promise.await response in. Maybe the error handler needs to be taken into account and fail the promises?

joprice commented 1 year ago

I experimented with this here and it does fix the hanging: https://github.com/dialohq/ocaml-grpc/compare/main...joprice:ocaml-grpc:clientErrors.

quernd commented 1 year ago

I was just able to get it to build. I had to downgrade Eio to 0.10, along with gluten=0.4.1 and h2=0.10.0. Just tested with those and no longer see the error on disconnection.

Yeah, it can be tricky to find a set of dependency versions that work together. That's mostly because there isn't yet a version of H2 on OPAM that works with Eio 0.11. We enforced an upper bound in this commit but that's just for the examples, it seems.

In any case, if that error doesn't present with gluten 0.4.1, it may be worth opening an issue in the gluten repository to ask for clarification if this behavior is intended. Another thing that would be good to know is: are you able to reproduce the error only in conjunction with gRPC? From the backtrace you posted it looks to me like it would happen with any HTTP2 connection.

quernd commented 1 year ago

I experimented with this here and it does fix the hanging: main...joprice:ocaml-grpc:clientErrors.

That looks reasonable! Would you mind opening a PR?

joprice commented 1 year ago

@quernd lost track of the in my inbox. I can put up a pr.