PlatformLab / grpc_homa

Allows Homa to be used as a transport with gRPC.

CompletionQueue::Next returns OK after the channel is broken #12

Closed · mikehb closed this issue 10 months ago

mikehb commented 10 months ago

https://github.com/PlatformLab/grpc_homa/files/12864623/homa.tar.gz

In the program above, cq_->Next(&tag, &ok) on the server side keeps returning with 'ok' set to true even after the client has been closed.

Printing the ok variable in the handler:

void GrpcRtkService::WatchTextMessageHandler::doProceed(bool ok, void *tag) {
    std::cout << "ok = " << ok << std::endl;
    // ...
}

Run the server and the client, then shut down the client:

[mikehuang@bogon homa]$ ./bazel-bin/server -homa
0
ok = 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
ok = 1
register text message client 127.0.0.1:32785
ok = 1
43
ok = 1
44
ok = 1
45
ok = 1
46
ok = 1
47
ok = 1
48
ok = 1
49
ok = 1
50
ok = 1
51
ok = 1
52
ok = 1
53
ok = 1
54
ok = 1
55
ok = 1
56
ok = 1
E1013 16:21:58.578004464   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 105864): Transport endpoint is not connected
57
E1013 16:21:59.100186795   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 105904): Transport endpoint is not connected
ok = 1
58
E1013 16:22:00.101237824   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 105950): Transport endpoint is not connected
ok = 1
59
E1013 16:22:01.113876504   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 105980): Transport endpoint is not connected
ok = 1
60
E1013 16:22:02.115967771   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 106018): Transport endpoint is not connected
ok = 1
61
E1013 16:22:03.127742514   17385 homa_incoming.cc:185]                 Error in recvmsg (homaId 106066): Transport endpoint is not connected
ok = 1
62

A separate point: gRPC's AsyncNotifyWhenDone can be used to detect a broken channel, and I use it to remove the broken client (via UnregisterTextMessageHandler). I'm not sure whether grpc_homa supports it; please also test this feature once this bug is fixed.

mikehb commented 10 months ago

This is the gRPC Next specification:

  /// Read from the queue, blocking until an event is available or the queue is
  /// shutting down.
  ///
  /// \param[out] tag Updated to point to the read event's tag.
  /// \param[out] ok true if read a successful event, false otherwise.
  ///
  /// Note that each tag sent to the completion queue (through RPC operations
  /// or alarms) will be delivered out of the completion queue by a call to
  /// Next (or a related method), regardless of whether the operation succeeded
  /// or not. Success here means that this operation completed in the normal
  /// valid manner.
  ///
  /// Server-side RPC request: \a ok indicates that the RPC has indeed
  /// been started. If it is false, the server has been Shutdown
  /// before this particular call got matched to an incoming RPC.
  ///
  /// Client-side StartCall/RPC invocation: \a ok indicates that the RPC is
  /// going to go to the wire. If it is false, it is not going to the wire. This
  /// would happen if the channel is either permanently broken or
  /// transiently broken but with the fail-fast option. (Note that async unary
  /// RPCs don't post a CQ tag at this point, nor do client-streaming
  /// or bidi-streaming RPCs that have the initial metadata corked option set.)
  ///
  /// Client-side Write, Client-side WritesDone, Server-side Write,
  /// Server-side Finish, Server-side SendInitialMetadata (which is
  /// typically included in Write or Finish when not done explicitly):
  /// \a ok means that the data/metadata/status/etc is going to go to the
  /// wire. If it is false, it is not going to the wire because the call
  /// is already dead (i.e., canceled, deadline expired, other side
  /// dropped the channel, etc).
  ///
  /// Client-side Read, Server-side Read, Client-side
  /// RecvInitialMetadata (which is typically included in Read if not
  /// done explicitly): \a ok indicates whether there is a valid message
  /// that got read. If not, you know that there are certainly no more
  /// messages that can ever be read from this stream. For the client-side
  /// operations, this only happens because the call is dead. For the
  /// server-side operation, though, this could happen because the client
  /// has done a WritesDone already.
  ///
  /// Client-side Finish: \a ok should always be true
  ///
  /// Server-side AsyncNotifyWhenDone: \a ok should always be true
  ///
  /// Alarm: \a ok is true if it expired, false if it was canceled
  ///
  /// \return true if got an event, false if the queue is fully drained and
  ///         shut down.
  bool Next(void** tag, bool* ok) {
    // Check return type == GOT_EVENT... cases:
    // SHUTDOWN  - queue has been shutdown, return false.
    // TIMEOUT   - we passed infinity time => queue has been shutdown, return
    //             false.
    // GOT_EVENT - we actually got an event, return true.
    return (AsyncNextInternal(tag, ok, gpr_inf_future(GPR_CLOCK_REALTIME)) ==
            GOT_EVENT);
  }
johnousterhout commented 10 months ago

First of all, thanks for reporting all of these issues. Unfortunately I haven't been able to find precise documentation for exactly how a gRPC transport like Homa should behave, so I only find out about missing features when someone tries something new and it doesn't work. I appreciate all of your reports and apologize for the inconvenience.

This one is going to take a bit longer to fix because it involves a bunch of changes (including some to Homa). It will probably be mid to late next week before I can get all the fixes in place.

mikehb commented 10 months ago

It's fine, there is no hurry.

johnousterhout commented 10 months ago

I think I've got this fixed now (server-side error handling was woefully inadequate). I believe that AsyncNotifyWhenDone also works now. This fix required a change to the Homa transport as well, so you'll need to pull the latest Homa sources as well as grpc_homa. I'm going to close the issue, but let me know if you find additional problems.

Thanks for reporting the problem.

mikehb commented 10 months ago

The latest code works for me, thanks.