dotnet / aspnetcore


Client/server networking abstractions #10308

Closed ReubenBond closed 5 years ago

ReubenBond commented 5 years ago

Epic https://github.com/aspnet/AspNetCore/issues/10869

The goal of this issue is to specify abstractions which the .NET community can use to build robust, correct, high-performance client & server applications. This is a follow-up to #4772 (Project Bedrock) and is largely a summary of a discussion with @davidfowl, @anurse, @halter73, @shirhatti, & @jkotalik.

Initiating & accepting connections

Clients initiate connections to an endpoint using IConnectionFactory. Servers bind to an endpoint using IConnectionListenerFactory and then accept incoming connections using the IConnectionListener returned at binding time.

public abstract class ConnectionContext : IAsyncDisposable
{
  // unchanged except for IAsyncDisposable base
}

public interface IConnectionListenerFactory
{
    ValueTask<IConnectionListener> BindAsync(System.Net.EndPoint endpoint, CancellationToken cancellationToken = default);
}

public interface IConnectionListener : IAsyncDisposable
{
    EndPoint EndPoint { get; }
    ValueTask<ConnectionContext> AcceptAsync(CancellationToken cancellationToken = default);
    ValueTask UnbindAsync(CancellationToken cancellationToken = default);
}

public interface IConnectionFactory
{
    ValueTask<ConnectionContext> ConnectAsync(System.Net.EndPoint endpoint, CancellationToken cancellationToken = default);
}
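
For illustration, a client might consume these interfaces roughly as follows. This is only a sketch: it assumes ConnectionContext exposes an IDuplexPipe via Transport (as it does today), and the endpoint and payload are made up for the example.

using System.IO.Pipelines;
using System.Net;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class ClientSketch
{
    public static async Task SendPingAsync(IConnectionFactory factory, CancellationToken token)
    {
        // Establish the outbound connection via the transport-agnostic factory.
        ConnectionContext connection = await factory.ConnectAsync(new IPEndPoint(IPAddress.Loopback, 5000), token);

        await using (connection)
        {
            // Write a request and flush it to the remote peer.
            await connection.Transport.Output.WriteAsync(Encoding.ASCII.GetBytes("PING\n"), token);

            // Read one response segment and mark it as consumed.
            ReadResult result = await connection.Transport.Input.ReadAsync(token);
            connection.Transport.Input.AdvanceTo(result.Buffer.End);
        }
    }
}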

A framework might want to support multiple different transports, such as file + socket. In that case, it can use multiple IConnectionListenerFactory implementations.

The shared abstractions should not dictate how these factories are specified or configured: frameworks should continue to follow a framework-specific builder pattern. Eg, KestrelServerOptions has List<ListenOptions> where each entry is a different endpoint to listen on + middleware to handle connections to that endpoint.

Both IConnectionListenerFactory.BindAsync & IConnectionFactory.ConnectAsync accept a System.Net.EndPoint to bind/connect to. The BCL ships EndPoint implementations such as IPEndPoint, DnsEndPoint, and UnixDomainSocketEndPoint. Users can add new transports which accept custom sub-classes of EndPoint, eg MyOverlayNetworkEndPoint.
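
For example, an overlay-network transport could define its own endpoint type; the shared abstractions never need to know about it (purely illustrative):

public class MyOverlayNetworkEndPoint : EndPoint
{
    public MyOverlayNetworkEndPoint(string nodeId) => NodeId = nodeId;

    // An identifier only the overlay-network transport knows how to dial.
    public string NodeId { get; }

    public override string ToString() => $"overlay://{NodeId}";
}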

Servers bind to endpoints using IConnectionListenerFactory, resulting in an IConnectionListener. The listener's AcceptAsync() is then called in a loop, each time resulting in a ConnectionContext instance which represents a newly accepted connection. When the server wants to stop accepting new connections, the listener is disposed by calling listener.DisposeAsync(). Connections which were previously accepted continue to operate until each is individually terminated.
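
A minimal accept loop might look like this (sketch only; HandleConnectionAsync is a hypothetical per-connection handler, and it assumes cancelling the token cancels a pending accept — the exact shutdown semantics are discussed below):

async Task AcceptLoopAsync(IConnectionListenerFactory factory, EndPoint endPoint, CancellationToken shutdownToken)
{
    IConnectionListener listener = await factory.BindAsync(endPoint, shutdownToken);

    try
    {
        while (true)
        {
            ConnectionContext connection = await listener.AcceptAsync(shutdownToken);

            // Hand the connection off; the accept loop must not block on a single connection.
            _ = HandleConnectionAsync(connection);
        }
    }
    catch (OperationCanceledException)
    {
        // Shutdown was requested; stop accepting.
    }

    // Previously accepted connections keep running until each is terminated individually.
    await listener.DisposeAsync();
}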

For implementations of IConnectionListener which are backed by a callback system (eg, libuv), decoupling the lifetime of IConnectionListener and the (many) resulting ConnectionContexts poses a challenge. We need to consider whether or not this minimal interface allows graceful shutdown (including draining connections) to still be cleanly implemented in those cases.

Handling connections

The client/server decides how it handles the ConnectionContext. Typically it will be decorated (eg, via ConnectionDelegate) and pumped until the connection is terminated. Connections can be terminated by calling DisposeAsync() (from IAsyncDisposable) but can also be terminated out-of-band (eg, because the remote endpoint terminated it). In either case, clients/servers can subscribe to the ConnectionContext's IConnectionLifetimeFeature feature to learn about termination and take necessary action (eg, cleanup).
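
For example, inside a connection handler (where connection is the accepted ConnectionContext), termination can be observed through the feature collection. This sketch assumes the existing IConnectionLifetimeFeature shape (a ConnectionClosed token plus Abort()):

var lifetime = connection.Features.Get<IConnectionLifetimeFeature>();
lifetime?.ConnectionClosed.Register(() =>
{
    // Runs whether the connection was terminated locally or by the remote endpoint.
    Console.WriteLine($"Connection {connection.ConnectionId} terminated; releasing per-connection state.");
});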

Connection metadata

Currently, IHttpConnectionFeature is used to expose local/remote endpoint information. We will add a new feature, IConnectionEndPointFeature as a uniform way to access this information.

public interface IConnectionEndPointFeature
{
  EndPoint RemoteEndPoint { get; set; }
  EndPoint LocalEndPoint { get; set; }
}
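
Usage would then be uniform across transports, e.g. (sketch, with connection being any ConnectionContext):

var endPoints = connection.Features.Get<IConnectionEndPointFeature>();
if (endPoints != null)
{
    Console.WriteLine($"Accepted {endPoints.RemoteEndPoint} on {endPoints.LocalEndPoint}");
}
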
halter73 commented 5 years ago

Thanks!

@davidfowl and I discussed this. We think IConnectionListener.DisposeAsync() should either wait for all active connections to close naturally, or abort all the connections.

If we go the abort route, we'll probably need something similar to IConnectionListener.StopListening() to tell transports to close the listen socket. If we let the connections close naturally, we can probably just close the listen socket at the start of DisposeAsync().

davidfowl commented 5 years ago

I'm going to spike a prototype using these new APIs. I already have some design feedback from a hack session @halter73 and I did today, but I'll try to make Kestrel use it.

cc @tmds @mgravell

tmds commented 5 years ago

Great stuff!

ValueTask<ConnectionContext> ConnectAsync(System.Net.EndPoint endpoint); ValueTask<ConnectionContext> AcceptAsync();

Perhaps it makes sense to add a CancellationToken to these methods. ConnectAsync can take a long time. For AcceptAsync it's maybe not needed; IAsyncDisposable kind of acts as the CancellationToken.

@davidfowl and I discussed this. We think IConnectionListener.DisposeAsync() should either wait for all active connections to close naturally, or abort all the connections.

Maybe this can be moved to the server implementation. It simplifies the IConnectionListener implementation because it no longer needs to track the ConnectionContexts.

mgravell commented 5 years ago

Sounds promising (the goal; I haven't studied the specific API in depth), and it'll be a pleasure to be able to mark large chunks of Pipelines.Sockets.Unofficial as [Obsolete]. One big question I would raise early is: what is the assembly/namespace/package story here? since this issue is under aspnet not corefx, I have a sinking feeling... Basically, it is just really really weird if a client library or a desktop console needs to reference aspnetcore to talk to a socket... (whether as a client or a server)

benaadams commented 5 years ago

it is just really really weird if a client library or a desktop console needs to reference aspnetcore to talk to a socket...

AsyncSocketPower .NET ?

davidfowl commented 5 years ago

Sounds promising (the goal; I haven't studied the specific API in depth), and it'll be a pleasure to be able to mark large chunks of Pipelines.Sockets.Unofficial as [Obsolete]. One big question I would raise early is: what is the assembly/namespace/package story here? since this issue is under aspnet not corefx, I have a sinking feeling... Basically, it is just really really weird if a client library or a desktop console needs to reference aspnetcore to talk to a socket... (whether as a client or a server)

TL;DR, this design basically introduces a transport abstraction and makes it a bit more decoupled so it can be used standalone (without kestrel or the pipeline). Kestrel and SignalR become a consumer of this new API (like a pluggable TcpListener and TcpClient).

I see a couple of options around naming:

Depends on how bad people feel about the current naming (I know how important it is), but the big picture here is that we want to create an ecosystem around these things: middleware, servers, parsers, etc. I don't think the abstractions belong in corefx because this is a tiny layer above that with more opinions. If I had to draw a comparison, I would say this is ASP.NET Core but for the communication layer below, not tied to a specific transport.

cc @DamianEdwards if he has any ideas

mgravell commented 5 years ago

maybe it is neither aspnet nor corefx?

corefx <=== (some comms naming thing) <=== aspnet

but: I appreciate that this may be impossible at this point. It is just... a bit messy. But coupling too tightly to aspnet also has impact re transitive dependency graph explosion, etc.

davidfowl commented 5 years ago

Updated the proposal based on the prototype:

ReubenBond commented 5 years ago

@halter73

@davidfowl and I discussed this. We think IConnectionListener.DisposeAsync() should either wait for all active connections to close naturally, or abort all the connections.

Either way, this suggests that IConnectionListener owns the connections returned from AcceptAsync. There's nothing to stop a framework from tracking connections or even associating them with the IConnectionListener which they came from (if any). The requirement seems onerous, though.

Currently LibuvTransport owns the LibuvThread. Each LibuvConnection gets a reference to one of those LibuvThreads. Would reference counting be insufficient to ensure that each of those LibuvThreads are stopped when they're no longer needed either by the listener or one of the connections? I might still be misunderstanding.

EDIT: Thinking from Orleans perspective, we would need some internal tracking for both incoming & outgoing connections anyway, for when a node is evicted (declared dead) and we must forcibly terminate any connections with it.

tmds commented 5 years ago

Would reference counting be insufficient to ensure that each of those LibuvThreads are stopped when they're no longer needed either by the listener or one of the connections?

It's possible. It is probably simpler to keep them around, and then use them again when needed.

halter73 commented 5 years ago

I do want the LibuvThreads to get cleaned up before IConnectionListener.DisposeAsync/StopAsync completes. I think we can do this through a combination of reference counting the active connections and closing the listen socket.

davidfowl commented 5 years ago

I played with shutdown for a little and it seems like the simplest thing to do would be to implement graceful shutdown in the transports.

public interface IConnectionListener : IAsyncDisposable
{
    EndPoint EndPoint { get; }
    ValueTask<ConnectionContext> AcceptAsync();
    ValueTask StopAsync(CancellationToken cancellationToken = default);
}

The token is the signal for how long StopAsync waits before tearing all of the connections and state down.

This makes it easy to do graceful shutdown without a server like Kestrel in the middle, but you need to call StopAsync with a token to do graceful shutdown with a specified timeout. DisposeAsync could pick a default timeout as well (based on transport specifics).
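
For example (a sketch against this proposed shape, with listener being the bound IConnectionListener; what the transport does when the token fires is transport-specific):

// Give graceful shutdown a 30-second budget; after that the transport
// tears down the remaining connections and state.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
await listener.StopAsync(cts.Token);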

davidfowl commented 5 years ago

An alternative design is to pass a CancellationToken to AcceptAsync, this would be the signal to stop accepting new connections and graceful shutdown would be up to the caller. I think we should try to push that logic into the transports since it's so common.

tmds commented 5 years ago

I do want the LibuvThreads to get cleaned up before IConnectionListener.DisposeAsync/StopAsync completes. I think we can do this through a combination of reference counting the active connections and closing the listen socket.

Multiple listeners should be able to use the same LibuvThreads. Those LibuvThreads could get cleaned up when there are no more listeners and connections, but such a cleanup isn't functionally needed.

it seems like the simplest thing to do would be to implement graceful shutdown in the transports.

The complexity is the same whether it is implemented in the transport or the server: track the connections and dispose them.

I think we should try to push that logic into the transports since it's so common.

My feeling is this is making the transport more complex. It needs to track connections per listener. If this is in the server, it can be common across different transports. At the network level there is no relation between the lifetime of the accepted sockets and their listener socket, so it seems weird for the ConnectionListener to do that.

davidfowl commented 5 years ago

The complexity is the same whether it is implemented in the transport or the server: track the connections and dispose them

There’ll be an order of magnitude fewer transport implementers than consumers of the API. Ideally it should be possible to use the transport APIs directly and still implement graceful shutdown somewhat trivially. That means gracefully waiting for connections to close (for some threshold), then aborting them if that fails. Do we want that to be in everyone’s logic instead of providing helpers that transports can use to implement it?

tmds commented 5 years ago

Do we want that to be in everyone’s logic instead of providing helpers that transports can use to implement it?

Wanting to provide this as a re-usable piece of logic doesn't mean it has to be part of the listener.

It seems better separation of responsibility. The listener stays close to a listen socket. Tracking and disposing connections is responsibility of component calling Accept. There is no functional reason the listener needs to take this responsibility.

davidfowl commented 5 years ago

It seems better separation of responsibility. The listener stays close to a listen socket. Tracking and disposing connections is responsibility of component calling Accept. There is no functional reason the listener needs to take this responsibility.

While that's completely reasonable, it makes the listener extremely hard to use standalone. In order to shut down properly, the owner of the accept loop has to know when all of the accepted connections have completed before calling DisposeAsync on the listener. Remember, our transports currently manage their own memory, and it would be nice if the caller didn't have to do that to avoid ugly unexpected errors (or worse, use-after-free bugs!)

I'd like this to be possible:

https://github.com/aspnet/AspNetCore/blob/113f97bb2eff3ad57779c3ff893e2851202dda3c/src/Servers/Kestrel/samples/PlaintextApp/Startup.cs#L55-L90

The alternative is to expose a connection manager API that lives somewhere and needs to be used in conjunction with a listener in order to use it properly.

ReubenBond commented 5 years ago

In systems containing both clients and servers (or nodes which are both, eg Orleans silos), there's going to be some kind of connection tracking anyway.

Eg, Orleans actively closes connections to nodes which are declared dead (voted out by the cluster because they are not functioning). In that case the listener continues operating but some subset of connections are terminated. Clients and servers both do this, so both need some form of connection tracking.

The differences between incoming & outgoing connections largely disappear once the connection is established, so they should be treated similarly (if not identically).

Generally it's better for the process to terminate abruptly, ungracefully killing open connections, rather than to hang indefinitely because some connection remains open because the developer messed up their connection tracking. This is one of the reasons that I prefer the listener does not block until connections all close.

Maybe this potential pitfall is minor in comparison to the potential simplicity gained by having additional connection tracking in the listener's contract (and presumably the connection factory's contract?)

halter73 commented 5 years ago

There should be a convenient way to close the listener in a way that tells it to stop accepting new connections and at least tries to wait for all the already-accepted connections to be closed gracefully (perhaps with some timeout). This wouldn't only mean waiting for the sockets to close, but it would also mean waiting for DisposeAsync to be called on all the accepted ConnectionContexts as a means to ensure the app is done with all the pipes and therefore also the memory pool. Otherwise, like @davidfowl mentions, we're opening app developers up to potentially very subtle use-after-free bugs.

I think it would also be nice to have a way to tell the listener to immediately abort all of its open connections/sockets. I think even this API should attempt to wait a small period of time for all the app code to react to the aborted connections and dispose all the accepted ConnectionContexts once again in order to try to avoid use-after-free issues. This API should throw if the timeout is exceeded to indicate that the memory pool isn't safe to use. Technically, if the app developer tracks all the ConnectionContexts themselves, they could individually abort each connection. So this second API isn't strictly necessary, but nice to have.

tmds commented 5 years ago

This wouldn't only mean waiting for the sockets to close, but it would also mean waiting for DisposeAsync to be called on all the accepted ConnectionContexts as a means to ensure the app is done with all the pipes and therefore also the memory pool.

Do you want to make the lifetime of the memory pool match the lifetime of the listener? Even without listeners you need a memory pool to make outbound connections.

davidfowl commented 5 years ago

Do you want to make the lifetime of the memory pool match the lifetime of the listener?

Yes.

Even without listeners you need a memory pool to make outbound connections.

Not sure what you mean here. Listeners don't have outbound connections. You might be referring to the PR where I currently implement the IClientFactory on the SocketTransportFactory as well?

In theory, you can ref count connections to avoid disposing the memory pool too early (but then why not implement the entire connection tracking to simplify the logic).

I think we want to be in a place where this logic doesn't need to be done by consumers of the API since that makes it really hard to use.

tmds commented 5 years ago

Not sure what you mean here.

IConnectionFactory connections are not tied to a listener lifetime.

In theory, you can ref count connections to avoid disposing the memory pool too early

Another possibility is that the memory pool stays around (and maybe reduces size based on memory pressure).

davidfowl commented 5 years ago

IConnectionFactory connections are not tied to a listener lifetime.

See what I posted: the listener and IConnectionFactory are decoupled. They happen to be implemented in the same PR, but it's a draft so that's completely unrelated. Pretend they are different objects, the IConnectionFactory could

Another possibility is that the memory pool stays around (and maybe reduces size based on memory pressure).

Yeah, but that's complicated and we don't need to complicate things here. Reference counting would be a much easier solution to implement (track connections via ConnectAsync/DisposeAsync). Though, we didn't make the IConnectionFactory implement IAsyncDisposable (maybe it should).

halter73 commented 5 years ago

After considering this more, I do not think making the listener's dispose method wait for all its accepted connections to be disposed first is really necessary to avoid use-after-free bugs. After all, whoever rents memory from the pool has to explicitly return or dereference the memory for it to go back to the pool. However, I do think this would help developers avoid ObjectDisposedExceptions (ODEs) that could arise from disposing the pool immediately after disposing the listener.

If developers see that the listener's dispose isn't completing, that should be a sign that they haven't cleaned everything up. Having the listener's dispose throw an exception after a certain timeout could make it even more clear to the developers what's going wrong, but I'm concerned that such a timeout wouldn't work well if you're trying to implement IServer.StopAsync using an IConnectionListener. The problem is the CancellationToken passed into StopAsync can stay uncanceled for arbitrarily long periods of time, and until that token is canceled we don't want to start "ungraceful" shutdown and dispose accepted connections. Having IConnectionListener.DisposeAsync() throw before the token is canceled wouldn't be good.

Ultimately, I think we might need two different methods for gracefully vs. ungracefully closing the listener. DisposeAsync() could be the ungraceful method, and we could add a new graceful method we'll call StopAcceptingNewConnections() for now. Developers who want to support graceful shutdown could call StopAcceptingNewConnections() and wait until they close all of their already-accepted connections before calling DisposeAsync().

Why do we want a StopAcceptingNewConnections() method when you can simply stop calling AcceptAsync()? The main reason for this is that we want to close the listen socket ASAP so other servers can start listening on the endpoint and the listen backlog doesn't fill up with connections the server will never accept.

Alternatively, instead of adding a StopAcceptingNewConnections() method, we might choose to add a StopAsync(CancellationToken) method like the one on IServer.

Shrinking the memory pool is an idea we've been considering for years now, but it's tough to come up with a shrinking strategy that's ideal for all use cases. At a certain point, you might as well use unpooled arrays and let the GC shrink the heap for you.

davidfowl commented 5 years ago

An alternative to 2 APIs is a cancellation token as input to accept

halter73 commented 5 years ago

An alternative to 2 APIs is a cancellation token as input to accept

Yeah. I saw that. I think it's OK, but I also think it's a little weird to pass in a canceled token to AcceptAsync() as the only way to gracefully tell the listener to stop listening.

halter73 commented 5 years ago

I think it makes sense to have AcceptAsync() accept a CancellationToken anyway, but only cancel the current call to accept. I don't think it's obvious that cancelling a single call to accept tells the listener to stop accepting new connections permanently.

davidfowl commented 5 years ago

Agreed it was kinda weird when I prototyped it as well.

tmds commented 5 years ago

There is currently no API to cancel an on-going Accept: https://github.com/dotnet/corefx/issues/37118.

halter73 commented 5 years ago

There is currently no API to cancel an on-going Accept: dotnet/corefx#37118.

Yeah. I think disposing the listen socket is currently the only way to cancel an on-going accept using managed sockets. That would probably make it impossible to support a non-terminal cancellation of accepts in an IConnectionListener based on managed sockets until we get a new API.
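
In other words, with managed sockets today the pattern is roughly the following (sketch; listenSocket is an assumed System.Net.Sockets.Socket, and the exact exception depends on timing/platform):

// The only way to stop a pending accept is to dispose the listen socket
// and treat the resulting exception as "the listener was unbound".
try
{
    Socket accepted = await listenSocket.AcceptAsync();
}
catch (Exception ex) when (ex is ObjectDisposedException || ex is SocketException)
{
    // The listen socket was closed out from under the pending accept.
}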

davidfowl commented 5 years ago

I updated the PR and interfaces based on the discussion here. Graceful shutdown is now a problem of the caller.

ReubenBond commented 5 years ago

I've updated to the latest PR revision & everything is looking good so far - I'm happier with the shape of things.

Why does the ConnectionListener/ConnectionFactory own the memory pool?

davidfowl commented 5 years ago

Why does the ConnectionListener/ConnectionFactory own the memory pool?

They own the pool because they own the pipes, and pipes are memory-managed. In libuv, we allocate a pool per libuv thread, so it matters there for locality.

davidfowl commented 5 years ago

Updated the code in the original issue to reflect the latest API:

davidfowl commented 5 years ago

I implemented a barebones server with connection tracking to see what the API was like:

https://gist.github.com/davidfowl/6a4e18a870a0d418a92ebe623a9455e3

Comments welcome.
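
For anyone skimming, the shape being discussed is roughly the following (an illustrative sketch, not the gist's actual code; it assumes each accepted connection is tracked in a dictionary together with the Task returned by its handler):

async Task ShutdownAsync(
    IConnectionListener listener,
    ConcurrentDictionary<string, (ConnectionContext Connection, Task Completion)> connections,
    Task acceptLoop,
    TimeSpan timeout)
{
    // Stop accepting new connections and let the accept loop unwind.
    await listener.UnbindAsync();
    await acceptLoop;

    // Wait for in-flight connections to finish within the budget...
    Task allClosed = Task.WhenAll(connections.Values.Select(c => c.Completion));
    if (await Task.WhenAny(allClosed, Task.Delay(timeout)) != allClosed)
    {
        // ...then abort whatever is still running.
        foreach (var (connection, _) in connections.Values)
        {
            connection.Features.Get<IConnectionLifetimeFeature>()?.Abort();
        }
    }

    await listener.DisposeAsync();
}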

tmds commented 5 years ago

BindAsync takes a CancellationToken (it's async)

Socket.Binds are sync. So this is an async method to enable more scenarios (like MyOverlayNetworkEndPoint)?

EndPoint

Should we consider using something other than EndPoint? The EndPoint seems beneficial mostly to the Sockets implementation. An alternative could be to have a url string. Extension then doesn't rely on subclassing EndPoint.

UnbindAsync

What is the difference between UnbindAsync and Disposing the IConnectionListener?

Some comments/questions about the example:

// Wait for existing connections to complete gracefully

This seems a bit weird since the peers don't get a notification the server is shutting down.

// Not an issue, graceful shutdown started

What is causing the OperationCanceledException on graceful shutdown?

davidfowl commented 5 years ago

Socket.Binds are sync. So this is an async method to enable more scenarios (like MyOverlayNetworkEndPoint)?

Yes, it’s required in other cases that aren’t sockets based. I’m actually looking to implement a listener that’s basically a multiplexed outbound connection. Binding in that case is asynchronously connecting to an external endpoint.

Should we consider using something other than EndPoint? The EndPoint seems beneficial mostly to the Sockets implementation. An alternative could be to have a url string. Extension then doesn't rely on subclassing EndPoint.

I really dislike the string/url binding:

This could be object or a marker interface. The entire point of it is that it’s extensible from the outside.

What is the difference between UnbindAsync and Disposing the IConnectionListener?

Unbind means stop listening, which must then be followed up by waiting on the accept loop (as depicted in the sample). Calling DisposeAsync shuts everything down, potentially not allowing things to unwind (ungraceful shutdown). In sockets, it disposes the memory pool, and in libuv it shuts the threads down.

This seems a bit weird since the peers don't get a notification the server is shutting down.

How would they be, if nobody told the connection it was being shut down? That’s what the shutdown token is for.

What is causing the OperationCanceledException on graceful shutdown?

The shutdown token being passed into ReadAsync
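
Concretely, a per-connection handler following that pattern might look like this (sketch, not the sample's actual code): the shared shutdown token goes into ReadAsync, and the resulting OperationCanceledException means graceful shutdown started rather than an error.

async Task PumpAsync(ConnectionContext connection, CancellationToken shutdownToken)
{
    while (true)
    {
        ReadResult result;
        try
        {
            result = await connection.Transport.Input.ReadAsync(shutdownToken);
        }
        catch (OperationCanceledException)
        {
            // Graceful shutdown started; stop reading and unwind.
            break;
        }

        // ... parse/handle result.Buffer here ...
        connection.Transport.Input.AdvanceTo(result.Buffer.End);

        if (result.IsCompleted)
        {
            break; // the remote side finished sending
        }
    }

    await connection.DisposeAsync();
}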

tmds commented 5 years ago

Thank you for clarifying. The main thing I realized now is how much the example is influenced by the scoping of resources (memory pool/Libuv threads) to the IConnectionListener. That scope makes it not possible to share the libuv threads between multiple IConnectionListeners and IConnectionFactory. It also means that any valid use of Memory instances must be within that scope. So Memory can't be passed safely between Pipes without copying (because their scope may be different). Is that correct?

davidfowl commented 5 years ago

That scope makes it not possible to share the libuv threads between multiple IConnectionListeners and IConnectionFactory.

It would require reference counting. We could look at making the factory disposable as well in order to give more flexibility to implementations.

So Memory can't be passed safely between Pipes without copying (because their scope may be different). Is that correct?

There’s no feature to do that, with or without listeners.

tmds commented 5 years ago

There’s no feature to do that, with or without listeners.

There is a feature request: https://github.com/dotnet/corefx/issues/29022 😄

clairernovotny commented 5 years ago

Is Multicast being considered in the design for these abstractions? If not, is there a plan for improving multicast support?

davidfowl commented 5 years ago

No, this is purely for point-to-point, connection-oriented protocols. Nothing is being done to address connectionless scenarios yet.

davidfowl commented 5 years ago

Keeping this open because only the server side has been implemented.

ReubenBond commented 5 years ago

How should we handle the client side implementations? SignalR & Orleans are both good candidates.

I will update Orleans to use the latest version of the server bits & make similar changes to the client bits.

A common-sounding namespace may help adoption (eg Microsoft.Extensions.Connections or System.Net)

davidfowl commented 5 years ago

How should we handle the client side implementations? SignalR & Orleans are both good candidates.

I planned to do SignalR next. You can totally do Orleans with the current APIs.

I will update Orleans to use the latest version of the server bits & make similar changes to the client bits.

That would be good!

A common-sounding namespace may help adoption (eg Microsoft.Extensions.Connections or System.Net)

Agreed but that would be a mega breaking change. Not impossible but we need to discuss it as it's very late in the cycle.

mgravell commented 5 years ago

Pretty sure I've brought that up every cycle :)

But: if you need all-hands, let me know and I can play with things like SE-Redis (which has mostly client, but also a toy server).

davidfowl commented 5 years ago

The naming is a super hard break, we can’t just do the client, the server needs to be broken as well. This means we need to break IFeatureCollection as well (or just deal with that dependency?)

analogrelay commented 5 years ago

@davidfowl Is the server work for this done? Is something separate tracking the SignalR work? If so, I'd suggest closing this since there's no more actionable work for servers here.

davidfowl commented 5 years ago

@davidfowl Is the server work for this done? Is something separate tracking the SignalR work? If so, I'd suggest closing this since there's no more actionable work for servers here.

Yes the server work is done. Can we instead turn this into an EPIC issue and link to other bugs?

analogrelay commented 5 years ago

I guess? Seems overkill.