Azure / azure-relay-dotnet

☁️ .NET Standard client library for Azure Relay Hybrid Connections
https://docs.microsoft.com/en-us/azure/service-bus-relay/relay-what-is-it
MIT License
36 stars 33 forks source link

Missing / Not Acknowledged response from HybridConnectionListener #115

Open deadwards90 opened 5 years ago

deadwards90 commented 5 years ago

I have a .NET Core Windows Service that connects to two Hybrid Connections with code similar to the following:

Azure Relay Listener - This does the actual handling of the request. https://gist.github.com/deadwards90/5e5b9972ac87af1b48e9c9873e6d50e7

Worker - Implementation of the HostedService, starts a collection of AzureRelayListeners (in my case two listeners pointing at different connections) https://gist.github.com/deadwards90/cdcaffcbeada6045710e9cf4d3674c42

DI - Adds mulitple AzureRelayListeners, picked up by the Worker. The RequestHandler is a Singleton but I'm not sure that would make a lot of difference. https://gist.github.com/deadwards90/7ff54966fe24c98973ce9fc91e5f71e8

On occasion, sometimes a connection appears to get into a strange state where it will receive messages and respond in the same way it usually does but the requesting application never receives the response.

So what I can see is a log saying that the operation has been run on the box, no errors or other information like a normal request but then the request application just times out waiting for the response.

This seems to have been exacerbated (or it's a coincidence) by having the application connect to two Hybrid Connections.

Expected Behavior

All responses are received by the calling application when the Stream is disposed.

Actual Behavior

After some time, responses on a connection fail to return to the calling application after the Stream has been disposed of

Versions

dlstucki commented 5 years ago

Would you be able to share details from the error messages you receive? Even something like this with the TrackingId and Timestamp would be awesome:

The request was routed to the listener but the listener did not respond in the required time. TrackingId:abf5350a-c43b-43db-b12f-8c905b826c5e_G1, SystemTracker:{REMOVE_THIS_IF_YOU_WANT}, Timestamp:2019-09-23T17:15:39

Or you can email me directly: dstucki (at) microsoft.com

dlstucki commented 5 years ago

There are some internal issues that are currently being ironed out which unfortunately lead to errors like "The request was routed to the listener but the listener did not respond" even though the listener did indeed respond. We're actively working on getting those fixed. If the HTTP request or response is > 64KB then this issue would not be seen.

deadwards90 commented 5 years ago

Thanks for getting back to me.

Where can I find the Tracking Id? I'm using App Insights and an Azure function to do the communication with the listeners.

dlstucki commented 5 years ago

Well, there have been a few "known" issues happening in the last day or so. That was going to be one way to check if you were in one of the locations where those had been occurring without asking you for your X.servicebus.windows.net namespace. I could look it up given a TrackingId like that, given your namespace, or you could run nslookup X.servicebus.windows.net (replacing "X" with your namespace) and tell me the *.cloudapp.net dns name.

e.g. nslookup dstucki-relay-usw.servicebus.windows.net gives ns-sb2-prod-by3-003.cloudapp.net.

deadwards90 commented 5 years ago

I'm getting ns-sb2-prod-ln1-001.cloudapp.net

wmmihaa commented 3 years ago

Was this ever resolved?

deadwards90 commented 3 years ago

@wmmihaa Not to my knowledge, but I have since left the place I was working so I don't know if they ever found a workaround.

TorstenL commented 3 years ago

Hey Guys,

it looks like we observe the same/ similar issue here. Message source is a Azure EventGrid however.

Even if we strip down the handler to the very basic:

private void RequestHandler(RelayedHttpListenerContext context) { try { var content = new StreamReader(context.Request.InputStream).ReadToEnd(); _logger.LogInformation(content); context.Response.StatusCode = HttpStatusCode.OK; context.Response.Close(); } catch (Exception e) { _logger.LogError(e, "Error on EventGridHandler RequestHandler"); } }

we seeing successful delivery attempts for around 1-2 minutes. Thereafter we observing a raising handle and thread count (at process level) followed by event TaskCanceledExceptions.

System.Threading.Tasks.TaskCanceledException: A task was canceled. at Microsoft.Azure.Relay.HybridConnectionListener.ControlConnection.SendCommandAndStreamAsync(ListenerCommand command, Stream stream, CancellationToken cancelToken) at Microsoft.Azure.Relay.HybridHttpConnection.SendResponseAsync(ResponseCommand responseCommand, Stream responseBodyStream, CancellationToken cancelToken) at Microsoft.Azure.Relay.HybridHttpConnection.ResponseStream.CloseAsync() at Microsoft.Azure.Relay.HybridHttpConnection.ResponseStream.CloseAsync() at Microsoft.Azure.Relay.RelayedHttpListenerResponse.CloseAsync() at Microsoft.Azure.Relay.RelayedHttpListenerResponse.Close() at DisturbanceAnnotation.EventGrid.EventGridHandler.RequestHandler(RelayedHttpListenerContext context)

from time to time the "Connecting" Handler is called in this moment the "LastedError" Property contains:

Microsoft.Azure.Relay.RelayException: The WebSocket is in an invalid state ('Aborted') for this operation. Valid states are: 'Open, CloseSent'. TrackingId:7786bf4f-800d-4f6f-afb9-01b7710f2767, Address:sb://"not for public :) ", Timestamp:9/3/2021 8:52:21 PM ---> System.Net.WebSockets.WebSocketException (0x80004005): The WebSocket is in an invalid state ('Aborted') for this operation. Valid states are: 'Open, CloseSent' at System.Net.WebSockets.WebSocketValidate.ThrowIfInvalidState(WebSocketState currentState, Boolean isDisposed, WebSocketState[] validStates) at System.Net.WebSockets.ManagedWebSocket.ReceiveAsync(ArraySegment`1 buffer, CancellationToken cancellationToken) --- End of stack trace from previous location --- at Microsoft.Azure.Relay.HybridConnectionListener.ControlConnection.ReceivePumpCoreAsync() --- End of inner exception stack trace ---

In AzureDiagnostics following exceptions can be found

ResourceProvider: MICROSOFT.RELAY OperationName: Microsoft.Relay/HybridConnections/ListenerReturningHttpResponseFailed Message:_ Returning HTTP response with failure response code: NotFound due to: EndpointNotFoundException: There are no listeners connected for the endpoint. TrackingId:9989a2b8-47f1-4871-9b56-5c006a6f4ef1_G5, SystemTracker:"not for public :) ", Timestamp:2021-09-04T08:49:05

However we still get new "RequestHandler" callback invocations. I think the problem might be related to the non blocking Task.Run in "Microsoft.Azure.Relay.HybridHttpConnection" -> CreateAsync() that might lead to a thread pool exhaustion if the "CloseAsync()" calls takes longer than expected

public static async Task CreateAsync(HybridConnectionListener listener, ListenerCommand.RequestCommand requestCommand, WebSocket controlWebSocket) { var hybridHttpConnection = new HybridHttpConnection(listener, controlWebSocket, requestCommand.Address);

        // In this method we're holding up the listener's control connection.
        // Do only what we need to do (receive any request body from control channel) and then let this Task complete.
        bool requestOverControlConnection = requestCommand.Body.HasValue;
        var requestAndStream = new RequestCommandAndStream(requestCommand, null);
        if (requestOverControlConnection)
        {
            requestAndStream = await hybridHttpConnection.ReceiveRequestBodyOverControlAsync(requestCommand).ConfigureAwait(false);
        }

        // ProcessFirstRequestAsync runs without blocking the listener control connection:
        Task.Run(() => hybridHttpConnection.ProcessFirstRequestAsync(requestAndStream)).Fork(hybridHttpConnection);
    }
dlstucki commented 3 years ago

Hi @TorstenL, sorry it's taken so long to respond. Can you update the RequestHandler to do the following?

  1. Call response.CloseAsync
  2. Dispose the StreamReader instance
  3. Read the stream contents asynchronously
  4. Ensure that logger.LogInformation(content) isn't doing blocking work as this handler will be invoked on a threadpool thread. If it is doing work that takes too much time then it may need to be updated to do its work asynchronously as well.
  5. This probably isn't important but adding the .ConfigureAwait(false) calls in conjunction with all the await keywords can help rule out SynchronizationContext issues (like ASP.NET, WCF, WPF apps might encounter).
// Note: When using 'async void' on a method it's crucial to not allow unhandled exceptions 
// (addressed by the catch block at the end)
async void RequestHandler(RelayedHttpListenerContext context)
{
    try {
        using (var streamReader = new StreamReader(context.Request.InputStream)) { // Item 2
            string content = await streamReader.ReadToEndAsync().ConfigureAwait(false); // Item 3, Item 5
            _logger.LogInformation(content); // Item 4 (review)
        }

        context.Response.StatusCode = HttpStatusCode.OK;
        await context.Response.CloseAsync().ConfigureAwait(false); // Item 1, Item 5
    }
    catch (Exception e) {
        _logger.LogError(e, "Error on EventGridHandler RequestHandler");
    }
}

Try with these modifications and let me know if anything changes or the problem persists.

TorstenL commented 3 years ago

Hi @dlstucki, thanks for your answer. Nice to see that this library hasn't be abandoned. :-)

Well i would love to use async processing in the requesthandler.

But as the HybridConnectionListeners RequestHandler callback handler signature is Action<RelayedHttpListenerContext> RequestHandler - i would not expect that i have to use a new not awaited task. :-)

Please let me quickly sketch out our use case. We’re using the HybridConnectionListener as “Proxy” between our webapp and a azure eventgrid. Naturally the EventGrid subscription stays active the whole time as it statically configured in the azure portal. This might however result in a situation where thousands of small undelivered messages are piled up in the subscription queue if the webapp is not available for some time. Which was the case as we observed this problem.

Please correct me if i’m wrong, but if I look in the HybridConnectionListener implementation it looks to me, that for every single message a new not awaited task is created.

Wouldn't this lead to either timeouts if it takes too long to process a message as of a tremendous amount of parallel running tasks or thread pool starvation if synchronous calls are used ?

dlstucki commented 3 years ago

@TorstenL, The thought behind HybridConnectionListener.RequestHandler being merely an Action instead of a Func<RelayedHttpListenerContext, Task> was to avoid allocating a Task instance if that wasn't needed. User's don't have to implement a "not awaited task". They don't even need to use Tasks, just need to call context.Response.Close[Async]() whenever and wherever they're finished processing the request/context. It's perfectly fine for user code to call context.Response.CloseAsync on another thread sometime after returning from the RequestHandler Action itself.

The HybridConnectionListener SDK also uses the .NET ThreadPool worker threads for it's various internal async operations. If anyone is taking too long to process items or making synchronous calls which take time on the .NET ThreadPool worker threads then Relay/HybridConnectionListener will also be affected by threadpool starvation. Also worth noting, this library had supported UWP which didn't even allow use of the System.Threading.ThreadPool class (See ActionItem.cs).

Please correct me if i’m wrong, but if I look in the HybridConnectionListener implementation it looks to me, that for every single message a new not awaited task is created.

What you describe sounds accurate. We use Task.Run to queue a workitem to the .NET ThreadPool for invoking the user's RequestHandler so that the HybridConnectionListener's "control connection" can start looking for the next message from the cloud service in parallel. Is your concern that there's no way for a listener implementation to introduce back-pressure leading to the listener processing code slowing down? I suppose throttling like that might be a missing feature. Something like "MaxPendingRequests"?

TorstenL commented 3 years ago

@dlstucki Yes that makes sense. And yes, something like "MaxPendingRequests" would be exactly the missing feature which would be very helpful in our (HybridConnection<->EventGrid) usecase.