dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
9.94k stars 2.02k forks source link

What is the idiomatic way to handle external connections? #6829

Closed highlyunavailable closed 2 years ago

highlyunavailable commented 3 years ago

This is kind of a broad "design pattern" question. and appears to be somewhat related to #6819 but I've got an Orleans app that ends up acting as a websocket server and I'm wondering what the idiomatic way to handle stuff like failure/timeout of the remote endpoint is.

The idea is that I have a JSON-RPC connection over a websocket with a remote server (a game server) and the game clients connect to their own JSON-RPC websockets and do stuff - e.g. join a server and a team on that server. The server grain needs to push notifications to the server via a websocket call and wait for the server to confirm that it received the data, but it also needs to be able to cleanly handle the server not responding or being connected yet.

This feels almost like a problem for interleaved calls, and right now I've got it implemented as two methods - a start method that sends the events to an Observer in the ASP.NET controller that the server connects to, which starts a new task on the ASP.NET controller side, and then immediately returns. The second method is called from the task when it completes or times out to notify the grain of the state of the call. I've ended up making it a callback based system, and I hate it, but I have problems figuring out how to use interleaved calls for this. I can't block the server grain - it's got to be able to keep accepting players and trying to send them off to the game server, otherwise it hangs the entire callstack (which includes a grain that has to service calls extremely quickly because it's the matchmaker for a portion of an entire geographic region) while it waits.

enewnham commented 3 years ago

I would advise trying to keep a clean stack trace. MethodA -> MethodB -> MethodC -> MethodD.

To pull this off your ServerGrain will have to be re-entrant and use concurrent collections and locks. However that goes against single responsibility paradigm and nanoservices architecture that Orleans can accomplish.

Why not have one grain per client that is normal? And then one master server grain that is re-entrant?

That way the "long-running" tasks can be blocking as a normal method. And the server grain have locks and concurrent collections.

highlyunavailable commented 3 years ago

I'm pretty sure you're misunderstanding my use case here. Also locks/concurrent collections wouldn't be needed even in a server grain as even re-entrant grains are single-threaded from my understanding, but they can split execution so another method can start and modify data while another method is awaiting.

I'd love to use reentrant grains but they were significantly more difficult to debug than just using a "callback" which is why I was asking for help with best practices around this.

enewnham commented 3 years ago

Oh yeah. I agree... Hrmmmmm.

My understanding is you're trying to synchronize state between a client and a grain doing long-running tasks?

My solution has been to allow the long-running task be executed on a re-entrant grain using external tasks . I then use regular plain-old fashioned method calls to kick off and wait for the long running task to complete. This allows stack trace debugging to be significantly easier.

The client GUI should understand that it's awaiting a JSON-RPC call that's gonna take a while and not block it's UI thread.

Now for long-running/backgrounded task workflow. I would start googling around workflow paradigms and design patterns. https://github.com/danielgerlag/workflow-core or Akka.Net and Orleankka come to mind too.

My main friction was co-ordinating the GUI and backend seamlessly without a mountain of boiler-plate code... So that's my recommendation.

/// Inside my re-entrent Actor
        public async Task<int> StartTask( )
        {
            var task = await m_HttpTaskFactory.StartNew( ( ) => DoLongTask( ) );

            m_Tasks.TryAdd( taskId.GetHashCode( ), task );

            return task.GetHashCode( );
        }

        public async Task<Result> TryGetTask( int id, int timeout )
        {
            // Orleans default timeout is 30sec
            Debug.Assert( timeout < 30000 );

            if( !m_Tasks.TryRemove( id, out var task ) )
            {
                throw new ArgumentOutOfRangeException( nameof( id ), $"Task {id} does not exist" );
            }

            var cts = new CancellationTokenSource( );
            var timeoutTask = Task.Delay( timeout, cts.Token );

            try
            {
                if( await Task.WhenAny( task, timeoutTask ) == task )
                {
                    return await task;
                }
                else
                {
                    // If timed out, add back to the queue
                    m_ProxyTasks.TryAdd( id, task );

                    // null respsonse means pending
                    return null;
                }
            }
            finally
            {
                cts.Cancel( );
                timeoutTask.Dispose( );
            }
        }

/// Inside my JSON API controller code
        public async Task DoAndWait( )
        {
            var id = await actor.StartTask( elevatedAccess );
            do
            {
                var result = await actor.TryGetTask( id, 10000 );
            }
            while( result == null );
        }

I should note a lot of those above paradigms I've put into a helper utilis file

highlyunavailable commented 3 years ago

No, it's a 3 way connection.

Client -> Backend (client grain -> server finder grain -> server grain) -> External server -> reply comes back to backend -> backend does bookkeeping -> replies to client

The problem is when the external server fails (which, admittedly, is rare, but very possible). This jams up the whole works and then no client can communicate with that server or server finder because the original client call has (correctly) blocked the both of them.

Interesting that you're using a long running task for this - this is nearly exactly what I did but inside the HTTP controller that handles the server connections instead of inside the grain code. Once the task completes or fails, it uses callback methods on the server grain to notify it of the result. The end result doesn't seem much different than what you're doing except semantically, but this does help a bit and give me a few good ideas. I was trying to avoid using external tasks for things but sometimes I have to I guess.

sergeybykov commented 3 years ago

I don't quite see a way to make this call chain responsive without making server finder and server grains reentrant. If the requests they receive are always or mostly in a context of a client (without a unique client ID I assume), it should be relatively easy to keep state that server finder and server grains hold on behalf of each client (or even a client request) isolated from each other. With close to no shared state within those grains and explicit per-client data structures debugging should be easier I think than in a typical reentrant case.

Another alternative is to use one-way calls and observers between either client and server finder or server finder and server grains. That's effectively going back to callbacks. I don't like this option because of its one-wayness and higher complexity.

highlyunavailable commented 3 years ago

I have no issues with making it reentrant and I'd love to do so! Observers and callbacks is exactly how I'm doing it now, in fact. I was just trying to figure out a better way to do it. Part of the difficulty is that the server connects to the backend and then the client initiates the team find, which must be communicated to the server's "stateless frontend" connection. At the moment, I'm using client observers, and the reply from the server is what calls the callback on the backend server grain to confirm that something has happened.

I do have a unique client ID and yeah, the callback method needs pretty much the same bookkeeping as the reentrant method, so it probably wouldn't be that hard to convert. I think part of my problem was pushing data from the server grain to the server connection. I tried using streams but they block while waiting for the server to respond, so I went to observers, but then I had no way to tell if the call succeeded so I turned it into a callback.

sergeybykov commented 3 years ago

There's also a way to call client side objects reliably (to get a confirmation or an error back) instead of using observers. That's what the SMS stream provider uses.

highlyunavailable commented 3 years ago

I'll take a look. I do think I'm having a bit of an X Y problem here but you've helped out a lot. Thanks a bunch!

highlyunavailable commented 3 years ago

@sergeybykov Just to confirm, you're talking about the IGrainExtension stuff here, right? Otherwise I'm not seeing a reliable way to talk to clients, and IGrainExtension isn't exactly documented so it's not clear what I could do to bind an extension to the client. I'm also unable to see how it gets the GrainReference to the client itself to pass into stuff like the stream manager or the grain extension binder, and I'm not sure what to search for in the code since there are so many layers of indirection and interfaces.

It also appears that the ability to attach an extension isn't exposed to the world in 3.3.0 either, so that's out.

highlyunavailable commented 3 years ago

Wait a sec! I just tried something with Observers that the docs don't support and it worked. It appears that they can successfully have non-void return types and properly pass back values. Is this something that is unintended/doesn't work if you're not co-hosting the client and grains? It appears that it calls GrainMethodInvoker just fine to tell the client to do something, then handles the await and passes the data back.

I have a class like so:

    public interface ICat : IGrainObserver { Task<int> Meow(int volume); }
    public class ClientJrpc : IClientRpc, ICat {
        public async Task Pet(Guid userId)
        {
            var observer = await clusterClient.CreateObjectReference<ICat>(this);
            var grain = clusterClient.GetGrain<ICatPetter>(userId);
            await grain.OnCatAppears(userId, observer);
        }
        public Task<int> Meow(int volume)
        {
            //throw new Exception("Oh no!");
            return Task.FromResult(volume * 2);
        }
}

If I do the following in the grain:

            try
            {
                var loudMeow = await x.Meow(100);
            }
            catch (Exception e)
            {
                log.LogError(e, "From observer");
            }

it works fine and properly invokes the method/returns the expected result or exception (which logs from the context of the grain).

sergeybykov commented 3 years ago

I'm not surprised it worked. We used to enforce the rule that observer interfaces only included void methods in codegen. I believe we removed it a long time ago. CreateObjectReference<T>() still requires T to be IGrainObserver. Maybe we can even remove that requirement altogether?

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 and is being closed due to lack of activity.