dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Supporting one way grain calls in Orleans #2487

Closed ashkan-saeedi-mazdeh closed 7 years ago

ashkan-saeedi-mazdeh commented 7 years ago

Related to the discussion at #2482. My original intent in working with an actor framework is ultimately to build a server-side game engine for a massive real-time world, and to achieve what my team and I failed to achieve in previous attempts. It is hard and can fail again (I am aware).

I think, and some of my colleagues agree, that the root problem with our previous approaches, and with most papers discussing the subject, is that all engines up to now build a set of small synchronous worlds that are connected to each other on the server. We want to turn the problem around by making everything, including the main update loop, asynchronous. Yes, that will need lots of care when building gameplay logic.

That leaves a few options: Erlang, which is a really good framework but a hard-to-use technology with obscure syntax, a lack of good libraries, and not-so-great tooling; Akka, which is Java (don't get me started on that); Akka.NET, which is not as mature as the others to begin with; and the awesome, forward-thinking Orleans, with its great community and the backing of such great minds at Microsoft.

To make such a thing possible in Orleans, we need to optimize for low latency and even higher throughput and performance. The two blocking things are

A third one comes up with one-way messages: monitoring actor health. That partially takes us back to square one, and maybe we are essentially rebuilding Erlang-style actors.

To solve this, I hope to use the time off over the New Year holidays to try adding support for


public async void Receive()
{
}

to Orleans. Depending on people's thoughts on the serialization overhead of RPCs, we can go for a global Receive() or one-way RPCs.

I'd like to know your thoughts on this, @sergeybykov @gabikliot @jason-bragg @ReubenBond.

Also, if I do find the time to try this out, is there any chance that it gets merged into the Orleans framework?

There are two conflicting goals here. Orleans tries to be a fast and productive way of building actor systems, and at the same time it is, at the moment, the only .NET-based actor framework that I feel I can rely on, IMHO (with supporting facts, of course). So the big question is: can we make Orleans more powerful and suitable for more scenarios, so that developers who are comfortable with it can have much more control over these matters and achieve higher throughput and lower latency when the situation demands it, while the most visible and recommended way of building software with Orleans remains the default placement strategy with bidirectional RPCs?

P.S. For failures, I think a stream that announces actor failures is a nice monitoring approach, though I have not given it much thought yet. I am sure I don't want to rebuild supervision trees if possible. The intention here is to try new, wildly different, better ideas, right? :)

jason-bragg commented 7 years ago

@ashkan-saeedi-mazdeh, For clarification, are you and @yindongfei working on the same project, or just related initiatives?

ashkan-saeedi-mazdeh commented 7 years ago

@jason-bragg No. I don't even know the guy. They seem to be from China, and @galvesribeiro showed their thread to me.

I don't think their project is even similar.

Currently I'm building a game backend with Orleans. This is something that Gutemberg, some others, and I talked about. It was initially the reason I got interested in Erlang years ago, when I was employed at www.muchdifferent.com, and after that in Orleans.

jason-bragg commented 7 years ago

The topic of one-way messages has come up a couple of times before, and part of the difficulty has been settling on what behavior one way messages should have.

One-way messages seem to mean different things to different people. Some expect UDP-level semantics with no ordering or reliability guarantees. Others expect ordered delivery, retry logic, and an error if the message can't even be sent. Some even expect confirmation that the messages were received.

With the above in mind, I'd ask that you take some time to clarify your requirements for one-way messaging, including ordering and reliability expectations, along with expected error cases.

On the wider topic of MMO's using Orleans, this is an interesting problem space, and one I've spent some time in the past thinking about, so I'm excited to see what you all come up with.

The below is just my opinion, take with large grain of salt :)

RPCs are well suited for discrete actions between entities. That is, {"Jim" hits "John" with axe} style interactions. Games have many of these types of interactions, but they also tend to consist of real-time simulations that are poorly represented by discrete actions, or at least by coarse-grained discrete actions. Movement is an example of this. While a player may not notice a couple hundred milliseconds of delay in the processing of hitting another player with an axe, they will notice such a delay in the processing of their movement.

The RPC-style calls Orleans uses between grains are well suited for discrete actions, but real-time simulations are probably better served by constant flows of data. One-way messages may work for this, or possibly a custom stream provider.

Even though RPCs are well suited for discrete actions between entities, while considering distributing work for performance reasons (not for scaling) the granularity of the work being distributed should be considered. One of the reasons MMO games tend to be zone based is that this allows lightweight interactions between the entities most likely to interact with each other (those in the same zone). This matters because many of the interactions between players (and NPCs) are so small they're not worth distributing. If hitting John with an axe is simply a case of checking his armor against Jim's attack skill and removing 2-8 points of health if it's lower, then the cost of sending the request to do that calculation to another server is more than the cost of the operation itself. While asynchronous operations are very important for scaling and multithreaded simulations, synchronous processing is probably preferable whenever possible, simply for performance reasons.

galvesribeiro commented 7 years ago

@jason-bragg I was talking with @ashkan-saeedi-mazdeh yesterday about it, and to me it looks like a specialized in-memory stream provider (not persistent or external) would be better than having regular one-way (or fire-and-forget) method calls between grains. However, I wonder what the messages-per-second limit would be for that kind of stream... We did a rough estimation, and an ideal number for a "WoW-clone" MMO would be 20M messages per second... Is that something you feel is doable? Sergey suggested batching messages somehow, but I'm not sure I understood correctly...

All the other interactions, like talking to NPCs, using items, and making transactions, would be implemented as regular grain method calls.

What needs to be "streamed" are the realtime events like movement, combat, etc...

That way the world state would be kept in memory as a whole living thing.

What do you think?

jason-bragg commented 7 years ago

I think @sergeybykov is absolutely correct about the batching. The messaging overhead of the RPC model becomes negligible if one batches multiple events into a single grain call. We do this in the streaming system already, but it's a bit clunky at the moment and would need more work.

I am unclear as to where the 20M messages per second calculation comes from. Is that movement messages for all of the players in a zone? Is that world state replication? Is that over a single stream, or an aggregation of message counts across all player streams in a zone? In general I think this is quite possible, I would just like to understand the numbers a bit better.

For fun, let's assume we're sending 20M events per second from a single producer over a single stream. If the sender awaited the task returned by OnNextAsync, the max throughput would likely (minimally) be around ~1000ish events per second. If, however, they instead attached failure handlers to the task and moved on to calling the next OnNextAsync without awaiting it, the messages could be queued up locally and sent in batches over the previously mentioned ~1000ish grain calls per second. So 20M events per second would, under the covers, be converted into ~1000ish grain calls of 20,000 events each. It should be noted that this type of optimization assumes dense streams. If instead one was sending 20 events per second over 1M streams, this would not work as well because of the inability to batch consumption (producers could still be fine, but consumption would be an issue). It should also be noted that the per-event latency would still be limited to that of the underlying RPC call.
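To make the arithmetic above concrete, here is a minimal sketch of local batching. It is not Orleans code: `EventBatcher<T>` and `BatchMath` are hypothetical names, the send delegate stands in for a single grain call, and the sketch is single-threaded for clarity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Orleans: buffers events locally and
// hands everything queued so far to a delegate as one batch per call.
public sealed class EventBatcher<T>
{
    private readonly List<T> _pending = new List<T>();
    private readonly Action<IReadOnlyList<T>> _sendBatch;

    public EventBatcher(Action<IReadOnlyList<T>> sendBatch)
    {
        _sendBatch = sendBatch ?? throw new ArgumentNullException(nameof(sendBatch));
    }

    // Called by the producer for every event; it never waits on the network.
    public void Enqueue(T item) => _pending.Add(item);

    // Called once per available call slot; returns how many events were sent.
    public int Flush()
    {
        if (_pending.Count == 0) return 0;
        T[] batch = _pending.ToArray();
        _pending.Clear();
        _sendBatch(batch);
        return batch.Length;
    }
}

public static class BatchMath
{
    // Events each call must carry to sustain the target rate (ceiling division).
    public static long EventsPerCall(long eventsPerSecond, long callsPerSecond)
        => (eventsPerSecond + callsPerSecond - 1) / callsPerSecond;
}
```

With the figures from this comment, `BatchMath.EventsPerCall(20_000_000, 1_000)` gives 20,000 events per call. The latency caveat still applies: an event waits in the buffer until the next flush.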

shanegrueling commented 7 years ago

I think we should split this into two parts.

There are two main problems: how do we get enough movement messages into the grain, and how do we get the state replication to the players? For the second part I would propose that the grain sends its state every 16 ms to a stream to which every connected player listens. That can be fire and forget, since the zone doesn't care about errors. It's just broadcasting its status. So that should be doable, right?

The first part is where the problem lies, if I understand it correctly. The easy solution would be to make the area a grain supports only so big that it handles, on average, a doable player base. If we assume a grain only needs to support 1,000 players, at a tick rate of 60 tps that is still 60,000 events per second. That's still quite high for what Orleans can do currently, right?

The thing is, I don't see how we could fix this with one-way messages. As far as I can see, the problem is not the client but the zone grain, which needs to receive all of these events. So how would one-way messages or a stream help when the problem is the receiving end, not the sending end?

I think the tick rate is still debatable. For an FPS you would need it this high or higher, but I think you can lower it for an MMORPG and still have good gameplay.

ashkan-saeedi-mazdeh commented 7 years ago

@jason-bragg Thanks for the reply, was up until 4AM on Thursday so was effectively half dead yesterday :)

> One-way messages seem to mean different things to different people. Some expect UDP-level semantics with no ordering or reliability guarantees. Others expect ordered delivery, retry logic, and an error if the message can't even be sent. Some even expect confirmation that the messages were received.

I am fully clear in my mind about the definition. Message reliability and ordering guarantees are an orthogonal concern which should be handled by the transport protocol (RUDP, UDP, TCP, ...). Also, I don't want any confirmations; effectively that would mean sending an ack back with, at minimum, the ID of the message sent. When I say one-way message, I mean a message sent with an at-most-once delivery guarantee and without any response sent back. To detect grain failures, timeouts and failure monitoring should be used, somewhat like the way Erlang works with its monitors, error kernel, and so on.
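As a rough illustration of these semantics (not an Orleans API and not a proposed design), the sketch below separates the two concerns: a mailbox whose `Post` returns nothing and may silently drop, and a timeout-based monitor that lives outside the message path. `OneWayMailbox<T>` and `LivenessMonitor` are hypothetical names, the drop-when-full policy is an assumption, and the code is single-threaded for brevity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: the sender posts and moves on.
public sealed class OneWayMailbox<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly int _capacity;

    public OneWayMailbox(int capacity) { _capacity = capacity; }

    public int Dropped { get; private set; }

    // At most once: no return value, no ack, no retry.
    // Assumption: a full mailbox simply drops the message.
    public void Post(T message)
    {
        if (_queue.Count >= _capacity) { Dropped++; return; }
        _queue.Enqueue(message);
    }

    // Receiver side: take the next message if one arrived.
    public bool TryReceive(out T message)
    {
        if (_queue.Count > 0) { message = _queue.Dequeue(); return true; }
        message = default(T);
        return false;
    }
}

// Failure detection is separate from messaging: a target is suspect when
// nothing has been heard from it within the timeout.
public sealed class LivenessMonitor
{
    private readonly Dictionary<string, DateTime> _lastSeen = new Dictionary<string, DateTime>();
    private readonly TimeSpan _timeout;

    public LivenessMonitor(TimeSpan timeout) { _timeout = timeout; }

    public void Heartbeat(string id, DateTime now) => _lastSeen[id] = now;

    public bool IsSuspect(string id, DateTime now)
        => !_lastSeen.TryGetValue(id, out var seen) || now - seen > _timeout;
}
```

The caller passes the current time in explicitly, which keeps the monitor deterministic and easy to test.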

> The below is just my opinion, take with large grain of salt :)

I fully agree with you on this, especially if you define an RPC as a remote method call which returns the result of the operation. This is the reason I think we need the one-way messages I defined.

> Even though RPCs are well suited for discrete actions between entities, while considering distributing work for performance reasons (not for scaling) the granularity of the work being distributed should be considered.

Again, I fully agree. Basically, if I want to solve the problem of a huge world with many GameObjects (a.k.a. entities), the biggest issue is deciding which entities can talk to which locally, and how to minimize over-the-network calls, for two reasons. The hardest part of this job is the algorithm which handles the simulation. Effectively you are doing lots of small operations: A attacked B with attack C and damaged B by an amount D, or A moved from point p1 to p2.

To handle that effectively, the engine should be able to gather related information together locally so it can run such operations fast, and also send the latest state to interested entities and clients without hitting the network much. Scaling this is hard, and that is the reason not many have done it (EVE Online is the only popular example I am aware of, and partially GW2 as well).

Writing down my thoughts on how to solve this without prototyping them is mostly useless, but I think a property-based game engine, where each property of an entity is handled by a specific grain, is the way to go. For each part of a zone, all damage properties are handled by a limited set of grains, and those grains can do the job fast.

In the specific case of real-time discrete multi-agent simulations, scalability at the cost of latency is only acceptable up to a certain point, which makes the problem harder. To me, the way to fix it and still be able to scale is to run gameplay logic in parallel and asynchronously. Until now, all of the attempts I've seen execute the logic in a zone fully single threaded, so the loop updates all entities in a frame sequentially. Doing this in parallel makes it much harder to write gameplay logic (I have been writing games for more than 8 years and am fully aware of this), but I think it should be possible to do a good job of it. Solving it in a game-specific manner is, again, easy, but I'm interested in building the general-purpose engine technology and less in building games with it, which again makes it harder :D

ashkan-saeedi-mazdeh commented 7 years ago

@galvesribeiro Streams or one-way grain calls? Grain calls can be more efficient in certain scenarios, such as those where you are only dealing with a single subscriber.

@shanegrueling I disagree, firstly because you should not design it in a way that creates hot grains like that in the first place, and secondly because lowering the number of entities in a zone can only go so far.

Zones are not fully disconnected. They have to send objects to each other when those objects cross boundaries, and they have to be aware of entities on their boundaries. If you make them too small, you will not only saturate bandwidth but also incur lots of additional calls due to objects moving from zone to zone, proxies, and other inefficiencies.

The problem, as I've experienced it, is exactly minimizing this, and a grain's throughput is the least of my concerns.

ashkan-saeedi-mazdeh commented 7 years ago

Batching can be utilized for replicating state to clients and in some other places, since one update per frame is enough. The property-based design is good exactly because all, say, damage properties are in the same place, and the code dealing with them always has the latest values in its zone. A property can also be a hybrid value and doesn't have to be damage only.

shanegrueling commented 7 years ago

@ashkan-saeedi-mazdeh Is the property-based design you talk about the same as component-based design? It sounds like it to me, and of course something like that would be the only smart decision. It wouldn't change the fact that you still need some grain that represents the players' place in the world, which I called the zone grain. Most of the events will occur on that grain. What you call it doesn't matter. As far as I can see, it will be a hot grain.

And that zones need to swap their data between each other is true too, but it shouldn't be a problem here. Most of the data will be regarding movement, or do you see that differently?

What I don't see is why the zone should care if the clients get the current state. It's not like it would send its state again. ^^

ashkan-saeedi-mazdeh commented 7 years ago

@shanegrueling Well, yes, component-based design if you will. People call it different things. A GameObject > Component (entity-based) model like Unity's doesn't work in this scenario.

About my reasoning on other stuff, the zone grain only needs object position updates and 10 times per minute for that would be more than enough even for fast games, just for moving objects around.

> And that zones need to swap their data between each other is true too, but it shouldn't be a problem here. Most of the data will be regarding movement, or do you see that differently?

Of course it is. If you don't monitor this closely, you'll lose all the benefits of distributing the server, period.

> What I don't see is why the zone should care if the clients get the current state. It's not like it would send its state again. ^^

Depends on what you mean by care.

Anyway, the point of the issue was to get the green light from the core team to work on the messages if I get the time to do so.

Do I have the green light, @jason-bragg? Not that I am sure I'll have the time, but hopefully during Christmas I should have time to at least play with it and get to know the Orleans codebase.

jason-bragg commented 7 years ago

@ashkan-saeedi-mazdeh, I am of the opinion that one-way messages would be a good addition to Orleans, though I would like @sergeybykov (and, if interested, @gabikliot) to chime in on this, as they're more familiar with the history of the programming model and why one-way messages weren't originally included.

We don't really have a greenlighting process here, but for new features like this it's good to be clear on what is being suggested prior to spending time developing it. More than once, contributors have developed significant features, submitted a PR, and used the code review process to review specification, design, and implementation all at once. As you can imagine, this tends to be messy and takes up a lot of time.

As this thread has fragmented (and I apologize for my part in that), I'd suggest closing this thread (or leaving it open for further fragmented discussions, up to you) and creating another issue with a clear and detailed proposal of the one-way message behavior you wish to add. Much like the explanation you provided earlier, but a bit more precise and without referring to Erlang, as 'make it work like this thing over here' is not a good specification. :) For clarity of specification, I'd suggest limiting the proposal to user-level behavior and capabilities, without implementation details. Some code examples of what the user-level code using one-way messages might look like would also be helpful. This will help narrow the discussion to what capabilities and behaviors, at the user level, are being proposed.

jason-bragg commented 7 years ago

As for an MMO that is a radical deviation from the existing zone-based patterns typically used, and, to a degree, utilizing a (virtual) actor model, this is what I was playing with in my head (and only prototyped a little). I want to be clear, I'm not suggesting the below, as it's completely unproven, just sharing some thinking I'd previously done on this subject.

For RPGs one of the largest real-time costs is physics and movement. The server collision model can be much more coarse-grained than the client's and needs to deal only with validating movement (preventing speed cheating and walking through walls), so the cost is less than on the client, but it's still a high cost due to the number of players. To address this I'd envisioned a set of physics servers that can scale, each receiving the movement messages from a subset of the player base. These movement messages are validated, and only corrections (uncommon, unless there are mass network issues or cheating) are sent back to clients. These services would also keep the player location in a spatial DB (quad tree, oct-tree, ??) and periodically (1/sec?) update server-side player proximity, so combat services and other game features can know what other entities (players, NPCs) are relevant to the player. When entities become relevant to a player, the player can interact with them (make grain calls?) and subscribe to their state updates. State changes of the player and relevant entities can be replicated to the client in a continuous flow of data.
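A minimal sketch of the movement validation step described above, assuming a simple 2D speed check only. Wall collision is left out, and `MovementValidator`, its parameters, and the 5% tolerance are all hypothetical, not anything that was prototyped in this thread.

```csharp
using System;
using System.Numerics;

// Hypothetical server-side check: accept a move only if the implied speed
// stays within the allowed maximum, plus a small tolerance for jitter.
public static class MovementValidator
{
    public static bool IsPlausible(Vector2 previous, Vector2 next,
                                   float elapsedSeconds, float maxSpeed,
                                   float tolerance = 0.05f)
    {
        if (elapsedSeconds <= 0f) return false; // no time passed: reject
        float distance = Vector2.Distance(previous, next);
        float allowed = maxSpeed * elapsedSeconds * (1f + tolerance);
        return distance <= allowed;
    }

    // Returns the position the server keeps: the new one if valid, otherwise
    // the old one, which is the correction that would go back to the client.
    public static Vector2 Resolve(Vector2 previous, Vector2 next,
                                  float elapsedSeconds, float maxSpeed)
        => IsPlausible(previous, next, elapsedSeconds, maxSpeed) ? next : previous;
}
```

Because valid moves produce no reply, the common case costs only the inbound message, which matches the idea that corrections should be uncommon.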

So a game client would send a continuous flow of movement messages to the server, receive a continuous flow of world state changes or position corrections (from the physics services), and trigger discrete actions (eventually grain calls) against entities in its awareness.

It should be noted that while the physics services will update the player's list of relevant entities, this does not mean that it is the only system that does so. For instance, if a player formed some sort of party or group with other players, those players may be added to the player's set of relevant entities even though they are not near each other.

It should also be noted that two players near each other in the game world may be sending position information to two different physics services, so there will need to be some form of coarse-grained aggregation of the physics services' spatial databases to build the complete list of relevant entities for a player. I think this can be accomplished with 'zone' grains that maintain a coarse-grained spatial database for a defined section of the world (with a small overlap with other such grains for entities on the boundaries).
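One possible shape for such a coarse-grained spatial database is sketched below. It uses a uniform grid rather than the quad tree or oct-tree mentioned earlier, purely to keep the example short, and `CoarseGrid` is a hypothetical name. Checking the eight neighbouring cells plays the role of the boundary overlap.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical coarse spatial index over a uniform 2D grid.
public sealed class CoarseGrid
{
    private readonly float _cellSize;
    private readonly Dictionary<(int, int), HashSet<string>> _cells =
        new Dictionary<(int, int), HashSet<string>>();
    private readonly Dictionary<string, (int, int)> _where =
        new Dictionary<string, (int, int)>();

    public CoarseGrid(float cellSize)
    {
        if (cellSize <= 0f) throw new ArgumentOutOfRangeException(nameof(cellSize));
        _cellSize = cellSize;
    }

    private (int, int) CellOf(float x, float y)
        => ((int)Math.Floor(x / _cellSize), (int)Math.Floor(y / _cellSize));

    // Insert or move an entity; meant to run at the coarse update rate,
    // not once per movement message.
    public void Update(string id, float x, float y)
    {
        var cell = CellOf(x, y);
        if (_where.TryGetValue(id, out var old))
        {
            if (old.Equals(cell)) return;   // still in the same cell
            _cells[old].Remove(id);
        }
        if (!_cells.TryGetValue(cell, out var members))
            _cells[cell] = members = new HashSet<string>();
        members.Add(id);
        _where[id] = cell;
    }

    // Entities in the same cell or any of the eight neighbouring cells.
    public List<string> Relevant(string id)
    {
        var result = new List<string>();
        if (!_where.TryGetValue(id, out var c)) return result;
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                if (_cells.TryGetValue((c.Item1 + dx, c.Item2 + dy), out var members))
                    foreach (var other in members)
                        if (other != id) result.Add(other);
        return result;
    }
}
```

Entities added through other systems, such as party members who are far apart, would need a separate set merged with the result of `Relevant`.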

I was prototyping the physics service in C++ when I started at 343 four years ago, but the learning curve when I joined 343 was sufficient that I put this on hold and never got back to it.

My hope was that this architecture would provide a very scalable MMO framework that is very responsive to the real-time aspects of the game. It also allows AI (pathing, spawning, ..) to affect the player's experience only to the degree that a player is interacting with those systems. This allows for much less resource-constrained AI, including world-level, data-driven shifts in NPC behaviors. This loose coupling of AI also allows the framework to suit a wide range of game types (Eve to PlanetSide), as the gameplay logic is separated from the real-time physics simulations.

ashkan-saeedi-mazdeh commented 7 years ago

@jason-bragg

> We don't really have a greenlighting process here, but for new features like this it's good to be clear on what is being suggested prior to spending time developing it. More than once, contributors have developed significant features, submitted a PR, and used the code review process to review specification, design, and implementation all at once. As you can imagine, this tends to be messy and takes up a lot of time.

I'll close this issue (Feel free to open if you have any thoughts to put here) and I will open another one with clear specifications as I am not a fan of taking more of your time than I have to. Will do it with clear user code examples and semantics of the messages and how I think it relates to Orleans programming model. I consider this a feature for the specific cases only anyways.

> As for an MMO that is a radical deviation from the existing zone-based patterns typically used, and, to a degree, utilizing a (virtual) actor model, this is what I was playing with in my head (and only prototyped a little). I want to be clear, I'm not suggesting the below, as it's completely unproven, just sharing some thinking I'd previously done on this subject.

:) Maybe I should join 343 Industries!

What you are describing is very similar to what I thought. I'm not sure if the physics server should be in Orleans at all and I guess maybe a C++ based fast one is the best option with a fully deterministic integer based calculation engine.

What the big world engine does, and what we did in PikkoServer with headless Unity instances, was synchronous zone servers written in normal game engines, which didn't work well at all. I linked to Pikko from #2482: http://developer.muchdifferent.com/unitypark/PikkoServer

I guess limits can be pushed and we can do better. EVE Online uses Stackless Python for zone logic, but I am not sure whether they run game logic in parallel; I guess they might, partially. I've worked a bit with one of CCP's former engineers, who thinks async zone servers with good management of AOI and async physics should be the only way to do it. Let's see how it goes. I highly enjoyed talking about this with someone who understands it very well. Thank you for the discussion.