Use Orleans as a MMORPG server

yindongfei commented 7 years ago

Hi,

I want to use Orleans as an MMORPG server backend. I have read the documents and tutorials on the website, and there are two things I need help with:

1. Performance. Someone said Orleans handles 10K calls per second. I tested that on my computer and it sometimes exceeds 10K, which is great. But the CPU usage is very high, nearly 40%~50%; I have the same problem described in #2374 (i7 6700K, 4 GHz).
2. Can an Orleans silo load DLLs while running, or update a DLL while running? What I want is to apply bug fixes without shutting the silo down.

thanks.

xiaolige945 commented 7 years ago

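Orleans is entirely suitable for building an MMORPG; performance problems can be solved by scaling out VMs linearly. You can implement BigWorld's whole MMORPG approach with Orleans.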

xiaolige945 commented 7 years ago

http://bigworldtech.com/en/technology/bigworld-server/

yindongfei commented 7 years ago

@xiaolige945 Of course we can, but it is too time-consuming to write a BigWorld server again using Orleans. We have our own MMORPG server and I just want to refactor it with Orleans for more scalability.

xiaolige945 commented 7 years ago

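Understood. Refactoring an existing server engine with Orleans is a fairly large project, let alone rewriting BigWorld ^^. The main thing to watch is AoI (area of interest), since grains have certain performance limits.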

ashkan-saeedi-mazdeh commented 7 years ago

@yindongfei
1- Orleans CPU usage is OK and there are nightly tests for it. I cannot comment on what the issue is in your specific case without knowing much more detail. If you look at the referenced issue, you will see that Orleans is optimized for throughput and will work perfectly well with more traffic; this is what you saw when you had multiple grains. Generally you should avoid grains that are hot enough to receive 20K requests per second.

2- No, in .NET in general you cannot hot-load code without losing all instances of the classes in memory. If you only want to reload assemblies and are OK with grains failing, I guess that is not supported at the moment either, but I am not fully sure. You can, however, achieve zero downtime with rolling upgrades.

jason-bragg commented 7 years ago

@yindongfei, Performance - Orleans networking, while optimized, is fundamentally an RPC system, and exhibits all of the performance issues general RPC systems tend to have. The convenience comes at a cost. The overhead may be further reduced by custom serialization logic and tinkering with the network layer tuning knobs, but what you are seeing is not unexpected. As @gabikliot suggested in #2374 you should not trust load test results against a single machine. Orleans is optimized for cluster performance.

Dynamic loading - To my knowledge, Orleans has no support for this. This would have to be addressed at the application layer.

I suspect Orleans would be a good fit for quite a few MMORPGs, but for large, 3D, zone-based, performance-critical MMORPGs like World of Warcraft, Eve Online, Final Fantasy, ..., Orleans may not suffice. Depending on requirements and the game engine you are using, you may want to consider separating the game simulation (client and server) from supporting services (lobby, store, persistence, customer support, ...), and use Orleans in the supporting services, but not in the performance-critical data pipe between game client and game server.

yindongfei commented 7 years ago

@ashkan-saeedi-mazdeh 20K requests per second is quite enough for our usage. My problem is during development: almost all developers and designers start servers on their own machines for quick tests, and high CPU usage slows those machines down for other work such as Excel or Visual Studio. Maybe being able to configure the spin-lock behavior is an option in this case.

@jason-bragg we can do dynamic loading in application layer as we are doing now.

Scene management is the most performance-critical part of our MMO game server (3D, zone-based, not as large as WoW, just several small scenes). Move, damage, and skill are the top three packet types, at about 10 requests per second per user (after heavy optimization). So a few hundred players in a scene is fine for one grain acting as the scene (calculated against 10K requests per second). When a message such as chat or move is sent to our front end, the front-end server is responsible for broadcasting it to the target clients, so the data pipe is not a hot spot in our case.
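The capacity estimate above can be written out as a quick back-of-the-envelope check. This is only a sketch: the 10K requests per second per grain and 10 requests per second per player are the figures quoted in this thread, not measured guarantees, and `SceneCapacity` is a hypothetical helper, not part of Orleans.

```csharp
using System;

// Hypothetical helper: back-of-the-envelope capacity check for one scene grain.
public static class SceneCapacity
{
    // Maximum players a single scene grain can host if it may spend
    // grainBudgetPerSecond requests per second and each player sends
    // requestsPerPlayerPerSecond requests per second.
    public static int MaxPlayers(int grainBudgetPerSecond, int requestsPerPlayerPerSecond)
    {
        if (requestsPerPlayerPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(requestsPerPlayerPerSecond));
        return grainBudgetPerSecond / requestsPerPlayerPerSecond;
    }

    // Request rate a scene grain sees for a given player count.
    public static int LoadPerSecond(int players, int requestsPerPlayerPerSecond)
    {
        return players * requestsPerPlayerPerSecond;
    }
}
```

With the thread's numbers, `SceneCapacity.MaxPlayers(10000, 10)` gives 1000 players as the upper bound, and a scene of 300 players produces `SceneCapacity.LoadPerSecond(300, 10)`, that is 3000 requests per second, which leaves headroom under that assumed budget.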

We may start the refactoring work next month. Thanks for the great Orleans and its developers.

yindongfei commented 7 years ago

@xiaolige945 AoI is currently processed inside a scene server, and I want to make each scene a grain, so all the AoI logic is processed inside one grain; AI and pathfinding will be processed in other grains locally. The load on a scene grain may not be so heavy, and when the load is heavy, the scene will be split.

Thanks again for your advice.

ashkan-saeedi-mazdeh commented 7 years ago

@yindongfei no problem.

@jason-bragg Hopefully, starting about 4-5 months from now, we would like to tackle the challenge of building an MMO backend using Orleans. I have been on a team that used Erlang on a similar project and also had a similar C#-based system. Could we have a talk to see whether Orleans would fit the task and how much we could change it to make it more suitable? Systems like akka.net, potentially with their one-way messages without RPCs, can be made more performant, but akka.net is not as mature and good as akka, and using Erlang is a hard option due to the steep learning curve for the whole team, so I prefer to go the Orleans route if possible.

For latency, I thought of opening up the placement strategy director and writing zone-aware strategies, plus some logic to move grains between zones based on an octree of grain (object) positions. For networking and message performance, we should see how much of an issue that is, how we can fix it, and whether we even want to. I'm not sure how much of the bottleneck would be the deserialization of the message and execution of the generated code, and how much is the two-way message system. Based on my previous experience, I imagine the two-way message system and its bandwidth implications are the bigger issue. What do you think? Could we have a Skype call or a chat on Gitter about this when you get the time?

@sergeybykov briefly mentioned to me once that he thinks promises are a much better approach in an actor framework (maybe Sergey only meant from a productivity point of view); I would like to know your thoughts on this. My plan was to try fire-and-forget streams for messages with a high amount of traffic and see how that works out, but as part of Microsoft Game Studios you have probably discussed this with your internal teams and have a good idea of the characteristics of a massive real-time world with Orleans (say GW2 or EVE), as you said.

However what I have in mind is a bit different and doesn't lend itself well into distribution like the way EVE's galaxies do. I did some research on this and talked to industry veterans and distributed system experts on this but writing them all down here doesn't make sense probably :)

sergeybykov commented 7 years ago

> @sergeybykov briefly mentioned to me once that he thinks promises are a much better approach in an actor framework (maybe Sergey only meant from a productivity point of view); I would like to know your thoughts on this. My plan was to try fire-and-forget streams for messages with a high amount of traffic and see how that works out, but as part of Microsoft Game Studios you have probably discussed this with your internal teams and have a good idea of the characteristics of a massive real-time world with Orleans (say GW2 or EVE), as you said.

IMO promises are much better than using correlation IDs to correlate responses with requests, especially in cases when some responses never arrive. This helps reasoning and writing simpler, more correct code.
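To illustrate the point about promises versus raw correlation IDs, here is a minimal sketch of how the IDs can be hidden behind `Task`-based promises, including the case where a response never arrives. `PendingRequests`, `Register`, and `Complete` are hypothetical names made up for this example; this is not Orleans code, only one possible shape under those assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: hides correlation IDs behind Task-based promises.
public sealed class PendingRequests<TResponse>
{
    private readonly ConcurrentDictionary<long, TaskCompletionSource<TResponse>> _pending =
        new ConcurrentDictionary<long, TaskCompletionSource<TResponse>>();
    private long _nextId;

    // Registers a new request and returns its correlation ID plus the promise
    // the caller can await. The promise faults if no response arrives in time.
    public (long Id, Task<TResponse> Response) Register(TimeSpan timeout)
    {
        long id = Interlocked.Increment(ref _nextId);
        var tcs = new TaskCompletionSource<TResponse>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[id] = tcs;

        // Fault requests whose response never arrives. The timer simply finds
        // nothing to remove if the response came first.
        Task.Delay(timeout).ContinueWith(_ =>
        {
            if (_pending.TryRemove(id, out var expired))
                expired.TrySetException(new TimeoutException("No response for request " + id));
        });

        return (id, tcs.Task);
    }

    // Called by the network layer when a response carrying a correlation ID
    // arrives. Returns false for unknown or already timed-out IDs.
    public bool Complete(long id, TResponse response)
    {
        if (_pending.TryRemove(id, out var tcs))
            return tcs.TrySetResult(response);
        return false;
    }
}
```

The caller writes the ID into the outgoing message and awaits the returned task; only the transport code ever sees the ID, and a lost response surfaces as a `TimeoutException` instead of a dictionary entry that lives forever.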

In the streaming APIs we used Tasks for non-fire-and-forget methods to help surface failures and to give a simple back pressure mechanism.

ashkan-saeedi-mazdeh commented 7 years ago

@sergeybykov In the Unity client SDK of my backend I used IDs, because Unity runs an old, .NET 3.5-ish Mono and there are no Tasks, but on that front I agree with you 100%.

I always used a class partially like this


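using System.Collections;

// Assumed states for a request; the original snippet did not show this enum.
enum Status { Working, Completed, Failed }

class Request<T>
{
    public T Value { get; set; }
    public Status CompletionStatus = Status.Working;

    // This is a Unity coroutine, kind of like await: it yields once per frame
    // until the request has completed or failed.
    public IEnumerator WaitUntilDone()
    {
        while (CompletionStatus == Status.Working)
            yield return null;
    }

    public void Fail() { CompletionStatus = Status.Failed; }
    public void Complete(T result) { Value = result; CompletionStatus = Status.Completed; }
}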

Together with a RequestManager and IDs, this creates promises. Anyone who thinks otherwise is either working in a code path where allocating these promises is too expensive (which I think pooling can eliminate entirely), or is a lone mad scientist in a lab who does not deal with real-world code, or is so used to the hard way that they think everyone will be OK with it.

My question is: even perf-wise, are there any advantages to using correlation IDs?

sergeybykov commented 7 years ago

Another design pattern to consider is using Orleans as a 'control plane' for such a scenario with each session grain starting and controlling an external process. The session process is free to perform whatever 'data plane' communication it needs to do, e.g. high frequency UDP messaging. The grain can be used for sending less frequent commands to the session, and can communicate with the session process over an IPC mechanism.

Another potential benefit of this approach is that when the session completes, its process gets terminated and is certain to release all system resources it might have allocated. This ensures there is no leakage of resources over time, even if a bunch of libraries of varying quality are used by the process. We've seen this pattern used in one project for loading a 25MB blob of various native code libraries that no one could guarantee would behave correctly inside a long-running managed silo process.

ashkan-saeedi-mazdeh commented 7 years ago

@sergeybykov It is possible; our company back then did this with Erlang and even broke a world record, but it has its own drawbacks too. Especially if that process is single-threaded, you are potentially more limited than in the case where everything is implemented asynchronously in Orleans. But the effectiveness of the approach is highly dependent on the use case, of course.

What we did was this

The world was divided into zones, each of which was a Unity or Unreal Engine process running single-threaded game logic. 12 times per minute all object positions were synced to the Erlang-based server, and the server could decide to move an object from one game logic server to another, instructing the Unity/UE processes to serialize the object, send it to the other process as a message, and transfer ownership of it too. The main issues were the CPU and network consumption of the game server processes, and bandwidth saturation between them when there were many fast-moving objects.
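The handoff decision described above can be sketched roughly as follows. This is a simplified illustration and not the Erlang system itself: it assumes a square grid of equally sized zones, and `ZoneDirector`, `Transfer`, and their members are hypothetical names invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical instruction sent to the game-logic processes.
public struct Transfer
{
    public int ObjectId;
    public int FromZone;
    public int ToZone;
}

// Hypothetical coordinator: tracks which zone process owns each object and
// decides handoffs when a position sync shows an object crossed a zone border.
public sealed class ZoneDirector
{
    private readonly float _zoneSize;
    private readonly int _zonesPerRow;
    private readonly Dictionary<int, int> _ownerByObject = new Dictionary<int, int>();

    public ZoneDirector(float zoneSize, int zonesPerRow)
    {
        _zoneSize = zoneSize;
        _zonesPerRow = zonesPerRow;
    }

    // Maps a world position to a zone index on a square grid, clamping at the edges.
    public int ZoneAt(float x, float z)
    {
        int col = Clamp((int)Math.Floor(x / _zoneSize));
        int row = Clamp((int)Math.Floor(z / _zoneSize));
        return row * _zonesPerRow + col;
    }

    // Handles one synced position. Returns a transfer when ownership must move,
    // or null when the object is new or still inside its current zone.
    public Transfer? Sync(int objectId, float x, float z)
    {
        int target = ZoneAt(x, z);
        int current;
        bool known = _ownerByObject.TryGetValue(objectId, out current);
        _ownerByObject[objectId] = target;
        if (!known || current == target)
            return null;
        return new Transfer { ObjectId = objectId, FromZone = current, ToZone = target };
    }

    private int Clamp(int index)
    {
        return Math.Max(0, Math.Min(_zonesPerRow - 1, index));
    }
}
```

Each periodic sync calls `Sync` for every reported object; a non-null result is the point at which the coordinator would tell the source process to serialize the object and send it to the target process. A real system would also need hysteresis near borders so that objects do not bounce between zones on every sync.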

The problem gets more complex if you consider that each game server should also know about the objects that live near its borders in other instances, and hold proxies for them. I think that with async distributed game logic, and by minimizing global messages and reliable messages (at least from clients), the limits can be pushed. Square Enix tried an approach similar to what you suggest, and the project got shut down for either technical or business reasons; it was called Shinra. I guess BigWorld Technologies tried something similar too.

We tried with PikkoServer http://developer.muchdifferent.com/unitypark/PikkoServer

https://www.google.com/search?site=&source=hp&q=pikkoserver+&oq=pikkoserver+&gs_l=hp.3..0i30k1l4j0i5i30k1.1581.3867.0.4468.13.11.0.0.0.0.348.1742.2-6j1.7.0....0...1c.1.64.hp..6.6.1533.0..0j0i131k1.Sk3B2h4FMIE

Potentially the limits can be pushed with game logic written in an async distributed manner in Erlang/Orleans, at the expense of the productivity of the average gameplay programmer.

yindongfei commented 7 years ago

@ashkan-saeedi-mazdeh we use

IEnumerator WaitUntilDone()

in our Unity client, exactly the same as you :)

@sergeybykov Another game we built, similar to Clash Royale, uses the same design pattern you described: we start a process for a single battle, and the process is closed when the battle ends, so crashes and memory leaks are not as serious a problem as usual. But the 'control plane' is not implemented with Orleans.

@ashkan-saeedi-mazdeh

> 1- Orleans CPU usage is OK and there are nightly tests for it. I cannot comment on what the issue is in your specific case without knowing much more detail. If you look at the referenced issue, you will see that Orleans is optimized for throughput and will work perfectly well with more traffic; this is what you saw when you had multiple grains. Generally you should avoid grains that are hot enough to receive 20K requests per second.

What kind of machine can achieve 20K requests per second for a single reentrant grain, or for several identical grains? Or how many machines does a cluster need to achieve 20K requests per second?

ashkan-saeedi-mazdeh commented 7 years ago

@yindongfei Well for grain request counts per second, @sergeybykov will be the best source of truth.

In general you should try to avoid hot grains as much as possible. If your grains are stateless, you can autoscale by using StatelessWorker grains. Beyond that, reentrant grains have higher performance than normal ones, but be careful not to shoot yourself in the foot with them. In specific scenarios, sending messages to a grain through fire-and-forget SMS streams might help as well, but you should do this only if request and response processing is a considerable share of the overhead of what you are doing. Batch requests together when you can. Other than these, just try to partition your problem and your grains so that each grain handles fewer messages and you can scale better.
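As an illustration of the batching advice, here is a minimal sketch of a buffer that groups small messages so the receiver gets one call per batch instead of one call per message. `RequestBatcher` is a hypothetical helper, not an Orleans type, and the `send` callback stands in for whatever grain call you would make with the batch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects small messages and hands them over in batches.
// Not thread-safe; it assumes a single caller at a time.
public sealed class RequestBatcher<T>
{
    private readonly int _maxBatchSize;
    private readonly Action<IReadOnlyList<T>> _send;
    private List<T> _buffer = new List<T>();

    public RequestBatcher(int maxBatchSize, Action<IReadOnlyList<T>> send)
    {
        if (maxBatchSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxBatchSize));
        _maxBatchSize = maxBatchSize;
        _send = send ?? throw new ArgumentNullException(nameof(send));
    }

    // Queues one message and flushes automatically once the batch is full.
    public void Add(T message)
    {
        _buffer.Add(message);
        if (_buffer.Count >= _maxBatchSize)
            Flush();
    }

    // Sends whatever is buffered. Call this from a timer as well, so that a
    // partially filled batch does not wait indefinitely.
    public void Flush()
    {
        if (_buffer.Count == 0) return;
        var batch = _buffer;
        _buffer = new List<T>();
        _send(batch);
    }
}
```

The trade-off is latency: a message may wait until the batch fills or the timer fires, so the batch size and flush interval have to be chosen against how stale a move or skill message is allowed to be.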

These are just my 2 cents but I've only used Orleans for a few months and have some prev experience with Erlang so might not be the best one to answer.

yindongfei commented 7 years ago

@ashkan-saeedi-mazdeh Thank you for your reply. Batching requests makes sense to me and I can try testing it.

@sergeybykov Could you please tell me how to get grain request counts per second correctly? In our scenario, a scene is a grain containing about 100 players, and all the moving and skill handling will be processed in this grain, so a scene grain will receive over 1K requests per second. AI and pathfinding will run in separate StatelessWorker grains.

sergeybykov commented 7 years ago

> @sergeybykov could you please tell me how to get grain request counts per second correctly

Sorry, I'm not sure I understand the question. Are you asking how to measure it correctly or how to design your app structure to get reasonable numbers?

dotnet / orleans

Use Orleans as a MMORPG server #2482