Zero Allocation Networking

martindevans commented 5 years ago

I'm currently working on adding a Mirror+Ignorance network backend to Dissonance Voice Chat. Dissonance itself has been written to write into preallocated and recycled byte[] buffers to ensure that it doesn't allocate anything per packet (voice packets happen at a fairly high rate, so this is important).

However I'm having some difficulty actually sending these byte buffers through Mirror without allocating something somewhere!

NetworkClient:Send(short msgType, MessageBase msg) requires me to allocate a MessageBase object.
Using NetworkConnection:TransportSend(int channelId, byte[] bytes, out byte error) is also no good, I can't specify the length without copying my packet into another array of the exact size.

I also tried to pack my data into a byte[] array formatted correctly for TransportSend and this requires more allocations:

byte[] PackMessage(ushort msgType, byte[] content) requires me to pass in a byte array of precisely the right size (one allocation+copy) and then it internally allocates a NetworkWriter containing a BinaryWriter around a MemoryStream!

would you be interested in a PR which added the following:

`NetworkClient:Send(short msgType, ArraySegment<byte> data)` & `NetworkConnection:Send(int channelId, ArraySegment<byte> data, out byte error)`

This will insert the protocol bytes into the segment if there is space available. May involve a copy to shift the offset a sufficient amount, or an allocation if there is insufficient space.

`NetworkConnection:TransportSend(int channelId, ArraySegment<byte> data, out byte error)`

This will call the new ServerSend or ClientSend with these arguments. Exactly the same logic as the current TransportSend, just with new types.

`Transport:ServerSend(int connectionId, int channelId, ArraySegment<byte>)` & `Transport:ClientSend(int connectionId, int channelId, ArraySegment<byte>)`

These would be implemented as virtual methods which call the normal ServerSend/ClientSend, this involves a single allocation and a copy. Implementing them as virtual ensures this does not break everything which extends Transport. Network transports can override these new send methods to do the sending with zero copies if they support it - I've already submitted a PR to ignorance which shows what this would look like: https://github.com/SoftwareGuy/Ignorance/pull/16/files

miwarnec commented 5 years ago

are you running into serious bottlenecks because of those allocations?

original UNET tried very hard to be allocation free, which was one of their internal requirements when writing code. original UNET was able to handle around 20-50 CCU on a good day.

with Mirror we focused on simple code (we have half the LOC than UNET had), and high level / big O optimizations instead. Mirror isn't allocation free. Mirror can handle 500+ CCU.

martindevans commented 5 years ago

I must admit haven't profiled it as causing problems - I'm writing this for release to all Dissonance customers, so I can't profile it on all possible devices and I'd like to make sure I don't cause any problems in the first place! The whole of Dissonance is written with zero allocations as a main design goal so it's just kind of my default way of doing things.

If you'd rather keep Mirror simple that's fair enough, that's why I asked ahead instead of submitting a PR out of the blue. If you don't want me to make the PR for these changes go ahead and close this issue :)

GabeBigBoxVR commented 5 years ago

Hey @vis2k, we're making a product using Mirror for Android and we'd like to avoid allocations where possible. If we can overload the API to avoid allocations and keep simplicity, that would make this far more useful on Android. Right now it allocates too much for our particular app since we're very very frame rate sensitive in VR. We need to hit close to 72 fps and these allocations are too much for our project. :( We'd love to make these changes, keep the simplicity and hit our framerate goals if possible. The added benefit is that all other mobile projects will get a nice speed boost without exposing complexity.

paulpach commented 5 years ago

We don't aim to be allocation free. We aim for simplicity and performance.

When I say performance, I mean profile it and show a problem. If something does not show in the profiler it is not worth optimizing.

The way to go about this is to open a pull request, and show the profiler before and after for a realistic scenario.

I also suggest you take a look at the 2018 branch. I have already profiled and fixed a lot of allocations that were slowing down my mobile game. See #299 for some benchmarks towards the end.

c6burns commented 5 years ago

It should be fairly trivial to show what a GC spike looks like when these allocations add up. Someone invested in this with a real implementation should do so, as opposed to a more contrived example I would create. I know what GC does at a low level (your entire app will be basically paused during collection) but I am interested to see the effects in real world apps.

Buuuut ... I'm not a big fan of slapping an ArraySegment over this as proposed. It's been discussed before, and it's not the worst idea, but in the end it's just trading one concrete implementation for another. And I'm not just dreaming of abstract perfection, I mean that neither Array nor ArraySegment will work with my transport without forcing copies (and thus allocations) between the native and managed heap.

IMHO the right long term move is an interface, so we are no longer limited to a specific implementation and also do not break the API down the road with implementation switches. A default implementation could be provided using whatever is felt to be the general case, and could support Array and ArraySegment.

Also, I know this is asking a lot from a potential PR, but it would be very helpful to have tests in place before switching something as core as this. There is a lot of code pinned on these read/write ops and it would be nice, especially for long term maintainers, to have some level of certainly it's not getting blown up. There are already some tests, and more wouldn't hurt. Especially nice would be a test that network message binary compat hasn't been broken with older Mirror versions (unsure if that exists yet)

Also also ... I think the referenced PR is glossing over a huge part of the problem, which doesn't exist at all at the transport layer. For example when reading messages (unless I'm mistaken) the NetworkReader gets the incoming messages as byte[] and constructs a new BinaryReader and MemoryStream. In fact, entire NetworkReader/Writers are allocated and left to GC. Serious love and collaboration with existing maintainers will have to go into any potential solution which aims for zero allocation.

Just my 2 cents

miwarnec commented 5 years ago

@GabeBigBoxVR did you try unity's new incremental garbage collector yet? that might solve your problem.

martindevans commented 5 years ago

I'm not a big fan of slapping an ArraySegment over this as proposed. It's been discussed before

Interesting, could you link to any previous discussion on it? I couldn't find any with a quick search in this repo.

neither Array nor ArraySegment will work with my transport without forcing copies (and thus allocations) between the native and managed heap.

I don't agree. If a transport wants to implement sending by directly copying the contents of an ArraySegment into a socket without any allocations that's perfectly possible. Even if the transport does copy onto the native heap I don't think it's much of a problem - a single native heap allocation is often slower than a managed allocation, but you never incur the huge GC collection delay down the line. That's all besides the point though - the point of this PR is reducing allocations before it gets into the transport layer.

For a concrete example of this I already have this very simple PR open with Ignorance which directly used the data inside the array segment and passes a pointer to it down into ENet - I tracked the execution of that all the way down to the ENet FFI call and as far as I can see there isn't a single GC allocation.

I'm not just dreaming of abstract perfection

Span<byte> 🤤

IMHO the right long term move is an interface ... switching something as core as this

That's possible I guess but it vastly expands the scope of the changes. My proposal here is to add 3 new methods which would be very simple to implement (basically s/byte[].Length/ArraySegment.Count), it wouldn't involve a breaking change the public API for users of Mirror or implementors of transport layers.

glossing over a huge part of the problem

I can only fix one thing at a time in a single PR ;)

If this gets merged in and I still have time to work on the problem I will probably move on to trying to clean up receiving allocations where possible.

martindevans commented 5 years ago

I've submitted #404 to clarify exactly what API changes I'm proposing here :)

c6burns commented 5 years ago

Span<byte> 🤤

Not to get too sidetracked, but I only recently started with C#. To be clear you are referring to the ref struct implementation that only exists in .NET Framework as a port from .Net Core 2.1 where it was introduced. The portable version is not the same as the one in System.Private.CoreLib, and if you look into it you will find its performance characteristics considerably poorer. I find it about 4-5x slower than an array once you need to index into it. Even if in some miracle world Unity was using .NET Core it still doesn't get me anywhere to use Span<>. There's an assumption here somewhere about where and why my memcpys are being forced, and that all network messages will always arrive and be sent through the transport in contiguous memory. But these are not everyone's or even most people's problems, I suppose. I see them only as hurdles to more tightly integrating with lower level networking libraries.

That's possible I guess but it vastly expands the scope of the changes

Totally, I get it. Hard work is hard and there are more exciting things to do. So say we all. I don't wanna make an interface in there either, man. It's way easier to just expose more API surface to handle more apt concrete implementations. And it's better than nothing.

I'm actually not opposed to your changes, though my opinion on this means very little anyway. Something is always better than nothing, and if they reduce allocation and they are helpful for others then that is great. *Updated cuz I bad at quoting xD

SoftwareGuy commented 5 years ago

I don't think we should reject this PR as from what I'm reading it's more so a case of both sides of the table have equal pros and cons, is there any killer feature or reason that would satisfy both parties?.

I do agree we need some benchmarks and/or test results, even if we use the Pong example from Mirror.

c6burns commented 5 years ago

The only real con is that it increases the API surface. That's not bad, and I admit that I knee jerk reacted to this because I want a nice, neat interface and utterly minimal API surface. A good plan today is better than a great plan tomorrow, and I agree this is a good plan today

martindevans commented 5 years ago

I do agree we need some benchmarks and/or test results, even if we use the Pong example from Mirror.

I'll be happy to put together a benchmark, but what exactly do you want it to demonstrate? Would you be happy with a benchmark which shows a reduction in total allocations-per-second, or perhaps an increased time-between-collections?

Demonstrating an actual speedup in the collection pass itself will be quite difficult because there are two factors at play here - collections take a certain amount of time (based on heap complexity) and happen at a certain frequency (based on allocation rate). The point of these changes is to reduce frequency of collections, not duration of collections when they do eventually happen.

I guess in case anyone isn't entirely sure of my motivations for worrying about GC, here's an overview: every allocation pushes you closer to an inevitable GC collection which will happen eventually (unless your entire game is truly zero GC alloc, basically impossible in Unity), when it does happen there are two possibilities:

Your game is quite simple, it does not need anything close to the frame time budget (15ms for 60FPS, 11ms for 90FPS). The GC collection causes no problem because the extra time it adds to the frame isn't a problem. This is going to be any simple benchmark game.
The game is quite complex, it needs close to the frame time budget. The GC collections pushes you just over that edge and you drop one or more frames.

The reason this makes me worry (and it's hard to profile) is that I could profile the demo scene (e.g. ping pong) all day as it made heavy allocations and not see any problems, the collection passes would take under a millisecond and the game itself doesn't need even close to the remaining 15ms to render the frame. However once you move into a real game scenario the heap is much more complex (collections take longer), there are many more allocations happening in the game itself (collections happen more frequently) and the game needs more time to render a frame (spare time budget per frame is smaller). So a library such as Dissonance or Mirror allocating lots could push GC over the edge of causing problems even though it's never caused an issue in a simple benchmark scene. Even worse the problems would not be easy to track down to the library in the profiler because the large obvious delay (GC.Alloc) happens on whichever random allocation triggered the collection pass.

Hopefully this answers a few questions people have alluded to in various comments:

Why am I not worried about stack allocations (e.g. new ArraySegment)? It doesn't involve the GC.
Why am I not worried about unmanaged/native heap allocations? A single one is slower to allocate, but it never triggers a GC so amortised cost is similar (but more predictable).
What about extra copying? Copying is crazy fast, I would gladly trade in ~10 extra copies of every packet for a single reduced GC allocation.
What problem does excessive GC allocation cause? No problem at all until you get close to the frame time limit, and then it causes random frame skips (possibly multiple frames dropped at once).

Span<byte> ... to be clear you are referring to the ref struct implementation that only exists in .NET Framework as a port from .Net Core 2.1

Getting wildly off track here, but yes I'm referring to the new memory primitives introduced in dotnet Core mostly for aspcore perf (Span<> and Memory<>). Read about them here and here if you're unfamiliar with the details. I've no idea about the details of how it was ported into System.Private.CoreLib for Unity or the performance characteristics of that.

TL;DR If I put together a PR including a benchmark showing increased time between allocations will that (likely) be accepted?

uweeby commented 5 years ago

Is it possible to get a new demo game for testing that can have configurable scale? 1 non player AI that is walking/turning/syncing at a known rate. then at the press of a button that can be increased to 10/100/200. That could be used as the standard for stress testing?

If done correctly that might also fit in the CI pipeline?

c6burns commented 5 years ago

Read about them here and here if you're unfamiliar with the details

I am familiar with it, and the links are welcome as everyone should know about the awesome stuff in Core. AFAIK it's only available at this time by using portable assemblies dropped into Unity (System.Memory package), even though I know they have an older Corelib assembly hiding inside the Editor. But I suspect that before long we will see Unity shift off the Mono track over to Core (which other major projects are also doing). JB Evain is even working at MS right now on Unity tools. And even with performance characteristics outside Core such as they are it's quite often a win to use Span/Memory to eliminate memory copying. Also a generic buffer pool is available in MemoryPool, which is very handy in avoiding allocations.

Sadly, none of these solve my issues, but that's another story. The library should cater to broader needs and let people like me fend for themselves :D

Personally re: performance metrics I feel this is a simple, non-theoretical issue. Regardless of system specs, the GC is inevitable. The question is merely how often and how long will it run. I'd still love to see a profile of someone getting murdered by a GC spike, but it's not necessary to prove to me the issue is real. I handle IO outside managed, but for everyone who doesn't (and who is considering high concurrency or even just high throughput) this should be of particular concern. Even for myself it's not much good to send/receive IO at high speeds, but pay a latency penalty waiting for managed to read/respond.

miwarnec commented 5 years ago

@martindevans about GC: at the moment we are more worried about the stuff on the issue tracker, dealing with source based weaver, and so on so that we can actually release our indie games without bugs.

paul's game won't fail because of a few GC spikes here and there, neither will uMMORPG. but they both might fail if there are critical bugs that users encounter that makes the game unplayable / or the server hackable. they also might fail if Paul and I have to track down strange buffer sharing bugs for the next month.

there is a hierarchy. for example, if you teleport back 10k years in time and ask a cavemen if he'd like a faster computer, he'd probably ask you if you were nuts and shove you out of his way so he can find some food and not starve to death tonight.

GC matters, but stability comes first.

martindevans commented 5 years ago

they also might fail if Paul and I have to track down strange buffer sharing bugs for the next month.

I guess you didn't actually look at the changes in the related PR? 😕

It's very simple and doesn't introduce any new buffer recycling mechanics or anything complicated/buggy like that. It's just a new public method for sending a packet that doesn't force allocations (by leaving it up to the user to supply a correctly sized buffer in the first place).

c6burns commented 5 years ago

@martindevans in the very least, transport maintainers can implement this PR code directly inside their own classes that inherit from Mirror.Transport. The irony of the fact that at first I didn't want this, and then ended up arguing for it is not lost on me. You especially couldn't have known that Mirror.Transport was functioning as an interface disguised as an abstract base because of Unity weirdness (which honestly I'm still puzzling over myself). This was a good fix that addressed a real issue, and I hope its exclusion doesn't sour you on the project. This code will still be just as good a fix inside the individual transports, who would have needed to implement these API calls anyway.

martindevans commented 5 years ago

in the very least, transport maintainers can implement this PR code directly inside their own classes

Yep, I've actually already submitted (and merged) a PR that does exactly this for Ignorance. Once that releases I will just have to cast the underlying transport into an instance of Ignorance and use it directly in Dissonance.

I hope its exclusion doesn't sour you on the project

It's a simple difference of opinion over priorities and I can totally understand a position of keeping things as simple as possible :)

paulpach commented 5 years ago

@martindevans I did see the PR, I agree it is not earth shattering, but it seems pointless without any sort of benchmark

martindevans commented 5 years ago

Responding to the comment by @paulpach on related PR in Ignorance repo.

it is easy to reopen a PR, but we need to see numbers.

As I said a few comments up what kind of numbers do you want to see? I guess I can demonstrate longer times between GC collections relatively easily, but demonstrating a quicker GC collection will be quite hard (for the reasons outlined in the same comment). Are you happy with a demonstration of less frequent collections?

paulpach commented 5 years ago

@martindevans set up a server where you synchronize 1000 objects with a few network components each, connect a few clients to it. Connect the profiler to your server and show how long it takes to synchronize them.

See for example this: https://github.com/vis2k/Mirror/pull/299#issuecomment-459745122

Obviously your PR is not going to help per se, as by itself it does not get rid of any allocation, but you can make a POC branch where you do the changes you have in mind and profile it.

imerr commented 5 years ago

Here's the scene/setup I used for those if you want it aswell: StressTest.zip Can just be dropped into the Mirror/Examples folder, object visibility is only updated when you touch the slider/spawn objects It's not 100% optimal (you probably want something more regular, less spiky for fine tuning), but it should work fine for showing significant differences

MirrorNetworking / Mirror