Unity-Technologies / com.unity.netcode.gameobjects

Netcode for GameObjects is a high-level netcode SDK that provides networking capabilities to GameObject/MonoBehaviour workflows within Unity and sits on top of underlying transport layer.
MIT License
2.15k stars 435 forks source link

Corrupted network batch on client during stress testing (Received a packet with an invalid Hash Value.) #2893

Closed SamTheBay closed 4 months ago

SamTheBay commented 7 months ago

Description

I am running a single client process and a single server process both on my local machine. I am running a stress test using my game which has hundreds of network objects with many network variables and also does a large number of RPCs. After about 15-30 minutes I will reproduce an issue where the client gets stuck and stops receiving any new network batches from the server. The client gives the following error log at that time...

"Received a packet with an invalid Hash Value. Please report this to the Netcode for GameObjects..."

I have instrumented the networking code to log out information about every batch on the client and the server. The result shows that the client receives a batch in which 20% of it has been reset to 0 instead of the expected messages. Here is an example...

Batch sent by the server 2024-04-20 10:24:29 30199) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages:(10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(5,56)(5,57)(5,57)(5,57)(5,57)(5,57)(5,69)(5,69)

Batch received by the client 2024-04-20 12:53:18 30195) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages: (10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)

You will notice that the batches hash matches (this is the hash sent by the server not the one computed on the client which doesn't match). However, towards the end of the data on the client the messages all turn to type 0 length 0 which is obviously wrong.

Note that the server seems to be totally healthy. If multiple clients are connected, the other clients continue playing fine. Furthermore, disconnecting and reconnecting the affected client temporarily resolves the issue (without restarting the server).

Reproduce Steps

I have not been able to figure out a deterministic set of steps to reproduce the issue. It seems to be a timing condition in the network transport. Even under an abusive workload it can take 30 minutes to reproduce. However, my players are hitting this regularly on my production servers which include many different client PC's and latency distances from the server.

My project is large and not easy to setup, so I am not sure it is practical for me to share my local repro. However, if you have any experiments you would like me to try, I can run them and report back. Furthermore, I could jump on a call and do pair debugging if you want to look at a repro in the debugger.

Actual Outcome

The client becomes stuck and has to be disconnected and re-connected from the server to recover. This is game breaking because the game uses permadeath mechanics so the disconnect is very dangerous.

Expected Outcome

There should not be any combination of events that leads to the client receiving a corrupted batch that it cannot process.

Environment

Additional Context

If having the full logs of messages between the client and server would be helpful, I can provide them.

fluong6 commented 7 months ago

Hi @SamTheBay, thank you for letting us know about this issue.

To better understand the nature of the issue, can you provide us the transport protocol you are using for your project?

If your project is not using UTP (com.unity.transport), can you try it with UTP and see if you are still encountering this issue?

SamTheBay commented 7 months ago

Yes, I am using UTP (version 2.2.1). Attached is a screenshot of my configuration. Note that I have played around with different sizes for the Max Packet Queue Size and Max Payload Size but they don't make a difference. I have also tried increasing the Spawn Timeout to 10 seconds but that didn't help either.

image

fluong6 commented 7 months ago

@NoelStephensUnity Any ideas?

SamTheBay commented 7 months ago

A new finding as I am running more tests today. This time, I see the following message pattern...

Client Receives 2024-04-23 06:04:56 332) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages: (10,34) 2024-04-23 06:04:56 333) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages: (10,36) -- lots of messages over time here -- 2024-04-23 06:20:48 33477) Hash: 2149821300779223542 Size: 4848 Batch Count: 96 Messages: (10,34)(10,36)(10,48)(10,36)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(5,69)(5,69)(5,57)(5,69) 2024-04-23 06:20:48 33478) Hash: 14205836731104327019 Size: 88 Batch Count: 1 Messages: (5,69) 2024-04-23 06:20:48 33479) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages: (10,34) <-- duplicate 2024-04-23 06:20:48 33480) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages: (10,36) <-- duplicate

Server Sends 2024-04-23 06:04:56 331) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages:(10,34) 2024-04-23 06:04:56 332) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages:(10,36) -- lots of messages over time here -- 2024-04-23 06:20:48 33477) Hash: 2149821300779223542 Size: 4848 Batch Count: 96 Messages:(10,34)(10,36)(10,48)(10,36)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(5,69)(5,69)(5,57)(5,69) 2024-04-23 06:20:48 33478) Hash: 14205836731104327019 Size: 88 Batch Count: 1 Messages:(5,69) 2024-04-23 06:20:48 33479) Hash: 13256763382766322572 Size: 4568 Batch Count: 93 Messages:(10,34)(10,36)(10,44)(10,36)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,57)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,45)(10,37)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,45)(10,49)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,45)(10,49)(10,45)(10,45)(10,49)(10,45)(10,45)(5,57) <-- never shows up on client

So, we see that the messages 16439399040642367407 and 10211885624164295971 get replayed on the client after ~16 minutes even though the server did not send them a second time. After this event, the server keeps on sending messages with all the updates, but the client stops receiving any of them. It only receives time sync messages from that point on...

2024-04-23 06:23:44 33625) Hash: 15766265757608103513 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:45 33626) Hash: 6783467869736931475 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:46 33627) Hash: 5330460179395314989 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:47 33628) Hash: 7196747670436366547 Size: 24 Batch Count: 1 Messages: (17,3)

NoelStephensUnity commented 7 months ago

@SamTheBay Could you either share a small project that includes your stress test or the script you are using for the stress test? Would like to try and replicate this on my end, but need to make sure I am replicating the exact data sent using the exact RPCs (I am assuming you are using the legacy ClientRpc/ServeRpc as opposed to the newer Universal Rpcs).

Once I can replicate the issue, then I can add some more debug information to determine where the duplicated messages are coming from.

SamTheBay commented 7 months ago

@NoelStephensUnity right now my stress test runs my full game with one client and one server. It then continually generates groups of enemies which attack the player and then eventually die. The repro will happen within 15-30 minutes of that running. Let me try hacking it down to something smaller to see if I can get an easier repro to share. Worst case, I can work on making changes so that I can share the whole project without your needing access to the databases. You are correct that I am using the ClientRpc/ServeRpc (and Network Variables and Network Lists).

NoelStephensUnity commented 7 months ago

@SamTheBay All good... don't want you to work too hard it just helps expedite finding where the issue could be. If it turns out to be too big of a task on your side, you can:

You should get a bug ticket number via email when it is received and added to the internal QA DB. If you post that here I can get QA to migrate that issue into the NGO internal bug system and then just run the stress test on my end.
(of course, providing instructions on how to kick off the stress test will also help me get an investigation moving forward faster)

SamTheBay commented 7 months ago

@NoelStephensUnity I have submitted the bug report along with the full project and instructions on how to run the stress test to reproduce the issue. Let me now if you need anything additional from my end. Here is the bug report id...

IN-75033 - Corrupt messages using Unity Transport

Thanks!

NoelStephensUnity commented 7 months ago

@SamTheBay Awesome! This will help us track things down. Will let you know when the issue has been found and fixed (currently it is either in UTP Transport or possibly NGO UnitTransport).

SamTheBay commented 7 months ago

@NoelStephensUnity were you able to reproduce the issue using the project I sent?

NoelStephensUnity commented 7 months ago

@SamTheBay I have deferred this to @ShadauxCat since it is believed to be part of an under investigation issue with UnityTransport. This stress test should help her narrow down the issue faster, but I think she won't be able to get to this until sometime next week. Your submission is getting migrated over into the multiplayer bugs db (possibly today) and so it is definitely in a high priority queue (the issue in general).

SamTheBay commented 6 months ago

@NoelStephensUnity @ShadauxCat is there any update on finding the root cause or an ETA for this? I saw from the bug report that it looks like you have successfully reproduced the issue with the project I sent which is great. I am losing players over this one, so I am very anxious for a fix or work around. Let me know if there is anything else I can provide you that would help. Thanks!

ShadauxCat commented 6 months ago

Hi @SamTheBay - I have not been able to start investigating this yet due to other work that has been on my plate, but this bug is a high priority for us. I expect to start investigating it by the end of this week. I've only just joined the Transport team and the subject matter expert is currently on leave, so it may take me some time to really track down what is going on here, but it is the next item on my todo list.

SamTheBay commented 6 months ago

@ShadauxCat any update?

ShadauxCat commented 6 months ago

I'm actively working on it as we speak, but I don't have any updates to share. I'm unable to reproduce it consistently (I've only managed to reproduce it twice so far in two days of running the stress test almost non stop) so it's slow going trying to get data on what's going on here. I'll let you know as soon as I have anything to share.

SamTheBay commented 6 months ago

@ShadauxCat any progress?

ShadauxCat commented 6 months ago

@SamTheBay I'm still having a really hard time reproducing it. I don't know if it's something about my hardware causing the network conditions to be different, or the logging I've added to try to track down what's going on changing the timing, or what, but it just won't reproduce for me at all now.

If you have time and want to help, it would be really useful if you could pull down a branch with some logging in it, reproduce it (if you can), and then send me the logs from the client the error happened on + the server. (You might have to run it like <executable> -logFile server.log and <executable> -logFile client.log to make sure they go to different log files.)

To get my branch, go to package manager, add from git URL, and use https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_full_codepath_logging as the url.

If I can just get a trace of what's going on when the error happens, it would help me immensely in coming up with a fix.

SamTheBay commented 6 months ago

Ok. I will give it a try soon and let you know. It could be that the tracing itself is changing the timing enough to prevent a repro. We will see if it will repro for me after I switch to your branch.

ShadauxCat commented 6 months ago

Oh, I should warn you, it's a LOT of log output. Please turn off stack traces in your build or it might get to gigabytes of log data.

SamTheBay commented 6 months ago

Ok. I am getting two errors building after grabbing your git...

Library\PackageCache\com.unity.netcode.gameobjects@ca5a2701ba\Editor\AnticipatedNetworkTransformEditor.cs(9,26): error CS0246: The type or namespace name 'AnticipatedNetworkTransform' could not be found (are you missing a using directive or an assembly reference?) Library\PackageCache\com.unity.netcode.gameobjects@ca5a2701ba\Editor\AnticipatedNetworkTransformEditor.cs(12,30): error CS0115: 'AnticipatedNetworkTransformEditor.HideInterpolateValue': no suitable method found to override

Any idea how I should fix these?

ShadauxCat commented 6 months ago

That is unusual. This branch is created directly off of the 1.9.1 release branch with just some log lines added, and AnticipatedNetworkTransform is definitely there in the branch. Do you maybe have a local version of the repo embedded in your project that's conflicting with the one added through package manager?

If not, maybe try closing the editor, deleting your library directory, and launching again to clear the cache?

SamTheBay commented 6 months ago

Oh... yes. I have a local version from my own debugging. My bad. Let me remove it and try again.

ShadauxCat commented 6 months ago

If you aren't able to get it with that branch, can you try https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_lighter_codepath_logging_1.8? It's based on 1.8 instead of 1.9 (since you're using 1.8) and it has reduced logging to try and reduce the impact of logging on the repro.

ShadauxCat commented 6 months ago

And if that one also doesn't work, try https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_even_lighter_codepath_logging_1.8

ShadauxCat commented 6 months ago

Another thing that would be valuable to know - you said you reproduced it with 1.7, can you also reproduce it with 1.6? There were some changes in areas that might be relevant to this issue in 1.7. If it's not reproducible in 1.6, that at least would narrow down the places I need to look to figure out what's happening.

SamTheBay commented 6 months ago

So far, I haven't been able to repro it with the heavy logging. I will try with your lighter logging tomorrow and I will also try with 1.6 and report back.

ShadauxCat commented 6 months ago

I have some coworkers who have been able to reproduce the issue with another project (I haven't been able to reproduce it on my hardware, but they have). I had them test 1.6 yesterday and they confirmed it still happens with 1.6, so you can skip that test.

I added some new logs to the "even lighter" branch this morning. I think that branch is the best bet right now. I think it has the minimum amount of logs possible to still give me the info I need. So I'd say focus on that one today. We're doing some tests internally using said coworkers' project as well, so hopefully between you and them, I'll be able to get what I need to start making progress on a fix.

SamTheBay commented 6 months ago

Great. I have been trying to repro it today with my base build but haven't been successful so far :-(. I am running experiments to see if I can find a way to make it repro more reliably. If I manage that then I will try running your light wight logging.

SamTheBay commented 6 months ago

@ShadauxCat were you able to make any progress based on the repro from another project? I have been struggling to reliably repro it now on my side and don't know what changed. If the other project gives the information you need then I might not need to keep experimenting on my side.

SamTheBay commented 5 months ago

@ShadauxCat checking in once more. Is a repro from me still important given you have one from another project?

ShadauxCat commented 5 months ago

Hey, sorry, I've been out of office this week. Someone else has been continuing the investigation, though, and I saw a message from them this morning that they finally managed to isolate the cause. Now that we know exactly what the logic error is, we can hopefully make fast progress on putting together a fix.

SamTheBay commented 5 months ago

This is amazing news! Thanks for the update. If you come up with a fix in a private branch the let me know since I would be anxious to adopt it if the full release process takes longer.

COV-KaiGai commented 5 months ago

@ShadauxCat Hey, we are experiencing the same issue in our project. When do you expect to have it fixed? Also, once it’s resolved, could you share the fix or a private branch with us?

COV-KaiGai commented 5 months ago

@SamTheBay Issue #2840 Hi there, I’m not sure if this fix will work for you, but I managed to resolve a similar issue by setting the library versions as mentioned in the issue above. Updating Unity Transport (UTP) version to 2.1.0 seemed to be the key. Hope this helps!

@ShadauxCat We are still waiting for an official fix for this issue, and would greatly appreciate it if you could address it soon.

ShadauxCat commented 5 months ago

Hi @SamTheBay and @COV-KaiGai

I believe we've found a fix for this. Initial testing looks good so far.

Unfortunately I can't give you a branch to test on because the fix is in the transport package, which doesn't have a public repo, but we'll get a release out as fast as we can and I'll let you know when that lands.

SamTheBay commented 5 months ago

@ShadauxCat any update on the release for this? Will it be released as a new package version for the Unity Transport? Like version 2.2.2 for example?

ShadauxCat commented 5 months ago

@SamTheBay Transport version 2.3.0 has released today that should have a fix for this issue. Can you retest and confirm for us?

SamTheBay commented 5 months ago

This is great! I have upgraded and deployed a new build out to my players. I will wait a few days to see if there are any new repro's.

SamTheBay commented 4 months ago

@ShadauxCat I have not heard of a repro for a couple weeks now so I think this looks resolved. Thanks for the support. I will go ahead and close it.