Closed SamTheBay closed 4 months ago
Hi @SamTheBay, thank you for letting us know about this issue.
To better understand the nature of the issue, can you provide us the transport protocol you are using for your project?
If your project is not using UTP (com.unity.transport), can you try it with UTP and see if you are still encountering this issue?
Yes, I am using UTP (version 2.2.1). Attached is a screenshot of my configuration. Note that I have played around with different sizes for the Max Packet Queue Size and Max Payload Size but they don't make a difference. I have also tried increasing the Spawn Timeout to 10 seconds but that didn't help either.
@NoelStephensUnity Any ideas?
A new finding as I am running more tests today. This time, I see the following message pattern...
Client Receives 2024-04-23 06:04:56 332) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages: (10,34) 2024-04-23 06:04:56 333) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages: (10,36) -- lots of messages over time here -- 2024-04-23 06:20:48 33477) Hash: 2149821300779223542 Size: 4848 Batch Count: 96 Messages: (10,34)(10,36)(10,48)(10,36)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(5,69)(5,69)(5,57)(5,69) 2024-04-23 06:20:48 33478) Hash: 14205836731104327019 Size: 88 Batch Count: 1 Messages: (5,69) 2024-04-23 06:20:48 33479) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages: (10,34) <-- duplicate 2024-04-23 06:20:48 33480) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages: (10,36) <-- duplicate
Server Sends 2024-04-23 06:04:56 331) Hash: 16439399040642367407 Size: 56 Batch Count: 1 Messages:(10,34) 2024-04-23 06:04:56 332) Hash: 10211885624164295971 Size: 56 Batch Count: 1 Messages:(10,36) -- lots of messages over time here -- 2024-04-23 06:20:48 33477) Hash: 2149821300779223542 Size: 4848 Batch Count: 96 Messages:(10,34)(10,36)(10,48)(10,36)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(5,69)(5,69)(5,57)(5,69) 2024-04-23 06:20:48 33478) Hash: 14205836731104327019 Size: 88 Batch Count: 1 Messages:(5,69) 2024-04-23 06:20:48 33479) Hash: 13256763382766322572 Size: 4568 Batch Count: 93 Messages:(10,34)(10,36)(10,44)(10,36)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,45)(10,49)(10,45)(10,37)(10,45)(10,45)(10,36)(10,45)(10,37)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,37)(10,57)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,36)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,45)(10,45)(10,37)(10,45)(10,49)(10,49)(10,49)(10,49)(10,37)(10,49)(10,49)(10,45)(10,49)(10,45)(10,49)(10,45)(10,45)(10,45)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,49)(10,45)(10,49)(10,45)(10,45)(10,49)(10,45)(10,45)(5,57) <-- never shows up on client
So, we see that the messages 16439399040642367407 and 10211885624164295971 get replayed on the client after ~16 minutes even though the server did not send them a second time. After this event, the server keeps on sending messages with all the updates, but the client stops receiving any of them. It only receives time sync messages from that point on...
2024-04-23 06:23:44 33625) Hash: 15766265757608103513 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:45 33626) Hash: 6783467869736931475 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:46 33627) Hash: 5330460179395314989 Size: 24 Batch Count: 1 Messages: (17,3) 2024-04-23 06:23:47 33628) Hash: 7196747670436366547 Size: 24 Batch Count: 1 Messages: (17,3)
@SamTheBay Could you either share a small project that includes your stress test or the script you are using for the stress test? Would like to try and replicate this on my end, but need to make sure I am replicating the exact data sent using the exact RPCs (I am assuming you are using the legacy ClientRpc/ServeRpc as opposed to the newer Universal Rpcs).
Once I can replicate the issue, then I can add some more debug information to determine where the duplicated messages are coming from.
@NoelStephensUnity right now my stress test runs my full game with one client and one server. It then continually generates groups of enemies which attack the player and then eventually die. The repro will happen within 15-30 minutes of that running. Let me try hacking it down to something smaller to see if I can get an easier repro to share. Worst case, I can work on making changes so that I can share the whole project without your needing access to the databases. You are correct that I am using the ClientRpc/ServeRpc (and Network Variables and Network Lists).
@SamTheBay All good... don't want you to work too hard it just helps expedite finding where the issue could be. If it turns out to be too big of a task on your side, you can:
You should get a bug ticket number via email when it is received and added to the internal QA DB. If you post that here I can get QA to migrate that issue into the NGO internal bug system and then just run the stress test on my end.
(of course, providing instructions on how to kick off the stress test will also help me get an investigation moving forward faster)
@NoelStephensUnity I have submitted the bug report along with the full project and instructions on how to run the stress test to reproduce the issue. Let me now if you need anything additional from my end. Here is the bug report id...
IN-75033 - Corrupt messages using Unity Transport
Thanks!
@SamTheBay Awesome! This will help us track things down. Will let you know when the issue has been found and fixed (currently it is either in UTP Transport or possibly NGO UnitTransport).
@NoelStephensUnity were you able to reproduce the issue using the project I sent?
@SamTheBay I have deferred this to @ShadauxCat since it is believed to be part of an under investigation issue with UnityTransport. This stress test should help her narrow down the issue faster, but I think she won't be able to get to this until sometime next week. Your submission is getting migrated over into the multiplayer bugs db (possibly today) and so it is definitely in a high priority queue (the issue in general).
@NoelStephensUnity @ShadauxCat is there any update on finding the root cause or an ETA for this? I saw from the bug report that it looks like you have successfully reproduced the issue with the project I sent which is great. I am losing players over this one, so I am very anxious for a fix or work around. Let me know if there is anything else I can provide you that would help. Thanks!
Hi @SamTheBay - I have not been able to start investigating this yet due to other work that has been on my plate, but this bug is a high priority for us. I expect to start investigating it by the end of this week. I've only just joined the Transport team and the subject matter expert is currently on leave, so it may take me some time to really track down what is going on here, but it is the next item on my todo list.
@ShadauxCat any update?
I'm actively working on it as we speak, but I don't have any updates to share. I'm unable to reproduce it consistently (I've only managed to reproduce it twice so far in two days of running the stress test almost non stop) so it's slow going trying to get data on what's going on here. I'll let you know as soon as I have anything to share.
@ShadauxCat any progress?
@SamTheBay I'm still having a really hard time reproducing it. I don't know if it's something about my hardware causing the network conditions to be different, or the logging I've added to try to track down what's going on changing the timing, or what, but it just won't reproduce for me at all now.
If you have time and want to help, it would be really useful if you could pull down a branch with some logging in it, reproduce it (if you can), and then send me the logs from the client the error happened on + the server. (You might have to run it like <executable> -logFile server.log
and <executable> -logFile client.log
to make sure they go to different log files.)
To get my branch, go to package manager, add from git URL, and use https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_full_codepath_logging
as the url.
If I can just get a trace of what's going on when the error happens, it would help me immensely in coming up with a fix.
Ok. I will give it a try soon and let you know. It could be that the tracing itself is changing the timing enough to prevent a repro. We will see if it will repro for me after I switch to your branch.
Oh, I should warn you, it's a LOT of log output. Please turn off stack traces in your build or it might get to gigabytes of log data.
Ok. I am getting two errors building after grabbing your git...
Library\PackageCache\com.unity.netcode.gameobjects@ca5a2701ba\Editor\AnticipatedNetworkTransformEditor.cs(9,26): error CS0246: The type or namespace name 'AnticipatedNetworkTransform' could not be found (are you missing a using directive or an assembly reference?) Library\PackageCache\com.unity.netcode.gameobjects@ca5a2701ba\Editor\AnticipatedNetworkTransformEditor.cs(12,30): error CS0115: 'AnticipatedNetworkTransformEditor.HideInterpolateValue': no suitable method found to override
Any idea how I should fix these?
That is unusual. This branch is created directly off of the 1.9.1 release branch with just some log lines added, and AnticipatedNetworkTransform is definitely there in the branch. Do you maybe have a local version of the repo embedded in your project that's conflicting with the one added through package manager?
If not, maybe try closing the editor, deleting your library directory, and launching again to clear the cache?
Oh... yes. I have a local version from my own debugging. My bad. Let me remove it and try again.
If you aren't able to get it with that branch, can you try https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_lighter_codepath_logging_1.8
? It's based on 1.8 instead of 1.9 (since you're using 1.8) and it has reduced logging to try and reduce the impact of logging on the repro.
And if that one also doesn't work, try https://github.com/Unity-Technologies/com.unity.netcode.gameobjects.git?path=/com.unity.netcode.gameobjects#debug/utp_even_lighter_codepath_logging_1.8
Another thing that would be valuable to know - you said you reproduced it with 1.7, can you also reproduce it with 1.6? There were some changes in areas that might be relevant to this issue in 1.7. If it's not reproducible in 1.6, that at least would narrow down the places I need to look to figure out what's happening.
So far, I haven't been able to repro it with the heavy logging. I will try with your lighter logging tomorrow and I will also try with 1.6 and report back.
I have some coworkers who have been able to reproduce the issue with another project (I haven't been able to reproduce it on my hardware, but they have). I had them test 1.6 yesterday and they confirmed it still happens with 1.6, so you can skip that test.
I added some new logs to the "even lighter" branch this morning. I think that branch is the best bet right now. I think it has the minimum amount of logs possible to still give me the info I need. So I'd say focus on that one today. We're doing some tests internally using said coworkers' project as well, so hopefully between you and them, I'll be able to get what I need to start making progress on a fix.
Great. I have been trying to repro it today with my base build but haven't been successful so far :-(. I am running experiments to see if I can find a way to make it repro more reliably. If I manage that then I will try running your light wight logging.
@ShadauxCat were you able to make any progress based on the repro from another project? I have been struggling to reliably repro it now on my side and don't know what changed. If the other project gives the information you need then I might not need to keep experimenting on my side.
@ShadauxCat checking in once more. Is a repro from me still important given you have one from another project?
Hey, sorry, I've been out of office this week. Someone else has been continuing the investigation, though, and I saw a message from them this morning that they finally managed to isolate the cause. Now that we know exactly what the logic error is, we can hopefully make fast progress on putting together a fix.
This is amazing news! Thanks for the update. If you come up with a fix in a private branch the let me know since I would be anxious to adopt it if the full release process takes longer.
@ShadauxCat Hey, we are experiencing the same issue in our project. When do you expect to have it fixed? Also, once it’s resolved, could you share the fix or a private branch with us?
@SamTheBay Issue #2840 Hi there, I’m not sure if this fix will work for you, but I managed to resolve a similar issue by setting the library versions as mentioned in the issue above. Updating Unity Transport (UTP) version to 2.1.0 seemed to be the key. Hope this helps!
@ShadauxCat We are still waiting for an official fix for this issue, and would greatly appreciate it if you could address it soon.
Hi @SamTheBay and @COV-KaiGai
I believe we've found a fix for this. Initial testing looks good so far.
Unfortunately I can't give you a branch to test on because the fix is in the transport package, which doesn't have a public repo, but we'll get a release out as fast as we can and I'll let you know when that lands.
@ShadauxCat any update on the release for this? Will it be released as a new package version for the Unity Transport? Like version 2.2.2 for example?
@SamTheBay Transport version 2.3.0 has released today that should have a fix for this issue. Can you retest and confirm for us?
This is great! I have upgraded and deployed a new build out to my players. I will wait a few days to see if there are any new repro's.
@ShadauxCat I have not heard of a repro for a couple weeks now so I think this looks resolved. Thanks for the support. I will go ahead and close it.
Description
I am running a single client process and a single server process both on my local machine. I am running a stress test using my game which has hundreds of network objects with many network variables and also does a large number of RPCs. After about 15-30 minutes I will reproduce an issue where the client gets stuck and stops receiving any new network batches from the server. The client gives the following error log at that time...
"Received a packet with an invalid Hash Value. Please report this to the Netcode for GameObjects..."
I have instrumented the networking code to log out information about every batch on the client and the server. The result shows that the client receives a batch in which 20% of it has been reset to 0 instead of the expected messages. Here is an example...
Batch sent by the server 2024-04-20 10:24:29 30199) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages:(10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,50)(10,49)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(5,56)(5,57)(5,57)(5,57)(5,57)(5,57)(5,69)(5,69)
Batch received by the client 2024-04-20 12:53:18 30195) Hash: 2898211147757380501 Size: 5728 Batch Count: 111 Messages: (10,34)(10,48)(10,46)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,46)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,36)(10,45)(10,49)(10,49)(10,50)(10,45)(10,45)(10,50)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,50)(10,49)(10,45)(10,46)(10,49)(10,45)(10,38)(10,45)(10,49)(10,45)(10,45)(10,49)(10,49)(10,49)(10,45)(10,49)(10,49)(10,45)(10,45)(10,36)(10,49)(10,49)(10,49)(10,49)(10,49)(10,49)(10,45)(10,45)(10,49)(10,46)(10,49)(10,49)(10,49)(10,49)(10,49)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)(0,0)
You will notice that the batches hash matches (this is the hash sent by the server not the one computed on the client which doesn't match). However, towards the end of the data on the client the messages all turn to type 0 length 0 which is obviously wrong.
Note that the server seems to be totally healthy. If multiple clients are connected, the other clients continue playing fine. Furthermore, disconnecting and reconnecting the affected client temporarily resolves the issue (without restarting the server).
Reproduce Steps
I have not been able to figure out a deterministic set of steps to reproduce the issue. It seems to be a timing condition in the network transport. Even under an abusive workload it can take 30 minutes to reproduce. However, my players are hitting this regularly on my production servers which include many different client PC's and latency distances from the server.
My project is large and not easy to setup, so I am not sure it is practical for me to share my local repro. However, if you have any experiments you would like me to try, I can run them and report back. Furthermore, I could jump on a call and do pair debugging if you want to look at a repro in the debugger.
Actual Outcome
The client becomes stuck and has to be disconnected and re-connected from the server to recover. This is game breaking because the game uses permadeath mechanics so the disconnect is very dangerous.
Expected Outcome
There should not be any combination of events that leads to the client receiving a corrupted batch that it cannot process.
Environment
Additional Context
If having the full logs of messages between the client and server would be helpful, I can provide them.