Closed timfel closed 8 years ago
If, however, we need the randomness, then we invite the low-level scary monsters. All the out-of-sync hell seems to result from the need to produce exactly the same randomness on each participant's computer.
I don't think the normal way of using randomness (SyncRand()) is causing any issues. In Wyrmsun, entire maps are generated randomly with synchronized randomness, and all players end up with exactly the same map. The thing is, the random seed is changed whenever a random number is generated, so the randomness is probably just telling us that an action happened on one computer but not the other, rather than causing the desync itself.
> In Wyrmsun, entire maps are generated randomly with synchronized randomness, and all players end up with exactly the same map.
@Andrettin, maybe they just seem the same? Do computers compare them?
Are mine locations compared across the clients in the case of randomly generated maps? If they are not, maybe my peasants walk 10 cm in my game and 11 cm in yours? At the beginning there's no problem for either of us, so the game continues. But I collect money faster. Then I send you the command that I am building the Ogre Mound, which you cannot replay in your game, since there's not enough money in my Town Hall on your computer.
Heh, I didn't quite get the feeling of their "butterfly and tsunami" example and how hard it is to find the origin of the problem until I came up with my own example above :) Checksumming everything we can and comparing it indeed seems essential, because we can probably invent more devilish examples like the one above.
> Are mine locations compared across the clients in the case of randomly generated maps? If they are not, maybe my peasants walk 10 cm in my game and 11 cm in yours? At the beginning there's no problem for either of us, so the game continues. But I collect money faster. Then I send you the command that I am building the Ogre Mound, which you cannot replay in your game, since there's not enough money in my Town Hall on your computer.
If the random seed were different, then the building would likely be in a completely different location, rather than just slightly different. Also, since the engine is tile-based a position difference would have quite drastic effects on gameplay, which would quickly cause a desync.
In any case, I've pushed a commit fixing the MyRand() issue; this removes at least one desync cause.
> If the random seed were different, then the building would likely be in a completely different location, rather than just slightly different.
I agree. I am not sure I am talking about the random seed, because it is an integer. I am just saying that the places where floats appear should be double-checked, since they are not guaranteed to produce the same results on different computers. It is enough that one client's AI computes float z=sin(x) and sends a critter (I would just remove all these annoying bastards from the game! :) 0.0001 radian too far to the north, and after 10 minutes someone cannot build a farm there. And who controls critters? The AI controls them. So they wouldn't be sent through the network, right? So my game has my critters, and your game has yours, and we never check their locations against each other. The next test I try will be a map full of critters :)
And, but this is just a guess, the floats could work better for some operations, maybe for sqrt() - maybe there the difference in floats doesn't matter that much. And maybe sqrt() is for map generation, so maps are generated just fine. But maybe sin() has such an implementation on my hardware/software, that my results are just too much different from yours.
> In any case, I've pushed a commit fixing the MyRand() issue; this removes at least one desync cause.
Thanks! It's good that this issue is so alive lately :) If there's an idea that some debugging environment should be built in Python or similar, I could be part of it.
Karol Kreński
@Andrettin I had replaced all the MyRand calls with SyncRand calls in the network-issues branch; that doesn't fix it. Also, afaict unit direction isn't used at all in determining unit choices. The problem I'm tracing it back to right now is sub-tile movement, which I traced further back to floating-point randomness.
And (also on that branch) I fixed the RandomSeed changing, because that isn't actually deterministic either, even though it's just dealing with unsigned. The bit-width of that type is not fixed across compilers and platforms, and overflow behavior is not the same at different optimization levels of gcc.
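To illustrate the point about type width, here is a minimal sketch of a generator that uses fixed-width types only. It is not the Stratagus SyncRand() code: the names `SyncRng`, `next` and `nextBelow` are hypothetical, and the constants are the common Numerical Recipes LCG parameters, chosen purely for illustration. The multiplication is done in 64 bits and truncated explicitly, so the result should not depend on the platform's `int` width or on optimization level.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical synchronized generator; NOT the Stratagus implementation.
struct SyncRng {
    std::uint32_t state;

    explicit SyncRng(std::uint32_t seed) : state(seed) {}

    // One linear congruential step, computed in 64 bits and truncated
    // to 32 bits so that every platform gets the same wrap-around.
    std::uint32_t next() {
        const std::uint64_t wide =
            static_cast<std::uint64_t>(state) * 1664525u + 1013904223u;
        state = static_cast<std::uint32_t>(wide); // keeps the low 32 bits
        return state;
    }

    // Value in [0, max); uses the higher bits, which are better mixed in an LCG.
    std::uint32_t nextBelow(std::uint32_t max) {
        assert(max > 0);
        return (next() >> 16) % max;
    }
};
```

Two peers that start from the same seed and make the same number of calls end up with the same `state`, so comparing `state` across the network also reveals when one peer made a call the other did not.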
If you dump all unit info from the server and clients (-i) on a map that is ~60% full of units that are moving about, I find sub-tile deviations already after just 200 cycles.
Also, desyncs happen even without critters, in an otherwise empty map with only peasants and trees. The desync happens when the sub-tile differences amount to enough so that the AI sends a different peasant to harvest.
Also, I have not been able to reproduce this when I run only Visual Studio debug builds. One reason may be that floating point arithmetic is by default forced to be precise, even if that is slower. On gcc, on the other hand, it is not so easy to force that.
> Also, desyncs happen even without critters, in an otherwise empty map
No critters, but the tiles this time, fine. But we haven't looked at the critters' potential just yet :) Basically we should expect multiple causes of out-of-sync - this is what other developers experienced.
Karol Kreński
@timfel Great to hear that you've made so much progress in pinpointing the issue!
One question about this, though:
> And (also on that branch) I fixed the RandomSeed changing
Wouldn't that mean that SyncRand() will always return the same number?
@Andrettin No, I merely changed SyncRand to use 32-bit clamped operations (without the multiplication, which could previously result in undefined behavior due to overflowing and then clamping on the network).
Ah, I see, thanks :)
> Wouldn't that mean that SyncRand() will always return the same number?
I don't know if it comes from SyncRand(), but there is some fixed number. This is my log_of_stratagus_0.log in stratagus (network-issues branch). It starts from another number, but quickly stalls at 16700 in the two games I tried.
2: 17 unit-peasant 1 P0 Refs 1: 16700 56,110 0,0
2: 18 unit-peasant 1 P0 Refs 1: 16700 57,110 0,0
2: 19 unit-peasant 1 P0 Refs 1: 16700 56,111 0,0
2: 20 unit-peasant 1 P0 Refs 1: 16700 57,111 0,0
(...)
14529: 39 unit-critter 1 P15 Refs 1: 16700 30,66 0,0
14529: 40 unit-oil-patch 1 P15 Refs 1: 16700 89,49 0,0
14529: 41 unit-oil-patch 1 P15 Refs 1: 16700 17,113 0,0
I move my units around, but 16700 never changes. And it wasn't like that before - 16700 was changing to various values.
> @timfel Great to hear that you've made so much progress in pinpointing the issue!
That's right, Tim! :)
Karol Kreński
@mimooh SyncRand still needs tweaking; that stalling is expected with the current code. I just wanted to see if the general approach would change anything.
@Andrettin I have printf-debugged my way back from the desync I'm seeing to AStar. The diff below compares the logs from client and server. You can see AStar generating different paths, which eventually leads to the unit (number 738) moving differently and ending up on a different tile, causing the desync:
20: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-20: 738 unit-peasant P12: AStarFindPath pos:204,66 goal:193,67 goalsz:3,3 unitsz:1,1, range:0,1, path: { 6, 6, 7, 5, 6, 6, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-20: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7, 5, 6, 6, 6, 5 }
+20: 738 unit-peasant P12: AStarFindPath pos:204,66 goal:193,67 goalsz:3,3 unitsz:1,1, range:0,1, path: { 5, 6, 7, 5, 6, 6, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
+20: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7, 5, 6, 6, 6, 5 }
20: 738 unit-peasant Moving P12 through NextPathElement, result is: 8, posd.x/y: -1,1
20: 738 unit-peasant Moving P12 posdx/y: -1,1 before heading up ix/y: 32,-32
20: 738 unit-peasant Moving P12 posdx/y: -1,1 after heading up ix/y: 32,-32
69: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-69: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7, 5, 6, 6, 6 }
+69: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7, 5, 6, 6, 6 }
69: 738 unit-peasant Moving P12 through NextPathElement, result is: 7, posd.x/y: -1,0
69: 738 unit-peasant Moving P12 posdx/y: -1,0 before heading up ix/y: 32,0
69: 738 unit-peasant Moving P12 posdx/y: -1,0 after heading up ix/y: 32,0
118: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-118: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7, 5, 6, 6 }
+118: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7, 5, 6, 6 }
118: 738 unit-peasant Moving P12 through NextPathElement, result is: 6, posd.x/y: -1,0
118: 738 unit-peasant Moving P12 posdx/y: -1,0 before heading up ix/y: 32,0
118: 738 unit-peasant Moving P12 posdx/y: -1,0 after heading up ix/y: 32,0
167: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-167: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7, 5, 6 }
+167: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7, 5, 6 }
167: 738 unit-peasant Moving P12 through NextPathElement, result is: 5, posd.x/y: -1,0
167: 738 unit-peasant Moving P12 posdx/y: -1,0 before heading up ix/y: 32,0
167: 738 unit-peasant Moving P12 posdx/y: -1,0 after heading up ix/y: 32,0
216: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-216: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7, 5 }
+216: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7, 5 }
216: 738 unit-peasant Moving P12 through NextPathElement, result is: 4, posd.x/y: -1,1
216: 738 unit-peasant Moving P12 posdx/y: -1,1 before heading up ix/y: 32,-32
216: 738 unit-peasant Moving P12 posdx/y: -1,1 after heading up ix/y: 32,-32
265: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-265: 738 unit-peasant Moving P12, output Path is: { 6, 6, 7 }
+265: 738 unit-peasant Moving P12, output Path is: { 5, 6, 7 }
265: 738 unit-peasant Moving P12 through NextPathElement, result is: 3, posd.x/y: -1,-1
265: 738 unit-peasant Moving P12 posdx/y: -1,-1 before heading up ix/y: 32,32
265: 738 unit-peasant Moving P12 posdx/y: -1,-1 after heading up ix/y: 32,32
314: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-314: 738 unit-peasant Moving P12, output Path is: { 6, 6 }
+314: 738 unit-peasant Moving P12, output Path is: { 5, 6 }
314: 738 unit-peasant Moving P12 through NextPathElement, result is: 2, posd.x/y: -1,0
314: 738 unit-peasant Moving P12 posdx/y: -1,0 before heading up ix/y: 32,0
314: 738 unit-peasant Moving P12 posdx/y: -1,0 after heading up ix/y: 32,0
363: 738 unit-peasant Getting resource at P12 to -1,-1 or 194,68
-363: 738 unit-peasant Moving P12, output Path is: { 6 }
-363: 738 unit-peasant Moving P12 through NextPathElement, result is: 1, posd.x/y: -1,0
-363: 738 unit-peasant Moving P12 posdx/y: -1,0 before heading up ix/y: 32,0
-363: 738 unit-peasant Moving P12 posdx/y: -1,0 after heading up ix/y: 32,0
-363: 738 unit-peasant Moving P12 posdx/y: -1,0 move:192 unitDir:4, ix/y: 32,0
-363: 738 unit-peasant 19 P12 Refs 3: 16700 196,67 28,0
+363: 738 unit-peasant Moving P12, output Path is: { 5 }
+363: 738 unit-peasant Moving P12 through NextPathElement, result is: 1, posd.x/y: -1,1
+363: 738 unit-peasant Moving P12 posdx/y: -1,1 before heading up ix/y: 32,-32
+363: 738 unit-peasant Moving P12 posdx/y: -1,1 after heading up ix/y: 32,-32
+363: 738 unit-peasant Moving P12 posdx/y: -1,1 move:160 unitDir:4, ix/y: 32,-32
+363: 738 unit-peasant 19 P12 Refs 3: 16700 196,68 28,-28
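The comparison shown in the diff above can be automated so that the first diverging log line is reported directly. This is only a sketch, assuming both logs have already been read into memory line by line; `FirstDivergence` and `kNoDivergence` are hypothetical names and not part of Stratagus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sentinel returned when both logs are identical (hypothetical name).
const std::size_t kNoDivergence = static_cast<std::size_t>(-1);

// Returns the index of the first line that differs between the two logs.
// If one log is a strict prefix of the other, returns the shorter length.
inline std::size_t FirstDivergence(const std::vector<std::string> &server,
                                   const std::vector<std::string> &client) {
    const std::size_t common = std::min(server.size(), client.size());
    const auto firstDiff = std::mismatch(
        server.begin(),
        server.begin() + static_cast<std::ptrdiff_t>(common),
        client.begin());
    const std::size_t index =
        static_cast<std::size_t>(firstDiff.first - server.begin());
    assert(index <= common); // mismatch never runs past the common prefix
    if (index == common && server.size() == client.size()) {
        return kNoDivergence;
    }
    return index;
}
```

For the logs above, such a check would point at the `AStarFindPath` line of cycle 20, which is where the paths first differ, long before the units end up on different tiles at cycle 363.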
@timfel Ah, I see. Is there any reason we can't use integers instead of floats in those calculations?
I don't see where we do use floats in there, a quick search didn't show anything obvious to me.
Is sin() not problematic for stratagus?
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ In 32-bit Ubuntu 12.04 with glibc 2.15 the run-time sin() would use fsin making for significant differences depending on whether sin() was calculated at run-time or compile-time. On Ubuntu 12.04 this code does not print 1.0:
const double pi_d = 3.14159265358979323846;
const int zero = argc / 99;
printf("%f\n", sin(pi_d + zero) / sin(pi_d));
0.999967
Stratagus chunkparticle.cpp uses: float radians = deg2rad(MyRand() % 360); direction.y = sin(radians);
Also, I guess Stratagus doesn't have this, which seems like a very useful feature:
http://stackoverflow.com/questions/20963419/cross-platform-floating-point-consistancy In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
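The per-frame state hash described in that answer could look roughly like the following. This is a sketch under stated assumptions, not existing Stratagus code: `UnitState`, `HashU32` and `HashGameState` are hypothetical names, the struct holds only a handful of integer fields, and FNV-1a is simply one convenient hash to use here. Each peer would compute the hash for the same cycle and compare the values over the network.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, reduced unit state; a real engine has many more fields.
struct UnitState {
    std::int32_t id;
    std::int32_t tileX, tileY;     // tile position
    std::int32_t offsetX, offsetY; // sub-tile offset
};

// Feed one 32-bit value into an FNV-1a hash, byte by byte in a fixed
// order, so the result does not depend on host endianness.
inline std::uint64_t HashU32(std::uint64_t hash, std::uint32_t value) {
    for (int shift = 0; shift < 32; shift += 8) {
        hash ^= (value >> shift) & 0xFFu;
        hash *= 1099511628211ull; // FNV-1a 64-bit prime
    }
    return hash;
}

// Hash the cycle number and all units. Every peer must visit the units in
// the same order, so this sketch requires them to be sorted by id.
inline std::uint64_t HashGameState(std::uint32_t cycle,
                                   const std::vector<UnitState> &units) {
    assert(std::is_sorted(units.begin(), units.end(),
                          [](const UnitState &a, const UnitState &b) { return a.id < b.id; }));
    std::uint64_t hash = 14695981039346656037ull; // FNV-1a 64-bit offset basis
    hash = HashU32(hash, cycle);
    for (const UnitState &u : units) {
        hash = HashU32(hash, static_cast<std::uint32_t>(u.id));
        hash = HashU32(hash, static_cast<std::uint32_t>(u.tileX));
        hash = HashU32(hash, static_cast<std::uint32_t>(u.tileY));
        hash = HashU32(hash, static_cast<std::uint32_t>(u.offsetX));
        hash = HashU32(hash, static_cast<std::uint32_t>(u.offsetY));
    }
    return hash;
}
```

Hashing only integer fields sidesteps the floating-point question; if floats were part of the state, their bit patterns would have to be hashed as well.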
Does stratagus have a log of what is sent to the net? The -i -p are about logging complete game state and this is different from what goes to the network, right?
timfel's beautiful logs from yesterday differ from my logs. I guess they are not from -i -p?
@mimooh
I don't think chunkparticle.cpp and the other particle code are affecting the network - AFAIK what they do is purely graphical. And Wargus doesn't (last I saw) even make use of the particles; among the Stratagus games, only Doom Wars uses them.
What you mentioned of hashing the entire relevant state sounds useful, though I'm not sure where to start if we were to do that to Stratagus.
@timfel Ah, I see, I thought there were floats in the movement calculation. By the way, there is a bug in Stratagus which affects unit facing (it doesn't round the unit facing properly), which I fixed for Wyrmsun. Do you think it could be related to the desync issue?
Another thing: I just noticed that for Wyrmsun I added the following code in CreateGame (in game.cpp), below "InitPlayers();":
if (IsNetworkGame()) { // if is a network game, it is necessary to reinitialize the syncrand variables before beginning to load the map, due to random map generation
SyncHash = 0;
InitSyncRand();
}
Wargus doesn't have random map generation, but maybe it has something that is randomized before starting a game?
Regarding what I mentioned previously about the bug in facing with Stratagus - the direction is sometimes set to a value that is not a multiple of 32 (256 / 8 = 32, since there are 8 directions). This causes some weirdness in the game. I fixed that by adding the following code at the beginning of UnitUpdateHeading (in unit.cpp):
//fix direction if it does not correspond to one of the defined directions
int num_dir = std::max<int>(8, unit.Type->NumDirections);
if (unit.Direction % (256 / num_dir) != 0) {
unit.Direction = unit.Direction - (unit.Direction % (256 / num_dir));
}
In UnitHeadingFromDeltaXY (in unit.cpp as well), I also replaced "unit.Direction = DirectionToHeading(delta);" with the following code (but it looks superfluous, I probably added this here before implementing the previous fix):
int num_dir = std::max<int>(8, unit.Type->NumDirections);
int heading = DirectionToHeading(delta) + ((256 / num_dir) / 2);
if (heading % (256 / num_dir) != 0) {
heading = heading - (heading % (256 / num_dir));
}
unit.Direction = heading;
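To make the arithmetic of the two snippets above concrete, here is a small standalone sketch. `SnapDirectionDown` and `SnapDirectionNearest` are hypothetical helper names, not functions from unit.cpp, and the final wrap into the range 0..255 is an addition of this sketch that the snippets above do not contain.

```cpp
#include <cassert>

// Snap a direction (0..255) down to a multiple of the step size,
// mirroring the modulo fix shown for UnitUpdateHeading.
inline int SnapDirectionDown(int direction, int numDirections) {
    assert(direction >= 0);
    assert(numDirections > 0 && 256 % numDirections == 0);
    const int step = 256 / numDirections; // 32 for 8 directions
    return direction - (direction % step);
}

// Round to the nearest defined direction by adding half a step first,
// as in the UnitHeadingFromDeltaXY variant. The "% 256" wrap is an
// assumption of this sketch.
inline int SnapDirectionNearest(int direction, int numDirections) {
    const int step = 256 / numDirections;
    return SnapDirectionDown(direction + step / 2, numDirections) % 256;
}
```

With 8 directions the step is 32, so a direction of 45 snaps down to 32, while rounding to nearest maps 45 to 32 and 50 to 64.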
@Andrettin sounds like these fixes should be in Stratagus, too. I can check if they help when I get home in ~10hs
Nope, that didn't fix it
When it comes to AI harvesting perhaps some inspirations can be found here: http://www.codeofhonor.com/blog/the-starcraft-path-finding-hack
Also, if it's not floats in stratagus then the other suspicious thing is uninitialized variables: http://www.gamasutra.com/view/news/126022/Opinion_Synchronous_RTS_Engines_And_A_Tale_of_Desyncs.php
Ok, should've used valgrind more thoroughly, rather than just static analysis tools.
==8964== Source and destination overlap in memcpy(0x1953c098, 0x1953c0a0, 232)
==8964== at 0x4C2F71C: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8964== by 0x5EBE1C: AStarReplaceNode(int) (astar.cpp:467)
==8964== by 0x5ECE8D: AStarFindPath(Vec2T<short> const&, Vec2T<short> const&, int, int, int, int, int, int, char*, int, CUnit const&) (astar.cpp:1068)
==8964== by 0x5EE8D7: NewPath(PathFinderInput&, PathFinderOutput&) (pathfinder.cpp:340)
==8964== by 0x5EEBF7: NextPathElement(CUnit&, short*, short*) (pathfinder.cpp:395)
==8964== by 0x527BAD: DoActionMove(CUnit&) (action_move.cpp:150)
==8964== by 0x52BAE4: COrder_Resource::MoveToResource_Terrain(CUnit&) (action_resource.cpp:375)
==8964== by 0x52BD99: COrder_Resource::MoveToResource(CUnit&) (action_resource.cpp:454)
==8964== by 0x52E67B: COrder_Resource::Execute(CUnit&) (action_resource.cpp:1263)
==8964== by 0x5370BD: HandleUnitAction(CUnit&) (actions.cpp:401)
==8964== by 0x53774A: void UnitActionsEachCycle<__gnu_cxx::__normal_iterator<CUnit**, std::vector<CUnit*, std::allocator<CUnit*> > > >(__gnu_cxx::__normal_iterator<CUnit**, std::vector<CUnit*, std::allocator<CUnit*> > >, __gnu_cxx::__normal_iterator<CUnit**, std::vector<CUnit*, std::allocator<CUnit*> > >) (actions.cpp:492)
==8964== by 0x53743F: UnitActions() (actions.cpp:539)
After fixing this, I no longer get desyncs on my test map. The change is simply this:
- //memmove(&OpenSet[pos], &OpenSet[pos+1], sizeof(Open) * (OpenSetSize-pos));
- memcpy(&OpenSet[pos], &OpenSet[pos + 1], sizeof(Open) * (OpenSetSize - pos));
+ memmove(&OpenSet[pos], &OpenSet[pos+1], sizeof(Open) * (OpenSetSize-pos));
The previous commit that commented out memmove and replaced it with memcpy was in 2008: 5b6bc1aee657fe3e6468a16fa6fc72022194b7b0. Before that, the memmove has been in there since 2006: db93e5551df933631d0f25f9e8967cc751671748
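For readers unfamiliar with the difference: `memcpy` has undefined behavior when source and destination overlap, while `memmove` is specified to handle overlap. Removing an element from the middle of an array by shifting the tail left is exactly such an overlapping copy. The sketch below shows the pattern in isolation; `OpenNode` and `RemoveAt` are hypothetical stand-ins, and the element count used here belongs to this sketch and is not taken from astar.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an open-set entry of the pathfinder.
struct OpenNode {
    int pos;
    int cost;
};

// Remove the element at `index` from the first `size` entries of `nodes`
// by shifting the tail one slot to the left. Returns the new size.
inline std::size_t RemoveAt(OpenNode *nodes, std::size_t size, std::size_t index) {
    assert(index < size);
    // Source and destination overlap whenever more than one element is
    // moved, so memmove is required; memcpy would be undefined behavior.
    std::memmove(&nodes[index], &nodes[index + 1],
                 sizeof(OpenNode) * (size - index - 1));
    return size - 1;
}
```

Because the result of an overlapping `memcpy` depends on how the C library happens to implement the copy, two machines can end up with different open sets, which would be consistent with the differing paths in the logs earlier in this thread.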
So, @mimooh, I guess I'd be glad for you to try the branch again and see if it still desyncs for you.
@timfel that's great, thank you very much! :)
It turns out this fix was already in Wyrmgus, which is why I wasn't experiencing the bug with Wyrmsun. For reference, here's the commit: https://github.com/Andrettin/Wyrmgus/commit/ddd1db42e913cc7df68828469680d5b153132dbe#diff-8fbd44e1a5fa0d06d4e7834c3fb46d60
The two of you are just wizards! :) Thank you very much and I hope to have it tested later tonight.
Karol Kreński
Not a single desync in a heavy localhost game! There were lots of desyncs previously. Will try a Windows vs Linux game against a friend, but it looks very promising :) Thanks again for your great work!
Can we have the latest network fixes merged into upstream? I'd like to have it built for windows for the testing.
Simply amazing!
Is it working on windows?
@DinkyDyeAussie: yes, the changes are merged to the official version, so you can try the exe at the bottom of this page: https://github.com/Wargus/stratagus
Let us know if you have any desyncs. I will have it tested in a windows against linux game this week hopefully.
will test on linux armv7 versus linux amd64
Can I run a remote server (over ssh perhaps) and then have me and a friend connect to it? There's a 'dedicated' option, which sounds like it, but it talks about some AI clients. I'd like to play a human vs human game.
@mimooh It's not really dedicated, it will run as an AI player. There is no human vs human dedicated server option.
@timfel Speaking of servers, do you know what a metaserver is, or how such a thing works? Stratagus has unfinished metaserver code, which could be further developed later on.
@Andrettin I have no idea how the metaserver was supposed to work, I haven't looked at that code
@timfel But do you know what a "metaserver" even is? I have no idea, and I couldn't find much information on it on the internet either :(
It is a server for listing games and allowing people to play together as I understand it. network.cpp relates metaserver to bnetd, which is a battle.net-like software.
The game works just perfectly over the network. Today we had two 4-player games under Ubuntu, which were not disturbed by out-of-sync problems, not even once. So I'd say the issue is pretty much fixed. Thanks once more for sorting this issue out! :)
Firstly thanks so much to @timfel @Andrettin and @mimooh for getting this sorted. Really really appreciated by me :video_game:
You say that the download link is at the bottom of the main page of the repository. When I try that, it takes me to https://github.com/Wargus/stratagus/releases which says that the "master-build" release hasn't been modified since 30 November, so presumably this can't be it - can it?
Also what about the Ubuntu release - on https://launchpad.net/~stratagus/+archive/ubuntu/ppa the latest version of the wargus package shown is 2.4.0-0~1968~ubuntu15.10.1 which is dated 8th January - so this too surely can't be the right one (EDIT: I see that the stratagus package has been updated and since wargus depends on stratagus presumably that update will get installed automatically)
Thanks again
@elicoten, as for Ubuntu, the change was in stratagus only - the wargus package was fine all the time. As for Windows, I guess wargus displays wrong dates, because I installed wargus from there and the bug which I reported in January 2016 was gone.
I've addressed three more of the above that were specific to Wargus. The preparation room and map quality thing I'm ignoring for now. Adding a few well-made high-quality maps could be a good task for someone who likes making maps :)
(reported as Bug #1518749 by Karol Krenski - https://bugs.launchpad.net/stratagus/+bug/1518749)
I find bugs in the game so regularly that narrowing them down seems pointless to me. Some of the bugs were already reported by me and by other people. I can reproduce these bugs under various Linux setups with ease. I play wargus under various Ubuntu versions, for which I compile fresh stratagus/wargus myself. I could attach logs (wargus -p -i), but since the bugs seem so easily reproducible and obvious, I think there's no point. I will just enumerate the topics:
Look, I really hope stratagus/wargus will work great someday. I love the features, the high resolution and that it's opensource. I could try the regular warcraft from Blizzard, but I still choose to stay with wargus, despite the bugs. My kids and my old friends for LAN parties - we all hope that someday we can play undisturbed network wargus games.