Wargus / stratagus

The Stratagus strategy game engine
GNU General Public License v2.0
615 stars 115 forks source link

bugs, bugs, bugs #150

Closed timfel closed 8 years ago

timfel commented 8 years ago

(reported as Bug #1518749 by Karol Krenski - https://bugs.launchpad.net/stratagus/+bug/1518749)

I find regular bugs in the game, that narrowing them seems pointless to me. Some of the bugs were already reported by me and by other people. I can produce these bugs in various linux conditions with ease. I play wargus under various ubuntu versions which I compile the fresh stratagus/wargus myself. I could attach logs (wargus -p -i), but since the bugs seem so easily reproducable and obvious, I think there's no point. I will just enumerate the topics:

Look, I really hope stratagus/wargus will work great someday. I love the features, the high resolution and that it's opensource. I could try the regular warcraft from Blizzard, but I still choose to stay with wargus, despite the bugs. My kids and my old friends for LAN parties - we all hope that someday we can play undisturbed network wargus games.

mimooh commented 8 years ago

It crashes :(

On 03/02/2016, mizukami999 notifications@github.com wrote:

And the session crashes? )


Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-179423747

mimooh commented 8 years ago

Perhaps "out of sync" is linux specific only. Could someone check it under windows? Just "out of sync" is not what crashes the game. More and more desyncs seem to result in "player sent bad command", which is fatal and then at least some players must be dropped and the game can continue as I can recall.

mizukami999 commented 8 years ago

Only use it on Windows. I'm no programmer, but I wonder if there's no easy procedure to cure that. I don't know - pause the game and resync everything.

I'll try to test what happens if I lower max unit count in wargus/scripts/stratagus.lua

mimooh commented 8 years ago

Only use it on Windows. I'm no programmer, but I wonder if there's no proper procedure to cure that. I don't know - pause the game and resync everything. We tried that. Once there were massive desyncs we would stop touching the mouse, but it wouldn't prevent the crash.

I'll try to test what happens if I lower max unit count in wargus/scripts/stratagus.lua Yes, that should lower the desyncing. I have done the opposite: created a massive map and managed to crash the game quickly (5 minutes perhaps).

Karol Kreński


Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-180797312

mizukami999 commented 8 years ago

Manually is one thing. What I meant was whether it's hard to fix the network code so that whenever the session starts desyncing, the game just pauses itself and resyncs. Like, reloads the units completely and stuff.

mimooh commented 8 years ago

Manually is one thing. What I meant was whether it's hard to fix the network code so that wherever the session starts desyncing, the game just pauses itself and resyncs. Like, reloads the units completely and stuff. Wow, interesting! :) I don't speak C, so perhaps someone knowledgable could comment on that. From a player's perspective I would love that resolution - minor alternation in my units features shouldn't matter much.

Karol Kreński


Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-180805106

mizukami999 commented 8 years ago

Yeah! I mean, how great it would be, like, to meet a C speaker, who's, like, "fix a sync problem? - pff. I can fix a thousand sync problems, like, in a minute!"

timfel commented 8 years ago

The problem is that none of the people who wrote that network code are still actively working on Stratagus. Understanding it completely is not much easier than re-writing it from scratch, especially since the original design was heavily driven by the desire to support even 28.8 kbit/s modems with minimal lag. Today, this really isn't a requirement anymore, so many things could be simplified. It's just a question of time. And then the other problem is that I cannot even debug this, since I cannot reproduce it :(

Andrettin commented 8 years ago

The problem is that none of the people who wrote that network code are still actively working on Stratagus. Understanding it completely is not much easier than re-writing it from scratch, especially since the original design was heavily driven by the desire to support even 28.8 kbit/s modems with minimal lag. Today, this really isn't a requirement anymore, so many things could be simplified. It's just a question of time. And then the other problem is that I cannot even debug this, since I cannot reproduce it :(

Yeah, specially not being able to reproduce it is a very big hurdle. I cannot reproduce it either, and neither could Kyran and cybermind. The bug seems to affect some people nearly every time they play, and others not at all. Additionally, it could be a Lua issue with Wargus, rather than having to do with Stratagus' C++ code (since for Wyrmsun no one has reported the "player sent bad command" issue since I fixed the problem with the chat messages).

One thing that could help though is making the "DebugPrint" calls which indicate errors print the same text to stderr.txt as well, so that when mimooh and others play and get the issue, it is possible that there will be some indication of what the error was in stderr.txt.

mimooh commented 8 years ago

not much easier than re-writing it from scratch, especially since the original design was heavily driven by the desire to support even 28.8 kbit/s modems with minimal lag. Today, this really isn't a requirement anymore, so many things could be simplified. If rewriting would be considered I once again mention this raknet thing.

It's just a question of time. That sounds optimistic :)

And then the other problem is that I cannot even debug this, since I cannot reproduce it :( If you tried and haven't got affected then maybe it is OS related? Perhaps it's not that much broken - developers say they have thoroughly tested the network gameplay and it worked for them. I tried under various ubuntu versions over last 3 years or so - it crashes. Could it be the problem of the version of warcraft2 dataset? I can privately share my warcraft2 copy.

Karol Kreński

--- Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-180891778

mimooh commented 8 years ago

Yeah, specially not being able to reproduce it is a very big hurdle. I cannot reproduce it either, and neither could Kyran and cybermind. The bug seems to affect some people nearly every time they play, and others not at all. Additionally, it could be a Lua issue with Wargus, Wow, I sort of thought that the desyncs affect the majority of users and they are just hard to fix. Fist thing is I should not discourage new users by saying "out of sync" is always there - you should have told me that! :) The other thing is that I can help with narrowing the conditions for the game to desync. Specially thru this "DebugPrint" mechnism that you mention. You think trying various warcraft2 datasets would make sense too?

Karol Kreński

timfel commented 8 years ago

This didn't leave me any peace, and now I was able to reproduce it with Warcraft 1 dataset (running a network game with ~200 units, 12 players, and 256x256 map between mixed Ubuntu & Windows computers). Anyway, I pushed a small change to force a Network recovery when this happens, and then only print the error message if it happens again, and otherwise report if the network is back in sync on the next message. And indeed, I now get alternating "Recovering" ... "Back in sync" messages. This might allow the game to continue, but the problem is definitely deeper - almost every second command triggers an out-of-sync, pretty sure that's not the way it's intended to be.

timfel commented 8 years ago

@Andrettin can you take a look at https://github.com/Wargus/stratagus/commit/7ee413e51faa6a785386b0b7112d45d8174276cc and tell me if that may be reasonable? Setting the NetworkInSync to false triggers a call to NetworkRecover on the next game loop.

mimooh commented 8 years ago

This didn't leave me any peace, and now I was able to reproduce it with Warcraft 1 dataset (running a network game with ~200 units, 12 players, and 256x256 map between mixed Ubuntu & Windows computers). Hah, now you see yourself! Tim, you know you will be the superhero if you fix this! :)

Karol Kreński

mimooh commented 8 years ago

And indeed, I now get alternating "Recovering" ... "Back in sync" messages. This might allow the game to continue, but the problem is definitely deeper - almost every second command triggers an out-of-sync, pretty sure that's not the way it's intended to be.

You mean: "how come that right after 'Back in sync' the game is immediately out of sync on the next game loop again", right? This would indicate that NetworkRecover doesn't really recover at all. Can't wait to find out what you found there :)

timfel commented 8 years ago

Ok, after a few more experiments, I can now reliably trigger this within the first 200 cycles even on a localhost connection. Both the syncseed and the synchash diverge. The seed seems to not do much other than test that all clients have the same overflow behavior, but it looks like it actually relies on undefined behavior and I believe it will have problems if you use different word sizes or compilers on different platforms. This part is easily fixed. The sync hash encodes actions taken over time, and right now it diverges even if no human player is present, which might indicate the AI has some non-determinism. Indeed, I was able to trigger some random behavior in worker selection for harvesting, which (from what I can tell) causes the sync hash to diverge immediately. This might not be the only thing. So, to fix this properly, a more thorough code review might be in order.

mimooh commented 8 years ago

Tim, it's a huge step forward. Thanks for you contribution!

mizukami999 commented 8 years ago

I was able to trigger some random behavior in worker selection for harvesting, which (from what I can tell) causes the sync hash to diverge immediately.

I came up with an experiment idea: start a game with 2 players, Reveal map: on, fog: off. populate map with peasants, make them harvest lots of stuff. Record video of player a's peasants harvesting with fraps a) on the local machine b) on a remote machine

then compare the videos to see if at a certain point peasant positions are no longer the same, and how early.

IF the problem is with AI behaviour, then it can be recreated without too many units on the map - only a small but crowded part of the map is needed

timfel commented 8 years ago

That's the kind of map I'm using right now - except that all the peasants are AI controlled, and the two human players don't do anything. It takes about 10 seconds to visibly see different peasants moving towards harvesting☺

timfel commented 8 years ago

And since (afaict) the syncHash relies on the idea that the AI deterministically executes the same on all machines, this is would definitely be a problem.

mizukami999 commented 8 years ago

FACEPALM! That's awful. I guess the normal procedure would be to have the server decide how the units move, and just inform all other machines on what really happens. Only the server is supposed to decide what actally happens. Only the server moves the units...

Clients may simulate unit movement for smooth performance, but then the update cometh, and the client corrects itself the peasant teleports to where he's supposed to be...

This reminds me of how I once had a nice Operation Flashpoint "game of friendship", where I saved my truck convoy on my machine, and my pal destroyed it on his. We both had screenshots as proof.

timfel commented 8 years ago

@mizukami999 well, it's not necessarily a bad idea, the nice thing about this architecture is that the server can crash or leave and we don't have to have code to re-negotiate who becomes the new server.

mizukami999 commented 8 years ago

It's true for advanced projects. But for something that keeps crashing because it can't synchronize units since multiplayer was introduced back in 1812... what people demand is order, clarity and purely totalitarian approach)

mizukami999 commented 8 years ago

Allowing multiplayer savegames could be an even better fix for server crashes...

mimooh commented 8 years ago

In case it matters: games between humans only (no AI) also desync.

Karol Kreński

And since (afaict) the syncHash relies on the idea that the AI deterministically executes the same on all machines, this is would definitely be a problem.


Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-181235345

mizukami999 commented 8 years ago

That's because AI is also about pathfinding: peasants colliding into one another, finding themselves in parallel universes.

timfel commented 8 years ago

It may also be because of the random seed. It would be useful to get a log for such a game with no AI that desyncs (start all clients with -p)

mimooh commented 8 years ago

I guess the normal procedure would be to have the server decide how the units move, and just inform all other machines on what really happens. Only the server is supposed to decide what actally happens. Only the server moves the units... The design is described in the beginning of network.cpp. If this is up to date then I'd say there's no central server in-game - the clients communicate directly with each other. I could do with network traffic workflow debugging if that would be required - if we want to know for sure if there's p2p.

\ @subsection sc_vs_p2p server/client vs. peer to peer \ @li server to client * The player input is send to the server. The server collects the input * of all players and than send the commands to all clients. \ @li peer to peer (p2p) \ The player input is direct send to all others clients in game. * p2p has the advantage of a smaller lag, but needs a higher bandwidth * by the clients. * p2p has been chosen for in-game. * s/c for the preparing room.

@Tim, concerning the hashes. So is it that each client has it's game state and then runs like md5() (some other function here) on this game state which results in a hash of sort A01BFF2138 that I can see in log files. And then these hashes are compared amongst each other and this is how we know the game states differ? Then we say the game desyncs, right?

Karol Kreński

--- Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-181241078

mimooh commented 8 years ago

It would be useful to get a log for such a game with no AI that desyncs (start all clients with -p) Higher in this thread (13 Dec 2015) you have exactly such a log of mine: http://www.inf.sgsp.edu.pl/~mimooh/outOfSync_13.dec.zip

Karol Kreński

timfel commented 8 years ago

@mimooh That log does not contain any desync messages. These messages look like so:

Network out of sync fabbd813!=fabcd712! 1278312038!=3820182023! Cycle 424
timfel commented 8 years ago

As for the hashes, the hash is basically what you describe, but we also generate a desync message if the RandomSeed is out of sync - these are different. The random seed is a pseudo-random generated number that starts off in the same state for all machines and is updated each time it is used (such as to determine unit directions, movement of spells like whirlwind, ...). If the hash goes out of sync, it indicates that the clients are executing different unit actions or have different units, if the random seed goes out of sync, it indicates that the clients use different randomness (for example, one uses randomness in their lua scripts while the other doesn't). Both are a problem. But the way it is, random seed can already go out of sync simply because of undefined behavior and different word sizes, so that is a bug in the engine.

timfel commented 8 years ago

I am working on a branch (tim/network-issues) to check these issues out. If you feel adventurous, you can try to compile and use it yourself. But I'm just poking at various parts of the code right now to see what happens, so no promises.

mimooh commented 8 years ago

In the evening I could create such logs. Strange, I thought I used the -p switch...

Karol Kreński

@mimooh That log does not contain any desync messages. These messages look like so:

Network out of sync fabbd813!=fabcd712! 1278312038!=3820182023! Cycle 424

Reply to this email directly or view it on GitHub: https://github.com/Wargus/stratagus/issues/150#issuecomment-181293400

mimooh commented 8 years ago

like whirlwind, ...). If the hash goes out of sync, it indicates that the clients are executing different unit actions or have different units, if the random seed goes out of sync, it indicates that the clients use different randomness (for example, one uses randomness in I admire how you managed to figure out the design from just reading the code. Out of curosity and based on these hints I will have a look (just read) at the code. Thanks Tim.

Karol Kreński

Andrettin commented 8 years ago

The random seed is a pseudo-random generated number that starts off in the same state for all machines and is updated each time it is used (such as to determine unit directions, movement of spells like whirlwind, ...).

I wonder - that could be the reason the issue hasn't happened with Wyrmsun, since it doesn't have a whirlwind-like spell.

timfel commented 8 years ago

@Andrettin the issue happens in my test map that only has workers within the first 300 cycles. No spells are being cast.

Andrettin commented 8 years ago

Ah yes :(

mizukami999 commented 8 years ago

whirlwind lol one needs to make this spell generate a stop error screen on enemy machine

mimooh commented 8 years ago

whirlwind lol one needs to make this spell generate a stop error screen on enemy machine Hahaha! Indeed :)

Karol Kreński

mimooh commented 8 years ago

network.cpp: * @li peer to peer (p2p) * The player input is direct send to all others clients in game. \ p2p has been chosen for in-game.

Whatever they mean by the above, based on the network traffic I'd say the clients never comunicate with each other. I ran a 3 players game (server + 2 clients) and all the traffic goes through the server.

(...) But the way it is, random seed can already go out of sync simply because of undefined behavior and different word sizes, so that is a bug in the engine. @Tim, surely that's reasonable, but I am probably affected by hash out of sync, since I test on 16 identical machines (hardware and software).

timfel commented 8 years ago

@mimooh I tested via localhost connection on the same machine running the exact same binary and randomseed went out of sync... ;)

mimooh commented 8 years ago

Aaah, I see. You think your latest fixes under tim/network-issues may be good enough already? You think I should try and break the session as usual? :) Or it's too early for testing?

timfel commented 8 years ago

I have not tested the last code change I pushed, and have no time to work on it right now

mimooh commented 8 years ago

Ahhh, now I get all this... looks like stratagus follows the ideas described on these two, very interesting sites:

  1. http://gafferongames.com/networking-for-game-programmers/what-every-programmer-needs-to-know-about-game-networking/
  2. http://www.gamasutra.com/view/feature/131503/1500_archers_on_a_288_network_.php?page=1

I was more familiar with how quake networking works, but RTS games are different, because game states are too huge to be sent over the network, so:

a) all clients start from very same state b) all clients send their actions to the network c) all clients get other clients actions and then they just repeat these actions locally and all clients should end up in the very same game state.

It is a p2p in a sense, because there's no central place where the global game state is stored.

Unfortunatelly, out-of-sync is very challenging:

"At first take it might seem that getting two pieces of identical code to run the same should be fairly easy and straightforward -- not so. The Microsoft product manager, Tim Znamenacek, told Mark early on, "In every project, there is one stubborn bug that goes all the way to the wire -- I think out-of-sync is going to be it." He was right. The difficulty with finding out-of-sync errors is that very subtle differences would multiply over time. A deer slightly out of alignment when the random map was created would forage slightly differently -- and minutes later a villager would path a tiny bit off, or miss with his spear and take home no meat. So what showed up as a checksum difference as different food amounts had a cause that was sometimes puzzling to trace back to the original cause.

As much as we check-summed the world, the objects, the pathfinding, targeting and every other system -- it seemed that there was always one more thing that slipped just under the radar."

and

"Nobody on the Age of Empires development team would argue the need for the best sync tools possible. As with any project, when you look back on the development process during a postmortem, some areas always stand out as the ones you spent the most time on, but could have spent much less time on given more up-front work. Synchronization debugging was probably at the top of this list as we started development on RTS3."

mimooh commented 8 years ago

"The random seed is a pseudo-random generated number that starts off in the same state for all machines and is updated each time it is used (such as to determine unit directions, movement of spells like whirlwind, ...)"

If we removed the randomness, just for a while if we made it a constant or always increment by 1 (if it cannot be constant), wouldn't it's only effect be the loss of some little randomness in these unit directions and whirlwind etc.?

Andrettin commented 8 years ago

If we removed the randomness, just for a while if we made it a constant or always increment by 1 (if it cannot be constant), wouldn't it's only effect be the loss of some little randomness in these unit directions and whirlwind etc.?

The randomness is vital - it determines the quantity of damage dealt, for instance. In any case, even if it were removed, that wouldn't really fix the problem - the event that triggered the discrepancy in the random seed would still occur - we just would have more trouble identifying it.

mizukami999 commented 8 years ago

If I have authority to move my units then i must inform everyone else on where my units are. And they must be updated on remote machines. Otherwise I only send my commands, like "i wish this peasant could go there". Someone who knows better responds, like: sorry to infowm you, but your peasant has been dead for a minute already. Either I execute randomness and inform others on results, or the server does it. I don't think there is any democracy in this procedure. You can simulate whatever you want on your machine for smooth performance, but either you update the others on what really happens with your units, or the server tells you that. Someone has to say: no, this unit isn't supposed to be here.

mimooh commented 8 years ago

@Andrettin, I went thru lot's of reading about RTS networking since yesterday. I hope to finally see the big picture, I learnt about the main issues.

What is the least clear to me is: what the hell is the purpose of randomness in the first place? Seems like it just enhancement in the game so that you don't get bored - randomness prevents you from winning against the AI in the very same clicks over and over, because the damage you make to other units, because the direction of whirlwinds and perhaps other details are random. Or is randomness serving some other purposes?

My argument is: Perhaps it makes sense in human vs AI. But in human vs human multiplayer game there's enough randomness from just the fact that we probably won't have exactly the same game again - that would be true even for chess - you and your opponent's moves would produce a different game 99% of the time.

So we could just try to simplify the game and remove randomness and see how players like it and do they see the change at all in human vs human games.

If, however we need the randomness, then we invite the low level scarry monsters. All out-of-sync hell seems to result from the need to produce exactly the same randomness on each participant computer.

I see this code in stratagus:

float radians = deg2rad(MyRand() % 360);

floats... And then I read the comments of IT experts at http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/

"People even report different results on the same machine from run to run, and between debug and release builds. Other folks say that AMDs give different results to Intel machines, and that SSE results are different from x87."

"Many programmers may not realize that even a program that uses only the numeric formats and operations prescribed by the IEEE standard can compute different results on different systems."

"So if you’re trying to floating point determinism; stay away from the intrinsic hardware calls like (_mm_rsqrt_ss), they’re not the same between Intel and AMD."

"It’s probably time to stop depending on floats to be deterministic on PCs, though. No new features on 64 bit CPUs are deterministic across models / manufacturers. All hail the rise of cross platform deterministic int based game and physics engines!"

"We dealt with this back in 1996 (...) After several attempts to get the then-current version of MSVC to behave according to IEEE 754 and match the PPC version, we moved to fixed-point."

As you can see some people propose switching from float to fixed point. Which only opens another bag of issues, e.g. you need to implement sqrt(), sin() and resolve divide by 0 and such.

What I liked the most and found only in one place is an interesting proposition: just hardcode some values in a list and then see about taking from that list. Because, hey, how many different angles do you need for the whirlwind direction?

There is the second category of out-of-sync:

"Desyncs are usually programmer error (...) of the most common cases is an uninitialized variable."

and then "A Demigod Tale" case study at http://www.gamasutra.com/view/news/126022/Opinion_Synchronous_RTS_Engines_And_A_Tale_of_Desyncs.php

a-detiste commented 8 years ago

this thread is awesome

randomness is part of the fun, it's about taking a bit of risk etc...

the only thing I like even more than Warcraft II is "Advance Wars 2" and that would be boring without some random hit-points

http://jmtd.net/log/deterministic_doom/

for the radian problem: maybe rounding those to closest integer would be ok ?

Andrettin commented 8 years ago

I see this code in stratagus: float radians = deg2rad(MyRand() % 360);

The use of "MyRand()" here really drew my attention - this code isn't using SyncRand(), which is the proper function for synchronized randomness. I did a search for MyRand() and found that it is used for unit directions instead of SyncRand()... This will result in units in different computers having different directions. This could be a desync cause, if the directions are influencing anything which affects gameplay.