andrei-drexler / ironwail

High-performance QuakeSpasm fork
GNU General Public License v2.0
481 stars 43 forks source link

Intermittent lockup when loading a game saved at start of e1m6 #242

Open matthewearl opened 10 months ago

matthewearl commented 10 months ago

Bug repro

  1. Start ironwail, map e1m6
  2. Run restart, the game will pause on the first frame with the console dropped.
  3. Run save t to save a game at this point.
  4. Bind x to load t.
  5. Raise console.
  6. Press x when clock times to 2 seconds.

Repeat step 6 until game freezes. I normally hit the issue within about a minute or so.

The stack trace on freeze looks something like this:

(gdb) bt
#0  SV_AreaTriggerEdicts (ent=0x55555a74f510, node=0x555556056768 <sv_areanodes+840>, list=0x7fffffffd360, listcount=0x7fffffffdb78, listspace=255) at world.c:307
#1  0x0000555555624ebf in SV_AreaTriggerEdicts (ent=0x55555a74f510, node=0x5555560566f8 <sv_areanodes+728>, list=0x7fffffffd360, listcount=0x7fffffffdb78, listspace=255) at world.c:323
#2  0x0000555555624ebf in SV_AreaTriggerEdicts (ent=0x55555a74f510, node=0x555556056618 <sv_areanodes+504>, list=0x7fffffffd360, listcount=0x7fffffffdb78, listspace=255) at world.c:323
#3  0x0000555555624ebf in SV_AreaTriggerEdicts (ent=0x55555a74f510, node=0x555556056458 <sv_areanodes+56>, list=0x7fffffffd360, listcount=0x7fffffffdb78, listspace=255) at world.c:323
#4  0x0000555555624e78 in SV_AreaTriggerEdicts (ent=0x55555a74f510, node=0x555556056420 <sv_areanodes>, list=0x7fffffffd360, listcount=0x7fffffffdb78, listspace=255) at world.c:321
#5  0x0000555555624fb8 in SV_TouchLinks (ent=0x55555a74f510) at world.c:346
#6  0x000055555562595b in SV_LinkEdict (ent=0x55555a74f510, touch_triggers=true) at world.c:532
#7  0x0000555555622048 in SV_Physics () at sv_phys.c:1204
#8  0x00005555555f6e40 in Host_ServerFrame () at host.c:910
#9  0x00005555555f7d0e in _Host_Frame (time=0.0039993420068640262) at host.c:1184
#10 0x00005555555f809e in Host_Frame (time=0.0039993420068640262) at host.c:1267
#11 0x000055555562a888 in main (argc=3, argv=0x7fffffffde18) at main_sdl.c:210

Debugging

This diff prints out when entities are linked / unlinked from the area tree. When the bug is reproduced, some entity slots have two links without an intervening unlink, for example:

1.100000: linking 253
1.100000: unlinking 253
1.100000: linking 253
1.100000: unlinking 253
1.100000: linking 253
1.341014: linking 253
1.341014: unlinking 253
1.341014: linking 253
1.341014: unlinking 253
1.341014: linking 253

I think this comes about from a temp entity being allocated and linked in SV_SpawnServer (during one of its two SV_Physics steps), being wiped but not unlinked by the memset in Host_Loadgame_f, and then the slot being reused as the loaded game progresses, without an unlink since the area next/prev pointers were wiped. One of the lists then becomes corrupt and the game ends up in an infinite loop when trying to iterate it.

If my theory is right this could be seen with save games not made immediately after a restart, but I haven't had luck reproducing it this way.

alexey-lysiuk commented 10 months ago

It's the same bug I reported earlier in Quakespasm repo, https://github.com/sezero/quakespasm/issues/64.

matthewearl commented 10 months ago

Thanks, I should have checked the Quakespasm repo too. I confirmed that https://github.com/sezero/quakespasm/pull/75 stops my repro from working on QS.

matthewearl commented 10 months ago

After porting your fix to ironwail, there's still a similar bug that occurs where two links without an intervening unlink occur. This time the entity is cleared in ED_Alloc:

__memset_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:284
284     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.
#0  __memset_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:284
#1  0x000055555560be53 in ED_Alloc () at pr_edict.c:217
#2  0x000055555560363a in PF_Spawn () at pr_cmds.c:1061
#3  0x0000555555614675 in PR_ExecuteProgram (fnum=472) at pr_exec.c:656
#4  0x000055555561b45c in SV_RunThink (ent=0x555558e5e720) at sv_phys.c:143
#5  0x000055555561e1ae in SV_Physics_None (ent=0x555558e5e720) at sv_phys.c:983
#6  0x000055555561e9ce in SV_Physics () at sv_phys.c:1212
#7  0x00005555555f3f16 in Host_ServerFrame () at host.c:910
#8  0x00005555555f4df8 in _Host_Frame (time=0.00099927200062666088) at host.c:1186
#9  0x00005555555f5184 in Host_Frame (time=0.00099927200062666088) at host.c:1269
#10 0x00005555556270f6 in main (argc=7, argv=0x7fffffffdcf8) at main_sdl.c:210

Here's the commit with my port of the fix: https://github.com/matthewearl/quakespasm/commit/e71d48475ecf20fca9b55ed6078125e9aca828b7

The repro steps are the same as before. Seems to work best if you let the clock run to 5 seconds before resetting. I've been running with host_timescale 10 to make it faster.

alexey-lysiuk commented 10 months ago

What was the error? EXC_BAD_ACCESS? What are the value of num_edicts and e then? The latter should point inside allocated block assigned to edicts as it's size is max_edicts * pr_edict_size while num_edicts should be always less than max_edicts.

Also, are you sure that added for loop does ED_Free() at least once when it crashes?

matthewearl commented 10 months ago

I think the problem is to do with how ironwail handles freeing objects. It appears that in normal operation num_edicts doesn't ever decrease. Calling ED_Free just adds the slot onto a free list to be used first when a new entity is allocated.

When the game loads the new loop kicks in, calls ED_Free on entity n, but then on the next line num_edicts is decreased to be equal to n. Subsequently ED_Alloc is called, and first gets entity n off the free list. Then ED_Alloc gets called a further time, but the free list is empty, so edicts[num_edicts++] == edicts[n] is returned. And so slot n ends up being linked twice.

So I think the right fix is to not set num_edicts in Host_Loadgame_f if it would cause num_edicts to decrease.

alexey-lysiuk commented 10 months ago

Again, what was the error in ED_Alloc()? No matter how broken edict linking is, it shouldn't crash on that memset() inside ED_Alloc().

matthewearl commented 10 months ago

The stack trace from before wasn't a crash, it's just the stack trace at the point where the prev/next pointers are cleared for the entity that gets linked twice without an intervening unlink. The game proceeds past this point but ultimately ends up in an infinite loop iterating one of the lists, as before.

sezero commented 10 months ago

Does the issue show itself with current quakespasm git master?

matthewearl commented 10 months ago

I can confirm that I can't reproduce the issue with https://github.com/sezero/quakespasm/commit/bd331e2192d27b803993150efeaf05d26051113c (ie. current quakespasm master)

sezero commented 10 months ago

That's great, thanks for testing!

So the additional problem you're describing at https://github.com/andrei-drexler/ironwail/issues/242#issuecomment-1678643170 is probably ironwail-specific.

matthewearl commented 10 months ago

Yes, I'm pretty sure it is ironwail specific. I naively ported the fix without realizing that num_edicts shouldn't be decreased, but I think that's fine to do in Quakespasm.