DarkPlacesEngine / DarkPlaces

The DarkPlaces Quake engine, created by LadyHavoc. Official Git repository (replaces SVN).
https://icculus.org/twilight/darkplaces/
GNU General Public License v2.0
291 stars 43 forks source link

Crash in R_DrawModelDecals_FadeEntity (numdecals becomes negative) #216

Closed andreymal closed 1 week ago

andreymal commented 2 weeks ago

Sometimes, very rarely, something strange happens in R_DrawModelDecals_FadeEntity and numdecals becomes negative, resulting in out-of-bounds.

Unfortunately, I don't know the specific conditions under which this crash can be reproduced. It appears to be random.

I first encountered this crash on frankishtower15 in Xonotic instagib, so here's how I try to reproduce this crash:

Backtrace from https://github.com/DarkPlacesEngine/DarkPlaces/commit/caa1458b53e6a960e9c642dad6baebbdb500a722:

Thread 1 "xonotic-linux64" received signal SIGSEGV, Segmentation fault.
#0  0x00005555556a5b06 in R_DrawModelDecals_FadeEntity (ent=0x55557a95f528) at ../../../gl_rmain.c:9596
#1  0x00005555556a70e5 in R_DrawModelDecals () at ../../../gl_rmain.c:9751
#2  R_RenderScene (viewfbo=viewfbo@entry=0, viewdepthtexture=viewdepthtexture@entry=0x0, viewcolortexture=viewcolortexture@entry=0x0, viewx=viewx@entry=0, viewy=viewy@entry=0, viewwidth=viewwidth@entry=640, viewheight=480) at ../../../gl_rmain.c:5934
#3  0x00005555556aa847 in R_RenderView (fbo=fbo@entry=0, depthtexture=depthtexture@entry=0x0, colortexture=colortexture@entry=0x0, x=0, y=0, width=640, height=480) at ../../../gl_rmain.c:5783
#4  0x000055555564acaa in VM_CL_R_RenderScene (prog=0x555556887f78 <prvm_prog_list+377176>) at ../../../clvm_cmds.c:4112
#5  0x000055555575267c in CLVM_ExecuteProgram (prog=0x555556887f78 <prvm_prog_list+377176>, fnum=<optimized out>, errormessage=<optimized out>) at ../../../prvm_execprogram.h:726
#6  0x000055555566c991 in CL_VM_UpdateView (frametime=0.0018036490073427558) at ../../../csprogs.c:530
#7  0x0000555555635a93 in SCR_DrawScreen () at ../../../cl_screen.c:1657
#8  0x0000555555637e88 in CL_UpdateScreen () at ../../../cl_screen.c:2335
#9  0x0000555555614d16 in CL_Frame (time=time@entry=0.0018036490073427558) at ../../../cl_main.c:2911
#10 0x00005555556b70d9 in Host_Frame (time=0.0018036490073427558) at ../../../host.c:647
#11 0x00005555557d3fab in Sys_Frame () at ../../../sys_shared.c:1160
#12 0x00005555557d45b5 in Sys_Main (argc=<optimized out>, argv=<optimized out>) at ../../../sys_shared.c:1218
#13 0x00005555555ccfdb in main (argc=<optimized out>, argv=<optimized out>) at ../../../sys_sdl.c:75

Some variables:

(gdb) print numdecals
$3 = -5494051
(gdb) print decalsystem->freedecal
$4 = 129

https://github.com/user-attachments/assets/0dd17528-0015-4b72-b1e8-0227235379c4

andreymal commented 2 weeks ago

divVerent suggested applying a patch that adds additional checks and aborts the process immediately after something starts to go wrong, so there is another backtrace.

The patch ```diff diff --git a/gl_rmain.c b/gl_rmain.c index a1afc55b..72dfe8de 100644 --- a/gl_rmain.c +++ b/gl_rmain.c @@ -9169,13 +9169,19 @@ static void R_DecalSystem_SpawnTriangle(decalsystem_t *decalsystem, const float } // grab a decal and search for another free slot for the next one + if (decalsystem->numdecals < 0) + abort(); decals = decalsystem->decals; + if (decalsystem->freedecal > decalsystem->numdecals || decalsystem->numdecals < 0) + abort(); decal = decalsystem->decals + (i = decalsystem->freedecal++); for (i = decalsystem->freedecal;i < decalsystem->numdecals && decals[i].color4f[0][3];i++) ; decalsystem->freedecal = i; if (decalsystem->numdecals <= i) decalsystem->numdecals = i + 1; + if (decalsystem->freedecal > decalsystem->numdecals || decalsystem->numdecals < 0) + abort(); // initialize the decal decal->lived = 0; @@ -9568,6 +9574,8 @@ static void R_DrawModelDecals_FadeEntity(entity_render_t *ent) frametime = 0; decalsystem->lastupdatetime = r_refdef.scene.time; numdecals = decalsystem->numdecals; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); for (i = 0, decal = decalsystem->decals;i < numdecals;i++, decal++) { @@ -9579,24 +9587,40 @@ static void R_DrawModelDecals_FadeEntity(entity_render_t *ent) memset(decal, 0, sizeof(*decal)); if (decalsystem->freedecal > i) decalsystem->freedecal = i; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); } } } decal = decalsystem->decals; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); while (numdecals > 0 && !decal[numdecals-1].color4f[0][3]) numdecals--; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); // collapse the array by shuffling the tail decals into the gaps for (;;) { while (decalsystem->freedecal < numdecals && decal[decalsystem->freedecal].color4f[0][3]) decalsystem->freedecal++; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); if (decalsystem->freedecal == numdecals) break; + if (numdecals == 0) + abort(); decal[decalsystem->freedecal] = decal[--numdecals]; + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); } + if (decalsystem->freedecal > numdecals || numdecals < 0) + abort(); decalsystem->numdecals = numdecals; + if (decalsystem->freedecal > decalsystem->numdecals || decalsystem->numdecals < 0) + abort(); if (numdecals <= 0) { ```
Patched gl_rmain.c (screenshot with line numbers) ![2024-11-11-14-35-25](https://github.com/user-attachments/assets/359ed18b-5696-4da7-bf24-821dc972ad46)
Backtrace from patched DP ``` Thread 1 "xonotic-linux64" received signal SIGABRT, Aborted. #0 __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 #1 0x00007ffff796c463 in __pthread_kill_internal (threadid=, signo=6) at pthread_kill.c:78 #2 0x00007ffff7913120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #3 0x00007ffff78fa4c3 in __GI_abort () at abort.c:79 #4 0x00005555556a5b2e in R_DrawModelDecals_FadeEntity (ent=) at ../../../gl_rmain.c:9601 #5 0x00005555556a7165 in R_DrawModelDecals () at ../../../gl_rmain.c:9775 #6 R_RenderScene (viewfbo=viewfbo@entry=0, viewdepthtexture=viewdepthtexture@entry=0x0, viewcolortexture=viewcolortexture@entry=0x0, viewx=viewx@entry=0, viewy=viewy@entry=0, viewwidth=viewwidth@entry=640, viewheight=480) at ../../../gl_rmain.c:5934 #7 0x00005555556aa8c7 in R_RenderView (fbo=fbo@entry=0, depthtexture=depthtexture@entry=0x0, colortexture=colortexture@entry=0x0, x=0, y=0, width=640, height=480) at ../../../gl_rmain.c:5783 #8 0x000055555564acaa in VM_CL_R_RenderScene (prog=0x555556887f98 ) at ../../../clvm_cmds.c:4112 #9 0x00005555557526fc in CLVM_ExecuteProgram (prog=0x555556887f98 , fnum=, errormessage=) at ../../../prvm_execprogram.h:726 #10 0x000055555566c991 in CL_VM_UpdateView (frametime=0.0048992739903042093) at ../../../csprogs.c:530 #11 0x0000555555635a93 in SCR_DrawScreen () at ../../../cl_screen.c:1657 #12 0x0000555555637e88 in CL_UpdateScreen () at ../../../cl_screen.c:2335 #13 0x0000555555614d16 in CL_Frame (time=time@entry=0.004726382001535967) at ../../../cl_main.c:2911 #14 0x00005555556b7159 in Host_Frame (time=0.004726382001535967) at ../../../host.c:647 #15 0x00005555557d402b in Sys_Frame () at ../../../sys_shared.c:1160 #16 0x00005555557d4635 in Sys_Main (argc=, argv=) at ../../../sys_shared.c:1218 #17 0x00005555555ccfdb in main (argc=, argv=) at ../../../sys_sdl.c:75 ```

https://github.com/user-attachments/assets/3c82dc9a-8028-46df-b800-913a21a71fba

divVerent commented 2 weeks ago

Removing the loop at

https://github.com/DarkPlacesEngine/DarkPlaces/blob/master/gl_rmain.c#L9586

will quite surely fix it - this loop reduces numdecals without reducing freedecal in case it steps above numdecals.

However, I don't see WHY this would happen anyway - the alpha of a decal should only ever be 0 or 1, and the only place setting it to 0 seems to be the memset above.

divVerent commented 2 weeks ago

Can you try replacing the loop

    while (numdecals > 0 && !decal[numdecals-1].color4f[0][3])
        numdecals--;

by

    while (numdecals > 0 && !decal[numdecals-1].color4f[0][3])
        {
        numdecals--;
                if (decalsystem->freedecal > numdecals)
                        abort();
        }

and if this hits, tell me:

(gdb) print numdecals
(gdb) print decalsystem->freedecal
(gdb) print decal[numdecals]
(gdb) print decal[numdecals].color4f[0][3]

Thanks!

andreymal commented 2 weeks ago

gdb says "optimized out" even with -O0 :(

divVerent commented 2 weeks ago

To all of them?

divVerent commented 2 weeks ago

Otherwise, I could offer to rewrite this code without the freedecal int, which IMHO doesn't even save performance anyway. Algorithmically simpler should be more likely to be correct.

divVerent commented 2 weeks ago

Try this.

simpler-decals.diff.txt

Replaces the current mark-and-cleanup approach of deleting decals by just immediately moving the last decal to the to be deleted spot.

andreymal commented 2 weeks ago

I always knew printf is the best debugger

(this ↓ is without the new patch yet)

decalsystem->freedecal > numdecals, aborting
ent->entitynumber = 4104
decalsystem->numdecals = 65
decalsystem->freedecal = 65
numdecals = 64
decal[64] = {
  texcoord2f[0][0] = 0.585098,
  texcoord2f[0][1] = 0.465171,
  texcoord2f[1][0] = 0.534220,
  texcoord2f[1][1] = 1.000000,
  texcoord2f[2][0] = 0.585098,
  texcoord2f[2][1] = 0.465171,
  vertex3f[0][0] = 0.534220,
  vertex3f[0][1] = 1.000000,
  vertex3f[0][2] = 0.000000,
  vertex3f[1][0] = 0.000000,
  vertex3f[1][1] = 0.000000,
  vertex3f[1][2] = 1.000000,
  vertex3f[2][0] = 0.585098,
  vertex3f[2][1] = 0.465171,
  vertex3f[2][2] = 0.534220,
  color4f[0][0] = 1.000000,
  color4f[0][1] = 0.000000,
  color4f[0][2] = 0.000000,
  color4f[0][3] = 0.000000,
  color4f[1][0] = 1.000000,
  color4f[1][1] = 0.000000,
  color4f[1][2] = 0.000000,
  color4f[1][3] = 0.000000,
  color4f[2][0] = 1.000000,
  color4f[2][1] = 0.606560,
  color4f[2][2] = 0.482234,
  color4f[2][3] = 0.553815,
  plane[0] = 1.000000,
  plane[1] = 0.606560,
  plane[2] = 0.482234,
  plane[3] = 0.553815,
  lived = 1.000000,
  triangleindex = 1040697689,
  surfaceindex = 1037559990,
  decalsequence = 1039660353,
}

debugging-printf t

andreymal commented 2 weeks ago

simpler-decals.diff seems to work well, no crashes for ~6 hours (will test longer in the next few days)

andreymal commented 1 week ago

Linux screaming BUG: unable to handle page fault for address: ffff9772752c8410 several times (I hope this is unrelated to Xonotic) prevented me from testing for longer than 20 hours straight, but simpler-decals.diff doesn't crash (also played a few matches on feris and didn't notice anything unusual)

xon-test2

divVerent commented 1 week ago

OK, I'll make that change then.

I don't quite like it, as I still don't understand why this happens, but at least the engine runs now I guess.

(I did find bugs in the old logic, yes, but none that look like they could make it crash...)

Also, these numbers:

  triangleindex = 1040697689,
  surfaceindex = 1037559990,
  decalsequence = 1039660353,

look oddly high, although that might be expected on a map by this particular author.

divVerent commented 1 week ago
decalsystem->freedecal = 65
numdecals = 64

That actually does violate an invariant - it may allocate the next decal at 65, while thinking there are only 64, so this may overflow the bounds of the decal buffer without triggering the reallocation logic in R_DecalSystem_SpawnTriangle. Not sure if that is what did happen, but 64 is a possible and likely value of decalsystem->maxdecals, skipped the allocation and wrote into something else's memory.

Not sure if that is what did happen, but it sounds like a possibility.

The numbers above could actually be color-like floating point numbers:

[rpolzer@srv04 darkplaces (git)-[divVerent/simpler-decals]-]$ perl -e 'printf "%.17g\n", unpack "f", pack "V", 1040697689'
0.13260401785373688
[rpolzer@srv04 darkplaces (git)-[divVerent/simpler-decals]-]$ perl -e 'printf "%.17g\n", unpack "f", pack "V", 1037559990'
0.10542432963848114
[rpolzer@srv04 darkplaces (git)-[divVerent/simpler-decals]-]$ perl -e 'printf "%.17g\n", unpack "f", pack "V", 1039660353'
0.12107325345277786

which speaks for this theory.

divVerent commented 1 week ago

Please try the final version of the patch on that PR. It's the same but except for the Sys_Error in case a decal has zero alpha (as this seems indeed unnecessary, it was just the old code's way to mark a decal as expired, and even if it were to happen, that decal will expire soon anyway).

andreymal commented 1 week ago

Also, these numbers:

triangleindex = 1040697689,
surfaceindex = 1037559990,
decalsequence = 1039660353,

look oddly high, although that might be expected on a map by this particular author.

I also caught one crash on Stormkeep, those numbers were similarly high

divVerent commented 1 week ago

Thanks for reporting!

And - sadly - this issue would still have happened in Rust. But we'd have gotten a clearer runtime error and been able to fix it quicker.