ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.
MIT License

fossilize_replay dumped core #117

Open kakra opened 3 years ago

kakra commented 3 years ago

I recently saw this for the first time:

# dmesg
[101136.441571] fossilize_repla[540674]: segfault at 10 ip 000055cfc73cdd74 sp 00007ffeb8737bb0 error 4 cpu 7 in fossilize_replay[55cfc73a1000+169000]
[101136.441580] Code: 00 48 c7 44 24 58 00 00 00 00 48 8d 8c c8 30 08 00 00 48 8b 80 b0 08 00 00 48 c7 44 24 60 00 00 00 00 c7 44 24 40 11 00 00 00 <48> 8b 78 10 48 8d 05 31 d0 16 00 ff 10 e9 0f ff ff ff 89 c7 e8 93

I'm not sure how to decode this. I'm also sometimes seeing similar dmesg output for Electron apps. The coredump isn't very helpful either, because debug info seems to be missing:

# sudo coredumpctl info 540674
           PID: 540674 (fossilize_repla)
           UID: 500 (kakra)
           GID: 500 (kakra)
        Signal: 11 (SEGV)
     Timestamp: Sun 2020-12-20 23:32:05 CET (2 days ago)
  Command Line: /home/kakra/.local/share/Steam/ubuntu12_32/../ubuntu12_64/fossilize_replay /home/kakra/.local/share/Steam/steamapps/shadercache/489830/fozpipelinesv4/steamapp_pipeline_cache.foz /home/kakra/.local/share/Steam/steamapps/shadercache/489830/fozpipelinesv4/steam_pipeline_cache.foz --master-process --quiet-slave --shmem-fd 82 --spirv-val --num-threads 2 --on-disk-validation-whitelist /home/kakra/.local/share/Steam/steamapps/shadercache/489830/fozpipelinesv4/steam_pipeline_cache_whitelist --device-index 0 --timeout-seconds 10 --implicit-whitelist 0
    Executable: /home/kakra/.local/share/Steam/ubuntu12_64/fossilize_replay
 Control Group: /user.slice/user-500.slice/user@500.service/app.slice/app-steam-2cad372f130f4fb1b9d86232268f33c5.scope
          Unit: user@500.service
     User Unit: app-steam-2cad372f130f4fb1b9d86232268f33c5.scope
         Slice: user-500.slice
     Owner UID: 500 (kakra)
       Boot ID: 25c3b6c46f9341b4b5d0d292e264aec4
    Machine ID: 121b87ca633e8ac0016656680000001b
      Hostname: jupiter
       Storage: /var/lib/systemd/coredump/core.fossilize_repla.500.25c3b6c46f9341b4b5d0d292e264aec4.540674.1608503525000000.zst
       Message: Process 540674 (fossilize_repla) of user 500 dumped core.
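Even without debug info, the dmesg line can be picked apart mechanically. The following is a minimal sketch, not part of Fossilize: the helper name `parse_segfault` and the regular expression are assumptions written to match the log lines in this thread. It extracts the faulting address, the instruction pointer, the error code and, when the kernel printed a mapping, the instruction pointer's offset from the start of that mapping.

```python
import re

# Hypothetical pattern, written to fit the x86 segfault lines quoted in this thread.
# The "in module[base+size]" part is optional because some lines omit it.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)"
    r"(?:.* in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line as a dict, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "fault_addr": int(m["addr"], 16),
        "ip": int(m["ip"], 16),
        "error": int(m["err"], 16),
        "module": m["module"],  # None when the kernel printed no mapping
        "offset": None,
    }
    if m["base"] is not None:
        # Offset of the faulting instruction from the start of the printed mapping.
        info["offset"] = info["ip"] - int(m["base"], 16)
    return info


line = (
    "[101136.441571] fossilize_repla[540674]: segfault at 10 ip 000055cfc73cdd74 "
    "sp 00007ffeb8737bb0 error 4 cpu 7 in fossilize_replay[55cfc73a1000+169000]"
)
info = parse_segfault(line)
print(info["module"], hex(info["offset"]), hex(info["fault_addr"]))
# prints: fossilize_replay 0x2cd74 0x10
```

The offset is relative to the mapping the kernel reported. Whether it can be passed directly to a symbolizer such as addr2line depends on how the binary is mapped and on having a build with symbols that matches the shipped one.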

Error 4 probably means a user-space read faulted on a non-present page: only the user-mode bit (PF_USER) is set, and the "protection violation" and "write" bits are clear.
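The error value in a segfault line is the x86 page-fault error code, a small bit field. As a sketch of how to read it (the function name `decode_pf_error` is made up for illustration; the bit meanings follow the x86 architecture definition that the Linux kernel uses):

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Translate a page-fault error code into a list of human-readable properties."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


print(decode_pf_error(4))
# prints: ['page not present', 'read access', 'user mode']
```

Combined with the low fault address (0x10), this is the pattern a read through a null or near-null pointer would typically produce, although the log alone does not prove that this is the cause.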

I copied the 489830 cache to my local build and am now trying to reproduce the crash with this command:

./fossilize-replay 489830/fozpipelinesv4/steamapp_pipeline_cache.foz 489830/fozpipelinesv4/steam_pipeline_cache.foz --spirv-val --on-disk-validation-whitelist 489830/fozpipelinesv4/steam_pipeline_cache_whitelist --device-index 0 --timeout-seconds 10 --implicit-whitelist 0

It resulted in the following log: https://gist.github.com/kakra/0272aa4ca003836750c18e687d6e1bf3

Should I retry with a debug build?

ryanmusante commented 1 year ago

Segfaults on 3 different CPU threads

[  +0.075551] fossilize_repla[16964]: segfault at 18 ip 0000558565ecb65e sp 00007ffd34212f60 error 4 in fossilize_replay[558565e63000+23d000] likely on CPU 5 (core 5, socket 0)
[  +0.000015] Code: 85 db 75 d8 49 8b 9f 60 02 00 00 48 85 db 74 2c 0f 1f 40 00 48 8b 73 10 48 85 f6 0f 84 83 02 00 00 49 8b 87 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 70 40 21 00 48 8b 1b 48 85 db 75 d8 49 8b 9f 98
[  +0.109306] fossilize_repla[16965]: segfault at 18 ip 0000558565ecb65e sp 00007ffd34212f60 error 4 in fossilize_replay[558565e63000+23d000] likely on CPU 13 (core 5, socket 0)
[  +0.000017] Code: 85 db 75 d8 49 8b 9f 60 02 00 00 48 85 db 74 2c 0f 1f 40 00 48 8b 73 10 48 85 f6 0f 84 83 02 00 00 49 8b 87 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 70 40 21 00 48 8b 1b 48 85 db 75 d8 49 8b 9f 98
[  +2.998189] fossilize_repla[17043]: segfault at 18 ip 000055b93cc9265e sp 00007ffdf33deeb0 error 4 in fossilize_replay[55b93cc2a000+23d000] likely on CPU 1 (core 1, socket 0)
[  +0.000019] Code: 85 db 75 d8 49 8b 9f 60 02 00 00 48 85 db 74 2c 0f 1f 40 00 48 8b 73 10 48 85 f6 0f 84 83 02 00 00 49 8b 87 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 70 40 21 00 48 8b 1b 48 85 db 75 d8 49 8b 9f 98
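
The three reports above come from different processes with different load addresses, so the raw instruction pointers differ. A quick way to check whether they hit the same place in the binary is to subtract the mapping base printed in each line. This is a sketch using values copied from the log above; the helper name `distinct_offsets` is hypothetical.

```python
# (pid, instruction pointer, mapping base) copied from the three segfault lines above.
crashes = [
    (16964, 0x0000558565ECB65E, 0x558565E63000),
    (16965, 0x0000558565ECB65E, 0x558565E63000),
    (17043, 0x000055B93CC9265E, 0x55B93CC2A000),
]


def distinct_offsets(entries):
    """Return the set of instruction offsets relative to each crash's mapping base."""
    return {ip - base for _pid, ip, base in entries}


print(sorted(hex(off) for off in distinct_offsets(crashes)))
# prints: ['0x6865e']
```

A single offset, together with the identical Code bytes, suggests that all three processes faulted on the same instruction. Without symbols it does not say which function that instruction belongs to.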

Smoukus commented 1 year ago

I got the following entries after running sudo journalctl -p err -b 0: https://gist.github.com/Smoukus/beb6099b940edad13ac08067257aec4d

The coredumps happened while I was playing Guild Wars 2 on Steam. The game didn't crash or suffer any issues; the coredumps just happened during that time.

perroboc commented 1 year ago

I get constant fossilize_repla coredumps like this. I'm not playing any game though.

Process 126010 (fossilize_repla) of user 1000 dumped core.

                Stack trace of thread 126010:
                #0  0x0000555d89b66f47 n/a (/home/user/.local/share/Steam/ubuntu12_64/fossilize_replay + 0x59f47)
                #1  0x89d8db780000555d n/a (n/a + 0x0)
                ELF object binary architecture: AMD x86-64

Is this expected?

Trayshar commented 1 year ago

I get coredumps of 1-9 fossilize_replay processes when launching Deep Rock Galactic. Sometimes the "compiling shader" dialog gets stuck and I have to manually skip or close it. Sometimes it continues, and sometimes fossilize_replay just doesn't crash and the shaders compile fine. It doesn't seem to follow any pattern, really.

I'm running an Intel Arc A770 with the latest Mesa drivers.

EDIT: This doesn't affect game performance, but weirdly enough, the game sometimes hangs my graphics driver. I raised this issue with the Mesa devs. As the hangs seem to be a regression in the driver and are not related to fossilize_replay's behaviour, I think these issues are unrelated at the time of writing.

kernel: fossilize_repla[35616]: segfault at 18 ip 000055d1650c7ade sp 00007fff1f7fdeb0 error 4 likely on CPU 7 (core 7, socket 0)
kernel: Code: 85 db 75 d8 49 8b 9f 60 02 00 00 48 85 db 74 2c 0f 1f 40 00 48 8b 73 10 48 85 f6 0f 84 83 02 00 00 49 8b 87 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 f0 4b 21 00 48 8b 1b 48 85 db 75 d8 49 8b 9f 98
systemd-coredump[35765]: [🡕] Process 35616 (fossilize_repla) of user 1000 dumped core.

                                                          Stack trace of thread 35616:
                                                          #0  0x000055d1650c7ade n/a (/home/user/.local/share/Steam/ubuntu12_64/fossilize_replay + 0x69ade)
                                                          ELF object binary architecture: AMD x86-64
kernel: fossilize_repla[35788]: segfault at 18 ip 000055d1650b7f47 sp 00007fff1f7fddd0 error 4 likely on CPU 1 (core 1, socket 0)
kernel: Code: 39 a5 98 03 00 00 75 76 0f 1f 00 48 8b 44 24 08 49 8d 5c c5 00 48 8b b3 68 0c 00 00 48 85 f6 74 13 49 8b 85 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 7f 47 22 00 48 c7 83 68 0c 00 00 00 00 00 00 49
systemd-coredump[35794]: [🡕] Process 35788 (fossilize_repla) of user 1000 dumped core.

                                                          Stack trace of thread 35788:
                                                          #0  0x000055d1650b7f47 n/a (/home/user/.local/share/Steam/ubuntu12_64/fossilize_replay + 0x59f47)
                                                          #1  0x0000000000000001 n/a (n/a + 0x0)
                                                          ELF object binary architecture: AMD x86-64
kernel: fossilize_repla[35792]: segfault at 18 ip 000055d1650c75ce sp 00007fff1f7fde70 error 4 in fossilize_replay[55d16505e000+240000] likely on CPU 4 (core 4, socket 0)
kernel: Code: 85 db 75 d8 49 8b 9f 60 02 00 00 48 85 db 74 2c 0f 1f 40 00 48 8b 73 10 48 85 f6 0f 84 83 02 00 00 49 8b 87 f8 0c 00 00 31 d2 <48> 8b 78 18 ff 15 00 51 21 00 48 8b 1b 48 85 db 75 d8 49 8b 9f 98
systemd-coredump[35807]: [🡕] Process 35792 (fossilize_repla) of user 1000 dumped core.

                                                          Stack trace of thread 35792:
                                                          #0  0x000055d1650c75ce n/a (/home/user/.local/share/Steam/ubuntu12_64/fossilize_replay + 0x695ce)
                                                          ELF object binary architecture: AMD x86-64

Strykar commented 8 months ago

Just realized this issue is 4 years old..

I am also seeing this issue on Arch Linux:

[28741.985411] fossilize_repla[303073]: segfault at 1d00030309 ip 000072864503a78d sp 00005bcf26cf0eb8 error 4 in libc.so.6[728644fd7000+15b000] likely on CPU 5 (core 5, socket 0)
[28741.985449] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d
[28748.762310] fossilize_repla[303075]: segfault at 55dd00000c35 ip 000072864503a78d sp 00005bcf26cf0eb8 error 4 in libc.so.6[728644fd7000+15b000] likely on CPU 11 (core 13, socket 0)
[28748.762348] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d
[28750.877639] traps: fossilize_repla[303076] general protection fault ip:72864503a78d sp:5bcf26cf0eb8 error:0 in libc.so.6[728644fd7000+15b000]

FWIW I don't play many games, mostly Dota2 and Warframe. Steam sysinfo - https://gist.github.com/Strykar/07574caeaa8ecd0f3bfae5c077c3f876

kisak-valve commented 8 months ago

Hello @Strykar, Driver: Mesa llvmpipe (LLVM 16.0.6, 256 bits) in your system information tells us that Steam was forced to fall back to llvmpipe (Mesa's faster CPU renderer) to run at all. This indicates that something is broken or incomplete with your video driver install. If you're using the NVIDIA proprietary driver and recently changed driver versions, the NVIDIA userspace libraries may not match the NVIDIA kernel module loaded into memory, and the easiest way to clear that condition is to reboot.

Strykar commented 8 months ago

Thanks @kisak-valve, but that is no longer the case today (nvidia-utils was a version behind yesterday). Even with all NVIDIA binary drivers and packages in order, it still logs:

[ 4936.499799] fossilize_repla[54043]: segfault at 55dd00000c35 ip 0000777788c1f78d sp 00005e2bdde99bf8 error 4 in libc.so.6[777788bbc000+15b000] likely on CPU 5 (core 5, socket 0)
[ 4936.499826] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d
[ 4936.504824] traps: fossilize_repla[54038] general protection fault ip:777788c1f78d sp:5e2bdde99bf8 error:0 in libc.so.6[777788bbc000+15b000]
[ 4936.750947] fossilize_repla[54034]: segfault at 17a00000482 ip 0000777788c1f78d sp 00005e2bdde99bf8 error 4 in libc.so.6[777788bbc000+15b000] likely on CPU 23 (core 13, socket 0)
[ 4936.750965] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d
[ 4937.278831] fossilize_repla[54008]: segfault at 2800000328 ip 0000777788c1f78d sp 00005e2bdde99bf8 error 4 in libc.so.6[777788bbc000+15b000] likely on CPU 21 (core 11, socket 0)
[ 4937.278850] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d
[ 4939.588249] fossilize_repla[54054]: segfault at c00040384 ip 0000777788c1f78d sp 00005e2bdde99bf8 error 4 in libc.so.6[777788bbc000+15b000] likely on CPU 23 (core 13, socket 0)
[ 4939.588269] Code: 83 f8 03 b8 00 00 04 00 48 0f 46 d0 31 c0 48 39 fa 0f 93 c0 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 48 8b 0c 25 10 00 00 00 <8b> 91 08 03 00 00 48 8d b9 08 03 00 00 89 d6 83 ce 02 39 d6 74 1d

Still logging Driver: Mesa llvmpipe (LLVM 16.0.6, 256 bits). Steam sysinfo - https://gist.github.com/Strykar/f70308b945cce671ac6863e0b4e54076