gemrb / gemrb

GemRB is a portable open-source implementation of Bioware’s Infinity Engine.
https://gemrb.org
GNU General Public License v2.0
929 stars 178 forks source link

Assertion failed: (delref), function GemRB_RemoveView, with clang on FreeBSD #2074

Open traveler-gemrb opened 2 months ago

traveler-gemrb commented 2 months ago

Bug description

gemrb-git build as Release (-O3 -DNDEBUG) fails here with

[ResourceManager]: Found 'GTRBPSK.mos' in 'chitin.key'. Assertion failed: (delref), function GemRB_RemoveView, file /usr/home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.

after (used git bisect)

commit 605519eddd7f105ef595dc492b98aacd91b5d6ff Author: Jaka Kranjc [lynxlupodian@users.sourceforge.net](mailto:lynxlupodian@users.sourceforge.net) Date: Tue Mar 26 11:07:12 2024 +0100

cmake: propagate compiler flag changes to parent scope

fixes last known regression and hopefully the odd crashes we've been having #2063

cmake/Helpers.cmake | 5 ++++- gemrb/core/Geometry.cpp | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-)

Env:

FreeBSD 14.0-STABLE #0 stable/14-e069c451f amd64

FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)

traveler-gemrb commented 2 months ago

To reproduce, one need to load game (BG1) or just load menu (IWD1).

lynxlynxlynx commented 2 months ago

Well please try what we talked about.

traveler-gemrb commented 2 months ago

Hello,

I've already tested default Release, that is (-O3 -DNDEBUG), without any custom -flto / -march=native etc. I think I've also tried -O2 (same result). I've tested -O now (same).

https://pastebin.com/ABRCLKv5 (default cmake build configuration with clang)

The problem does not exist with gcc13, however I've noticed that GUI have changed (??).

head/master (gcc13)

2024-04-13-214422_1440x900_scrot

ddce87c38 (clang)

2024-04-13-220255_1440x900_scrot

lynxlynxlynx commented 2 months ago

The GUI stuff is unrelated; if you build the same revision you should get the same results. It's just widescreen mod changes.

traveler-gemrb commented 2 months ago

Thanks for a clarification, how can I get back to previous style? I think that was the intended look.

lynxlynxlynx commented 2 months ago

I don't know which version you have, but from above reverting d47c4cf2 should do it (or copying the file).

traveler-gemrb commented 1 month ago

FWIW,

Same with 14.1-STABLE #0 stable/14-91df7d335 amd64 and FreeBSD clang version 18.1.5

lynxlynxlynx commented 1 month ago

@bsdcode can you do a quick check if it still runs for you? It's odd that you two would have so different experiences on FreeBSD so close in time.

bsdcode commented 1 month ago

Unfortunately because of lack of time I hadn't compiled and tested GemRB for over a month now I think. So I just compiled the most recent commit now and can confirm the problem.

I'm on FreeBSD 14.0-RELEASE-p6 and compiled with clang 16.0.6. I tested BG1. Starting a new game produces this after character creation:

Assertion failed: (ref), function GemRB_View_AddSubview, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 1051.

Loading a saved game produces this:

Assertion failed: (delref), function GemRB_RemoveView, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.

I can test with GCC later.

lynxlynxlynx commented 1 month ago

It sounds like #1786 all over again. :s It's worth a bisect between the end of February and now, since two of you reported everything was fine: https://github.com/gemrb/gemrb/issues/1786#issuecomment-1956514898

traveler-gemrb commented 1 month ago

I did bisect and my problem appeared with https://github.com/gemrb/gemrb/commit/605519eddd7f105ef595dc492b98aacd91b5d6ff

gcc13 here is fine at the moment, and was then too.

lynxlynxlynx commented 1 month ago

Oh, right, forgot you already did it, sorry. That commit is needed, since before all the compiler flags weren't being set after the modularisation of the system. At that time I compared command-lines (it's all printed if you do make VERBOSE=1).

This little patch shows that on my system, there's no preset set for the values, so even if there was some problem with overriding, it's not coming into play. Do you get the same results?

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a0cd50a88..fcabcae71 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -93,8 +93,14 @@ INCLUDE_DIRECTORIES(
 )

 INCLUDE(config)
+message(STATUS "INTRO VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
 CONFIGURE_LINKING()
 CONFIGURE_COMPILER()
+message(STATUS "POST VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
 CONFIGURE_PYTHON()
 CONFIGURE_SDL(${SDL_BACKEND})
 CONFIGURE_OPENGL(${OPENGL_BACKEND} ${SDL_BACKEND})

-- INTRO VALS: -- INTRO VALS: -- INTRO VALS: -- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing -Wcast-align -Wno-error=stringop-truncation -Wno-error=stringop-overflow -Wno-error=stringop-overread -Wimplicit-fallthrough=2 -Woverloaded-virtual=0 -- POST VALS: -- POST VALS: -Wl,--no-undefined

bsdcode commented 1 month ago

My results are similar to yours, I have less Warnings/no-errors:

-- INTRO VALS: -- INTRO VALS: -- INTRO VALS: -- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing -- POST VALS: -- POST VALS: -Wl,--no-undefined

After I reverted 605519e in cmake/Helpers.cmake I got empty POST VALS, but GemRB works again.

bsdcode commented 1 month ago

Dropping -fvisibility=hidden from cmake/Helpers.cmake seems to fix the issue. But I'm a little bit out of the loop of symbol visibility and how it affects the C++/python interactions, so I don't know what could break, also in regards to compiling with GCC.

lynxlynxlynx commented 1 month ago

That's odd, since even before the major cmake changes of 553ecf4, we were adding it if available:

       CHECK_CXX_COMPILER_FLAG("-fvisibility=hidden" VISIBILITY_HIDDEN)
       IF (VISIBILITY_HIDDEN AND NOT WIN32)
               string(APPEND CMAKE_CXX_FLAGS " -fvisibility=hidden")
       ENDIF ()
bradallred commented 1 month ago

I feel like we've touched on this elsewhere at some point (I think I wanted to drop visibility=hidden from core), but the issue has nothing to do with Python. There is possibly more than one issue in play as well. The gist is:

  1. we use dynamic_cast and hiding symbols could conceivably result in RTTI info getting stripped as an optimization if the compiler deduces its not used. It's allowed to do this because it's used across module boundaries, which is implementation defined behavior, because C++ has no concept of dynamically loaded code. This might be fixed by explicitly adding a compiler flag to keep RTTI info.
  2. if duplicate symbols exist in both libcore and GUIScripts.so, which one is used would be UB. This is a chronic issue with templates.
  3. could be that some classes are missing GEM_EXPORT

I still say it makes no sense to hide symbols in libcore, and I don't remember why we do it. Not sure if that alone would resolve this issue tho. I think the disunion is in a closed PR from me if you want to go look.

lynxlynxlynx commented 1 month ago

Let's not get sidetracked if not needed, as we've been conservative with visibility for over a decade and nobody reported any problems until we did @czarny247's mentioned cmake refactor. Let's try to get to the underlying trigger, is it just a newer clang version problem? Etc.

@traveler-gemrb / @bsdcode can you try building a3501e0?

bradallred commented 1 month ago

I'm not getting side tracked, I'm sharing my experience from decades of cross platform development regarding symbol visibility issues.

I'm suggesting that we identify the problematic symbols and ensure they are exported, including all the requisite RTTI info. Even then, there is no guarantee if you are finding yourself in a situation where the same symbol exists in multiple binary images (eg tamplates). A GemRB::foo defined in libcore is not the same as a GemRB::foo defined in GUIScripts.so and cannot be dynamically_cast to one. There may be linker flags/magic you can do to resolve this, however. Otherwise, it's going to require preprocessor magic to extern those when building plugins to prevent duplication.

lynxlynxlynx commented 1 month ago

And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice. I'm not saying it shouldn't be addressed at some point, but here we currently have a clear smoking gun.

bradallred commented 1 month ago

here we currently have a clear smoking gun

which is?

And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice

Thats my point tho, the nature of UB/IB is that it is allowed to work just fine. It's also allowed to stop working. We just witnessed this same phenomenon with SDL_Quit and plugin lifetimes.

lynxlynxlynx commented 1 month ago

The refactor ... we were miscompiling everything for several months.

MarcelHB commented 1 month ago

Don't know if it's helpful to narrow down the cause, but it's reproducible under a VM and the demo easily, and this is what returns nullptr here for whatever reason: https://github.com/gemrb/gemrb/blob/5cfd2e6c70dd01b41790e73cbcc70ec65b62c5c4/gemrb/plugins/GUIScript/GUIScript.cpp#L182-L185

Originating from that piece of script: https://github.com/gemrb/gemrb/blob/5cfd2e6c70dd01b41790e73cbcc70ec65b62c5c4/gemrb/GUIScripts/Console.py#L20

traveler-gemrb commented 1 month ago

@traveler-gemrb / @bsdcode can you try building a3501e0?

Sorry, was away - same problem here with clang :

(Assertion failed: (delref), function GemRB_RemoveView, file /home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.)

$ git log
commit a3501e03b96735fcc54a1dd74ef11bb759a1e83e (HEAD) Author: czarny247 mateusz.hercun@gmail.com Date: Sun Dec 17 19:29:57 2023 +0100

Add clangd .cache dir to .gitignore

I've checked again, https://github.com/gemrb/gemrb/commit/ddce87c38663bb9a445f8d66a1a9e6db45abd96d is ok.

lynxlynxlynx commented 1 month ago

Ok, thanks, so we can let cmake rest and it is a worse problem.