Open traveler-gemrb opened 2 months ago
To reproduce, one need to load game (BG1) or just load menu (IWD1).
Well please try what we talked about.
Hello,
I've already tested default Release, that is (-O3 -DNDEBUG), without any custom -flto / -march=native etc. I think I've also tried -O2 (same result). I've tested -O now (same).
https://pastebin.com/ABRCLKv5 (default cmake build configuration with clang)
The problem does not exist with gcc13, however I've noticed that GUI have changed (??).
head/master (gcc13)
ddce87c38 (clang)
The GUI stuff is unrelated; if you build the same revision you should get the same results. It's just widescreen mod changes.
Thanks for a clarification, how can I get back to previous style? I think that was the intended look.
I don't know which version you have, but from above reverting d47c4cf2 should do it (or copying the file).
FWIW,
Same with 14.1-STABLE #0 stable/14-91df7d335 amd64 and FreeBSD clang version 18.1.5
@bsdcode can you do a quick check if it still runs for you? It's odd that you two would have so different experiences on FreeBSD so close in time.
Unfortunately because of lack of time I hadn't compiled and tested GemRB for over a month now I think. So I just compiled the most recent commit now and can confirm the problem.
I'm on FreeBSD 14.0-RELEASE-p6 and compiled with clang 16.0.6. I tested BG1. Starting a new game produces this after character creation:
Assertion failed: (ref), function GemRB_View_AddSubview, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 1051.
Loading a saved game produces this:
Assertion failed: (delref), function GemRB_RemoveView, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.
I can test with GCC later.
It sounds like #1786 all over again. :s It's worth a bisect between the end of February and now, since two of you reported everything was fine: https://github.com/gemrb/gemrb/issues/1786#issuecomment-1956514898
I did bisect and my problem appeared with https://github.com/gemrb/gemrb/commit/605519eddd7f105ef595dc492b98aacd91b5d6ff
gcc13 here is fine at the moment, and was then too.
Oh, right, forgot you already did it, sorry. That commit is needed, since before all the compiler flags weren't being set after the modularisation of the system. At that time I compared command-lines (it's all printed if you do make VERBOSE=1
).
This little patch shows that on my system, there's no preset set for the values, so even if there was some problem with overriding, it's not coming into play. Do you get the same results?
diff --git a/CMakeLists.txt b/CMakeLists.txt
index a0cd50a88..fcabcae71 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -93,8 +93,14 @@ INCLUDE_DIRECTORIES(
)
INCLUDE(config)
+message(STATUS "INTRO VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
CONFIGURE_LINKING()
CONFIGURE_COMPILER()
+message(STATUS "POST VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
CONFIGURE_PYTHON()
CONFIGURE_SDL(${SDL_BACKEND})
CONFIGURE_OPENGL(${OPENGL_BACKEND} ${SDL_BACKEND})
-- INTRO VALS: -- INTRO VALS: -- INTRO VALS: -- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing -Wcast-align -Wno-error=stringop-truncation -Wno-error=stringop-overflow -Wno-error=stringop-overread -Wimplicit-fallthrough=2 -Woverloaded-virtual=0 -- POST VALS: -- POST VALS: -Wl,--no-undefined
My results are similar to yours, I have less Warnings/no-errors:
-- INTRO VALS: -- INTRO VALS: -- INTRO VALS: -- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing -- POST VALS: -- POST VALS: -Wl,--no-undefined
After I reverted 605519e in cmake/Helpers.cmake
I got empty POST VALS
, but GemRB works again.
Dropping -fvisibility=hidden
from cmake/Helpers.cmake
seems to fix the issue. But I'm a little bit out of the loop of symbol visibility and how it affects the C++/python interactions, so I don't know what could break, also in regards to compiling with GCC.
That's odd, since even before the major cmake changes of 553ecf4, we were adding it if available:
CHECK_CXX_COMPILER_FLAG("-fvisibility=hidden" VISIBILITY_HIDDEN)
IF (VISIBILITY_HIDDEN AND NOT WIN32)
string(APPEND CMAKE_CXX_FLAGS " -fvisibility=hidden")
ENDIF ()
I feel like we've touched on this elsewhere at some point (I think I wanted to drop visibility=hidden
from core
), but the issue has nothing to do with Python
. There is possibly more than one issue in play as well. The gist is:
dynamic_cast
and hiding symbols could conceivably result in RTTI info getting stripped as an optimization if the compiler deduces its not used. It's allowed to do this because it's used across module boundaries, which is implementation defined behavior, because C++ has no concept of dynamically loaded code. This might be fixed by explicitly adding a compiler flag to keep RTTI info.libcore
and GUIScripts.so
, which one is used would be UB. This is a chronic issue with templates.GEM_EXPORT
I still say it makes no sense to hide symbols in libcore
, and I don't remember why we do it. Not sure if that alone would resolve this issue tho. I think the disunion is in a closed PR from me if you want to go look.
Let's not get sidetracked if not needed, as we've been conservative with visibility for over a decade and nobody reported any problems until we did @czarny247's mentioned cmake refactor. Let's try to get to the underlying trigger, is it just a newer clang version problem? Etc.
@traveler-gemrb / @bsdcode can you try building a3501e0?
I'm not getting side tracked, I'm sharing my experience from decades of cross platform development regarding symbol visibility issues.
I'm suggesting that we identify the problematic symbols and ensure they are exported, including all the requisite RTTI info. Even then, there is no guarantee if you are finding yourself in a situation where the same symbol exists in multiple binary images (eg tamplates). A GemRB::foo
defined in libcore
is not the same as a GemRB::foo
defined in GUIScripts.so
and cannot be dynamically_cast to one. There may be linker flags/magic you can do to resolve this, however. Otherwise, it's going to require preprocessor magic to extern
those when building plugins to prevent duplication.
And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice. I'm not saying it shouldn't be addressed at some point, but here we currently have a clear smoking gun.
here we currently have a clear smoking gun
which is?
And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice
Thats my point tho, the nature of UB/IB is that it is allowed to work just fine. It's also allowed to stop working. We just witnessed this same phenomenon with SDL_Quit
and plugin lifetimes.
The refactor ... we were miscompiling everything for several months.
Don't know if it's helpful to narrow down the cause, but it's reproducible under a VM and the demo easily, and this is what returns nullptr
here for whatever reason: https://github.com/gemrb/gemrb/blob/5cfd2e6c70dd01b41790e73cbcc70ec65b62c5c4/gemrb/plugins/GUIScript/GUIScript.cpp#L182-L185
Originating from that piece of script: https://github.com/gemrb/gemrb/blob/5cfd2e6c70dd01b41790e73cbcc70ec65b62c5c4/gemrb/GUIScripts/Console.py#L20
@traveler-gemrb / @bsdcode can you try building a3501e0?
Sorry, was away - same problem here with clang :
(Assertion failed: (delref), function GemRB_RemoveView, file /home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.)
$ git log
commit a3501e03b96735fcc54a1dd74ef11bb759a1e83e (HEAD)
Author: czarny247 mateusz.hercun@gmail.com
Date: Sun Dec 17 19:29:57 2023 +0100
Add clangd .cache dir to .gitignore
I've checked again, https://github.com/gemrb/gemrb/commit/ddce87c38663bb9a445f8d66a1a9e6db45abd96d is ok.
Ok, thanks, so we can let cmake rest and it is a worse problem.
Bug description
gemrb-git build as Release (-O3 -DNDEBUG) fails here with
[ResourceManager]: Found 'GTRBPSK.mos' in 'chitin.key'. Assertion failed: (delref), function GemRB_RemoveView, file /usr/home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.
after (used git bisect)
commit 605519eddd7f105ef595dc492b98aacd91b5d6ff Author: Jaka Kranjc [lynxlupodian@users.sourceforge.net](mailto:lynxlupodian@users.sourceforge.net) Date: Tue Mar 26 11:07:12 2024 +0100
cmake: propagate compiler flag changes to parent scope
fixes last known regression and hopefully the odd crashes we've been having #2063
cmake/Helpers.cmake | 5 ++++- gemrb/core/Geometry.cpp | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-)
Env:
FreeBSD 14.0-STABLE #0 stable/14-e069c451f amd64
FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)