freeorion / freeorion

Source code repository of the FreeOrion project.
https://www.freeorion.org
858 stars 164 forks source link

Dynamic cast fails in Clang 6: SetShipPartMeter::Execute couldn't cast simple increment ValueRef to an Operation #2300

Open ischwarze opened 5 years ago

ischwarze commented 5 years ago

Environment

Description

When the player researches an upgrade like "Mass Driver 2", the damage done by the respective weapons is supposed to increase by one point, and similar for other upgrades of this kind, as soon as the ships having these weapons come back to supply. However, with some compilers, this does not happen; instead, nothing happens, the damage remains unchanged. With the two attached patches, i see the following log entry (using "effect" log level "trace"):

[error] server : Effect.cpp:693 : SetShipPartMeter::Execute couldn't cast simple increment ValueRef to an Operation. Reverting to standard execute. [error] server : Effect.cpp:695 : typeid(ValueRefBase): N8ValueRef12ValueRefBaseIdEE [13398446319136] [error] server : Effect.cpp:697 : typeid(Operation): N8ValueRef9OperationIdEE [13398446319168] [error] server : Effect.cpp:699 : typeid(m_value): N8ValueRef9OperationIdEE [13397535772400]

So the problem is that at least with the following compiler, the preceding dynamic_cast fails and returns 0:

$ uname -a OpenBSD isnote.usta.de 6.4 GENERIC.MP#341 amd64 $ which cc /usr/bin/cc $ /usr/bin/cc --version
OpenBSD clang version 6.0.0 (tags/RELEASE_600/final) (based on LLVM 6.0.0) Target: amd64-unknown-openbsd6.4 Thread model: posix InstalledDir: /usr/bin

I'm not completely sure why the RTTI downcast fails: The source class has virtual functions and is only a parent of the target class in one way, and the SimpleIncrement() check right before seems to indicate that the object is indeed of type ::Operation. But from the above debugging output, it appears that while the typeid names match, the hashes do not, which may indicate some kind of subtle type mismatch somewhere deep in the guts of the parser, which is not exactly easy to read, so i'm sorry to admit that i was unable to find the root cause.

The lower attached patch is an immediate bugfix in the sense of a workaround, to commit right away. The same workaround is already present at other places in the code. The upper patch is not intended for commit but to help you get started finding the root cause in the autogenerated parser code.

Thanks, Ingo.

SetShipPartMeter.patch.txt

geoffthemedio commented 5 years ago

Could you make a pull request? I'd like to use your patch with you properly attributed, but that log listing isn't parsed correctly as a patch file.

ischwarze commented 5 years ago

No, the github web GUI is totally unusable in my opinion. I can use git, but as a matter of principle, i only interact with github over the normal git(1) command-line UI, i never fork repos on github, and i never submit pull requests for anything. Just merge the patch in whichever way you want, i don't care about attribution on github. Thanks!

geoffthemedio commented 5 years ago

In that case, can you at least make it an actual patch file, not a commit log listing? Based on what TortoiseGit tells me it's doing, something like git format-patch -o "/path/to/example.patch" -2 where the 2 indicates how many commits to make patches for (eg. the last two commits to the local repository will have patch files generated for them). For me just now, the result is two files, with contents like

From 4ea1a04f252871d432412cd3604a9af78bc9f978 Mon Sep 17 00:00:00 2001
From: Morlic-fo <Morlic@gmx.de>
Date: Wed, 3 Oct 2018 09:19:10 +0200
Subject: [PATCH 1/2] Fix incorrect log call

---
 default/python/AI/ProductionAI.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/default/python/AI/ProductionAI.py b/default/python/AI/ProductionAI.py
index eeeedce00..92d67fd2e 100644
--- a/default/python/AI/ProductionAI.py
+++ b/default/python/AI/ProductionAI.py
@@ -407,7 +407,7 @@ def generate_production_orders():
                 for pid in pids:
                     best_design_id, col_design, build_choices = get_best_ship_info(PriorityType.PRODUCTION_ORBITAL_DEFENSE, pid)
                     if not best_design_id:
-                        debug("no orbital defenses can be built at ", PlanetUtilsAI.planet_string(pid))
+                        debug("no orbital defenses can be built at %s", PlanetUtilsAI.planet_string(pid))
                         continue
                     retval = fo.issueEnqueueShipProductionOrder(best_design_id, pid)
                     debug("queueing %d Orbital Defenses at %s" % (num_needed, PlanetUtilsAI.planet_string(pid)))
-- 
2.16.2.windows.1

In general this would be rather helpful, as aside from the attribution issue, I (or others) will otherwise need to manually re-implement whatever changes you've suggested.

Vezzra commented 5 years ago

@geoffthemedio: you just can copy the actual diff parts from the patch file @ischwarze provided (lines 9-30 and 41-54) and save them to separate patch files (or paste them into a "apply patch" function which can take a diff text if TortoiseGit has that).

Anyway, I just created a PR (#2301) for the immediate bugfix, you can merge that right away if you're ok with it.

geoffthemedio commented 5 years ago

I tried that, but it gave errors and wouldn't apply the changes. And despite @ischwarze not caring, I prefer not to misattribute commits.

ischwarze commented 5 years ago

Here is exactly the same patch again formatted with the command geoffthemedio asked for:

$ git format-patch -o SetShipPartMeter.format-patch.txt -1

The result is byte-for-byte identical, so i'm sorry to say that i don't have the slightest idea what your problem is. You should be able to apply and and push it without any problems. Without seeing error messages from the git(1) commands you type, i can't try to help with those mysterious problems...

So here is the patch, rebased on top of the intervening commit and regenerated with the command you asked for - i hope that now works for you:

SetShipPartMeter.format-patch.txt

Vezzra commented 5 years ago

I can only make guesses here: I suspect TortoiseGit doesn't handle patch/diff files well. I had no problem copying the diff parts and applying them with SourceTree's (that's the git GUI client I use) "Apply patch" function at all.

Working with git command line should be even less of a problem (with patch files generated by git).

Anyway, @geoffthemedio: anything preventing just merging the PR I created out of the patch @ischwarze provided (#2301)? It has no conflicts yet, so no prior rebase necessary.

Afterwards I can make a branch in master with the second part of @ischwarze's patch applied, the one which can be used to track down the underlying problem, if so desired...?

geoffthemedio commented 5 years ago

The two patches @ischwarze provided are not identical... The latter's metadata is formatted as a patch, not as a long output, and applies correctly. Thanks for reposting it.

geoffthemedio commented 5 years ago

Committed the bugfix patch in https://github.com/freeorion/freeorion/commit/b0abc77d414d5a4922ed29acafdd7c3f6cd85c1f

If someone who wants to investigate has a suitable system that can produce the error, they can make the suggested changes relatively easily without need for a separate branch, I think...

--- a/universe/Effect.cpp
+++ b/universe/Effect.cpp
@@ -687,9 +687,16 @@ void SetShipPartMeter::Execute(const ScriptingContext& context, const TargetSet&
     } else if (m_value->SimpleIncrement()) {
         // meter value is a consistent constant increment for each target, so handle with
         // deep inspection single ValueRef evaluation
-        auto op = dynamic_cast<ValueRef::Operation<double>*>(m_value.get());
+        ValueRef::ValueRefBase<double>* base = m_value.get();
+        auto op = dynamic_cast<ValueRef::Operation<double>*>(base);
         if (!op) {
             ErrorLogger() << "SetShipPartMeter::Execute couldn't cast simple increment ValueRef to an Operation. Reverting to standard execute.";
+            const std::type_info& tid1 = typeid(ValueRef::ValueRefBase<double>);
+            ErrorLogger() << "typeid(ValueRefBase): " << tid1.name() << " [" << tid1.hash_code() << "]";
+            const std::type_info& tid2 = typeid(ValueRef::Operation<double>);
+            ErrorLogger() << "typeid(Operation): " << tid2.name() << " [" << tid2.hash_code() << "]";
+            const std::type_info& tid3 = typeid(*base);
+            ErrorLogger() << "typeid(m_value): " << tid3.name() << " [" << tid3.hash_code() << "]";
             EffectBase::Execute(context, targets);
             return;
         }

It might make sense to rename this or make a new issue about the cast failing on some systems, though...

Vezzra commented 5 years ago

Ok, I'll close #2301.

Vezzra commented 5 years ago

As the committed patch isn't a "real" fix, I guess we should keep this open...?

ischwarze commented 5 years ago

@Vezzra Keeping it open is indeed my recommendation. I suspect that there is some subtle issue in the code, like relying on some kind of undefined behaviour somewhere in the autogenerated code in the parser, that could possibly cause trouble elsewhere in the code, too. So if somebody who understands both C++ and the FreeOrion codebase well could investigate with Clang 6, that would seem useful to me.

Of course, in principle, it could also be a bug in Clang 6 - but i kind of doubt that the first issue i trip over round here is a compiler bug in a mature compiler product from a major company - and reporting a compiler bug would also require understanding the issue better...

ischwarze commented 5 years ago

I have more information, though i still don't understand what is going on. But maybe it rings some bells with the C++ wizards here.

I managed to get freeoriond (compiled with clang 6 with debugging symbols (-g) and with my debugging patch as posted above) to run under the GNU debugger, specifically:

$ egdb --version
GNU gdb (GDB) 7.12.1
[...]
This GDB was configured as "x86_64-unknown-openbsd6.4".

I set a breakpoint in Effect::SetShipPartMeter::Execute(), inside the "if (!op)" right before the "EffectBase::Execute(context, targets);", and here is what i see:

(gdb) whatis base
type = ValueRef::ValueRefBase<double> *
(gdb) p base
$1 = (ValueRef::ValueRefBase<double> *) 0x4a1018154c0
(gdb) whatis *base
type = ValueRef::ValueRefBase<double>
(gdb) p *base
$2 = {
  _vptr$ValueRefBase = 0x4a0f62b1438 
  <vtable for ValueRef::Operation<double>+16> 
}

So the ValueRef object does have a vtable (not surprising since SimpleIncrement() succeeds, and on first sight, it appears to be of the expected derived type. However:

(gdb) info vtbl base
vtable for 'ValueRef::ValueRefBase<double>' @ 0x4a0f62b1438 (subobject @ 0x4a1018154c0):
[0]: 0x4a0f58544c0
utils.c:1073: internal-error: virtual memory exhausted: can't allocate 25032 bytes.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

(gdb) p base->_vptr$ValueRefBase
Recursive internal problem.
Abort trap (core dumped)

In a different gdb session, i got a similar result accessing m_value directly, without triggering the memory exhaustion first:

(gdb) info vtbl m_value.__ptr_.__value_
vtable for 'ValueRef::ValueRefBase<double>' @ 0x15a9c159e438 (subobject @ 0x15a95629e540):
Recursive internal problem.
Abort trap (core dumped)

Something seems to be fishy with the vtable, like something self-referential triggering an endless loop. If the vtable is somehow broken, that might explain why the dynamic_cast fails. But i don't know yet whether (and how) the vtable is broken, and much less why, and i'm not quite sure how to investigate further if i cannot even inspect the vtable.

Besides, with debugging symbols, your libraries are positively gigantic, and gdb(1) takes a very long time to read debugging symbols from them, several minutes on my machine for each attaching of gdb to freeoriond, so debugging is somewhat painful...

$ ls -hal /usr/local/lib/freeorion/
total 3241636
drwxr-xr-x   2 root  wheel   512B Oct 23 21:13 .
drwxr-xr-x  68 root  wheel  46.5K Oct 23 20:34 ..
-rw-r--r--   1 root  bin    60.2M Oct 23 18:38 libGiGi.so
-rw-r--r--   1 root  bin     3.0M Oct 23 18:38 libGiGiSDL.so
-rw-r--r--   1 root  bin     162M Oct 23 19:05 libfreeorioncommon.so
-rw-r--r--   1 root  bin     1.3G Oct 23 18:56 libfreeorionparse.so

Any ideas what might be going on here?

ischwarze commented 5 years ago

Here is some more progress.

Again, i attached gdb 7.12.1 to freeoriond and set a breakpoint immediately after printing these messages:

[error] server : Effect.cpp:693 : SetShipPartMeter::Execute couldn't cast
simple increment ValueRef to an Operation. Reverting to standard execute.
typeid(ValueRefBase):N8ValueRef12ValueRefBaseIdEE[8142361567568] 0x767ca91a150
typeid(Operation):   N8ValueRef9OperationIdEE    [8142361567600] 0x767ca91a170
typeid(m_value):     N8ValueRef9OperationIdEE    [8142103603136] 0x767bb3167c0

And now let's inspect the Operation object:

The location of the local variable "base" on the stack:

(gdb) p (long ***)&base
$2 = (long ***) 0x7f7ffffcfd28

The location where the object itself is stored on the heap:

(gdb) p base
$3 = (ValueRef::ValueRefBase<double> *) 0x76877040980
(gdb) p *(long ***)&base
$4 = (long **) 0x76877040980

The pointer to the vtable:

(gdb) p *base
$5 = {
  _vptr$ValueRefBase = 0x767bb97c438
  <vtable for ValueRef::Operation<double>+16>
}
(gdb) p **(long ***)&base
$6 = (long *) 0x767bb97c438 <vtable for ValueRef::Operation<double>+16>

The contents of the vtable:

(gdb) x/24xg 0x767bb97c438
0x767bb97c438 <_ZTVN8ValueRef9OperationIdEE+16>:        0x00000767baf1f4c0
                                                        0x00000767baf1f560
0x767bb97c448 <_ZTVN8ValueRef9OperationIdEE+32>:        0x00000767baf1f5e0
                                                        0x00000767baf1f6f0
0x767bb97c458 <_ZTVN8ValueRef9OperationIdEE+48>:        0x00000767baf1f730
                                                        0x00000767baf1f7b0
0x767bb97c468 <_ZTVN8ValueRef9OperationIdEE+64>:        0x00000767baf1f830
                                                        0x00000767baf1f8b0
0x767bb97c478 <_ZTVN8ValueRef9OperationIdEE+80>:        0x00000767baf1f930
                                                        0x00000767baf1f9d0
0x767bb97c488 <_ZTVN8ValueRef9OperationIdEE+96>:        0x00000767baf1fa00
                                                        0x00000767baf20890
0x767bb97c498 <_ZTVN8ValueRef9OperationIdEE+112>:       0x00000767baf217e0
                                                        0x00000767baf21850
0x767bb97c4a8 <_ZTIN8ValueRef12ValueRefBaseIdEE>:       0x00000767f8de46a8
                                                        0x00000767bb3167e0
0x767bb97c4b8:                                          0x0000000000000000
                                                        0x00000767f8de46f8
0x767bb97c4c8 <_ZTIN8ValueRef9OperationIdEE+8>:         0x00000767bb3167c0
                                                        0x00000767bb97c4a8
0x767bb97c4d8:                                          0x0000000000000000
                                                        0x00000767f8de46f8
0x767bb97c4e8 <_ZTIN8ValueRef8VariableIdEE+8>:          0x00000767bb316800
                                                        0x00000767bb97c4a8

Here is a new observation: with this compiler, the typeid() "hashes" are actually virtual memory addresses: typeid(m_value) = 0x767bb3167c0 = location pointed to by <_ZTIN8ValueRef9OperationIdEE+8>.

So let's look at these locations - for the actual ::Operation class first:

(gdb) x/5xg 0x767ca91a170
0x767ca91a170 <_ZTSN8ValueRef9OperationIdEE>:           0x5265756c6156384e
                                                        0x617265704f396665
0x767ca91a180 <_ZTSN8ValueRef9OperationIdEE+16>:        0x454564496e6f6974
                                                        0x0000000000000000
0x767ca91a190 <_ZTSNSt3__16vectorINS_10unique_ptrIN6Effect10EffectBaseENS
  _14default_deleteIS3_EEEENS_9allocatorIS6_EEEE>:      0x36315f5f3374534e

So this is a 24-byte object with some relation to ValueRef::Operator, observe the similar name, s/^_ZTV/_ZTS/.

Now for the object obtained from m_value:

(gdb) x/5xg 0x767bb3167c0
0x767bb3167c0 <_ZTSN8ValueRef9OperationIdEE>:           0x5265756c6156384e
                                                        0x617265704f396665
0x767bb3167d0 <_ZTSN8ValueRef9OperationIdEE+16>:        0x454564496e6f6974
                                                        0x0000000000000000
0x767bb3167e0 <_ZTSN8ValueRef12ValueRefBaseIdEE>:       0x5265756c6156384e

That's the same content, but at a different memory location, and (consequently?) a different typeid.

Looking at the bytes, i noticed they are all in the ASCII range, so the "S" in _ZTS probably means "string", and indeed:

(gdb) x/s 0x767ca91a170
0x767ca91a170 <_ZTSN8ValueRef9OperationIdEE>:   "N8ValueRef9OperationIdEE"

That's the name returned from typeid(). Two copies of it.

So maybe the compiler is somehow constructing the same type ValueRef::Operator twice, with both copies being functional, with the type name "ValueRef::Operator" in C++ code refering to the first copy, but with some of your actual objects being constructed by the parser as instances of the second copy, and then the compiler is treating both copies of the type as distinct types, hence being unable to downcast because it thinks it got the wrong type, even though it's actually some kind of copy of the same type?

In a different debugging session, where the vtable of the object from m_value started with

0x........... <_ZTVN8ValueRef9OperationIdEE+16>: 0x00001d755a3f74c0

i tried to inspect this first function pointer:

(gdb) disas 0x00001d755a3f74c0
Recursive internal problem.
Abort trap (core dumped)

So something is possibly wrong with the copy of the class.

Can you think of anything in your parser code that might instantiate ::Operation objects into a copy of the ValueRef::Operation class rather than into the class itself?

Dilvish-fo commented 5 years ago

Can you think of anything in your parser code that might instantiate ::Operation objects into a copy of the ValueRef::Operation class rather than into the class itself?

ValueRef::Operation is templated, so every specialization <T> is a distinct class, right? I see that while most of the parser uses of ValueRef::Operation are themselves expressed in terms of ValueRef::Operation<T>, there are a couple that specifically code for ValueRef::Operation<std::string>, and perhaps some of the other ones are winding up being some similar variant, like ValueRef::Operation<std::wstring> or ValueRef::Operation<std::basic_string> or somesuch, could that be causing the problem you are finding?

The actual instantiations of ValueRef::Operation objects by the parser are all handled within the bowels of the boost spirit new_ 'lazy' object construction function, it appears to me.

ischwarze commented 5 years ago

ValueRef::Operation is templated, so every specialization <T> is a distinct class, right?

I think that is true.

I see that while most of the parser uses of ValueRef::Operation are themselves expressed in terms of ValueRef::Operation<T>, there are a couple that specifically code for ValueRef::Operation<std::string>, and perhaps some of the other ones are winding up being some similar variant, like ValueRef::Operation<std::wstring> or ValueRef::Operation<std::basic_string> or somesuch, could that be causing the problem you are finding?

I can't totally exclude that the object type mismatch is caused by a mismatch of the template parameter, but i somewhat doubt it, because in that case, i would already expect the assignment

ValueRef::ValueRefBase<double>* base = m_value.get();

that i introduced for debugging to fail (at compile time, even) - right? So "*base" does appear to belong to a derived class of ValueRef::ValueRefBase, right? Maybe just not exactly the right derived class, or a copy of it that the compiler considers distinct... (?)

I have tested my above guess regarding the template parameter by changing the above line to

ValueRef::ValueRefBase<float>* base = m_value.get();

and recompiling the single file:

/usr/ports/pobj/freeorion-0.4.8.1/bin/c++ -DBINPATH=\"/usr/local/bin\" -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LINK -DBOOST_OPTIONAL_CONFIG_USE_OLD_IMPLEMENTATION_OF_OPTIONAL -DENABLE_BINRELOC -DFREEORION_BUILD_COMMON -DFREEORION_OPENBSD -DSHAREPATH=\"/usr/local/share\" -Dfreeorioncommon_EXPORTS -isystem /usr/ports/pobj/freeorion-0.4.8.1/src-tarball -isystem /usr/local/include -isystem /usr/ports/pobj/freeorion-0.4.8.1/src-tarball/GG -O2 -pipe -g -fPIC -Wall -fvisibility=hidden -ftemplate-depth=512 -std=c++11 -MD -MT CMakeFiles/freeorioncommon.dir/universe/Effect.cpp.o -MF CMakeFiles/freeorioncommon.dir/universe/Effect.cpp.o.d -o CMakeFiles/freeorioncommon.dir/universe/Effect.cpp.o -c /usr/ports/pobj/freeorion-0.4.8.1/src-tarball/universe/Effect.cpp

/usr/ports/pobj/freeorion-0.4.8.1/src-tarball/universe/Effect.cpp:690:40: error:
cannot initialize a variable of type 'ValueRef::ValueRefBase<float> *'
with an rvalue of type
'std::__1::unique_ptr<ValueRef::ValueRefBase<double>,
std::__1::default_delete<ValueRef::ValueRefBase<double> > >::pointer'
(aka 'ValueRef::ValueRefBase<double> *')
ValueRef::ValueRefBase<float>* base = m_value.get();
                               ^      ~~~~~~~~~~~~~
1 error generated.

So, it does not compile with an even slightly mismatching template argument, and the compiler error message also indicates that the "double" used in your current code is indeed the right template argument.

The actual instantiations of ValueRef::Operation objects by the parser are all handled within the bowels of the boost spirit new_ 'lazy' object construction function, it appears to me.

You mean, in ValueRefParser.cpp, namespace parse, namespace detail, function arithmetic_rules(), line 148,

_b = construct_movable_(new_<ValueRef::Operation<T>>(
    _c,
    deconstruct_movable_(_a, _pass),
    deconstruct_movable_(_1, _pass)))

I did compile boost with debugging symbols and i can try to trace down into that if that is where the object is constructed, but i have no experience with such stuff and i'm not sure i'll find anything down there...

So if anybody with more C++ experience would manage to reproduce the problem with Clang (assuming you are normally using GCC and it doesn't occur with that), the root cause might possibly be found with less delay...

Oh, i just noticed i also happen to have GCC installed:

$ egcc --version
egcc (GCC) 4.9.4
[...]

I can also try to reproduce with that, but it will require yet again several hours of compiling, and i'm not sure what exactly it would tell us, no matter whether or not the problem also occurs, or does not occur, with GCC. Debugging probably needs to be done with a compiler that actually finds the problem, and it probably doesn't matter which one, as long as it doesn't hide the problem. If there are specific reasons to do so, i could probably also install GCC 6.4 - but not GCC 7, 8, or 9, that would cause too much work because i don't really feel like trying to port a new compiler for this purpose.

geoffthemedio commented 5 years ago

Could it be just that the cast is auto op = dynamic_cast<ValueRef::Operation<double>*>(m_value.get()); in SetShipPartMeter::Execute(const ScriptingContext& context, const TargetSet& targets) const, which is const, so this->m_value is also const?

If so, just changing to auto op = dynamic_cast<const ValueRef::Operation<double>*>(m_value.get()); might help?

ischwarze commented 5 years ago

Hi @geoffthemedio,

If so, just changing to auto op = dynamic_cast<const ValueRef::Operation<double>*>(m_value.get()); might help?

I just tested, with the even more explicit

const ValueRef::ValueRefBase<double>* base = m_value.get();
auto op = dynamic_cast<const ValueRef::Operation<double>*>(base);

and it changes nothing: op is still 0.

I think you are right that dynamic_cast cannot cast away const, and that m_ value is const, but the return value from unique_ptr<T>::get() is still T*, not const T*, even when the unique_ptr object itself is const - that only means we cannot change the content of the container object, but the content itself need not be const. If i understand correctly, if the content were const, as in unique_ptr<const T>, our code would already error out at compile time.

Of course, the above variant with const compiles too, because you can always add const, as long as you don't modify the object in question (here: the Operation), which we don't. But the downcast returns 0 just as before.