So just to keep you up-to-date on what we've found... I strongly suspect that the problem is related to the aggressive inlining occurring in this particular build. For example, contrast your results with the ones we get from ooex5 in our ooanalyzer test suite.
Several years ago, when we were just getting started on OOAnalyzer, handling inlining correctly seemed like an ambitious task. At the time we disabled inlining in the ooex executables using a compilation option. We've since expanded our testing to include more executables that do have inlining, but we've not reviewed the results at the level of detail you currently are. Thanks for that!
So the current theory is that if you disable inlining, or build the executable from multiple object modules so that the inlining cannot occur (this is common in real-world software, which is probably why we've missed the problem until now), the issue should go away. Regardless, there is a bug in OOAnalyzer when inlining has occurred, and we're looking for the cause now. This is related to the VBTable issue as well, which I'll comment on in the other ticket.
Thanks for the detailed information! That might solve the problem with the test cases. All of the larger programs whose contents I didn't know, and which I ran through OOAnalyzer just for fun, show exactly the same behaviour, so heavy inlining really does seem to be common in practice. I don't like it either, but I would like to support you with practical examples. And there, unfortunately, inlining happens frequently, sometimes so aggressively that there are no constructors left at all: the entire object-creation code is placed directly in the enclosing function. Unfortunately, I have had to work through exactly that.
In my experience, these are all (unfortunately) recurring problems in practice, quite apart from all the other code optimizations. When they are used heavily, the analysis is no fun anymore, because the code becomes virtually unreadable.
I agree that inlining is an annoying complication for analysis. I probably should not have implied that it was uncommon in real-world programs. Obviously it is very common, and so we must do our best with it. I was just remarking that implementing the constructor in a separate object module prevents this from occurring, which helps a lot, but that is not typical in test cases. I guess the counterpoint is that with the growth of templates and classes implemented entirely in header files, inlining is also quite common in real-world programs. We'll continue to do what we can to improve detection in the presence of inlining.
As for the case where the constructor is eliminated entirely, we've seen this before as well, and it represents a more significant problem for OOAnalyzer, since we're not always collecting the facts we need for the Prolog rules to reason correctly about the situation. In particular, our detection of possible VFTable write facts is currently only performed on __thiscall methods, because we have built-in presumptions about which this-pointer we're looking for the VFTable writes into. So if the constructor was rolled into main(), for example, we don't have the required object-pointer tracking to understand which object the VFTables were being written into in order to generate the required facts. So this is more difficult to fix than just improving the Prolog rules.
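As a rough illustration of that situation (a hypothetical class and names, not the binary from this issue), here is a minimal sketch of a constructor being fully inlined into main(); after optimization, the store of the vftable pointer happens inside main() rather than inside a separate __thiscall constructor:

// Hypothetical sketch only: names and layout are illustrative.
#include <cstdio>

struct Widget {
    int value;
    Widget() : value(42) {}                 // small enough to be inlined away
    virtual void speak() { std::puts("Widget"); }
};

int main() {
    // With optimization enabled, the compiler may expand Widget::Widget() here,
    // so the store of Widget's vftable pointer appears directly in main() and
    // there is no separate __thiscall constructor left to analyze.
    Widget* w = new Widget();
    w->speak();
    delete w;
    return 0;
}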
Thanks for your answer. Don't worry, I think I understood you correctly.
Since your facts basically imitate the manual process, I understand you very well. Even if you want to work all of this out by hand and have the necessary knowledge, it still takes a long time to see through what is actually happening. But in my opinion it is mainly the optimizations that are to blame: they make the code faster, but also totally unintelligible and confusing. You can also see this in how Ghidra struggles to decompile some methods, because they are so distorted that the meaning is lost.
I am sure your team will find a solution here. It would be enough if it worked for three quarters of all classes; then there would not be too much to rework. Some support is better than doing everything manually.
By the way, this is one of the reasons why I wrote in Issue #109 that a semi-automatic, user-supported analysis would be great. There you could intervene in the process and straighten out such facts. It would still require some manual effort, but in my opinion the overall results would be much better.
We've continued to look at this but have not yet updated the issue. One thing we forgot about is that OOAnalyzer does not have a way to communicate secondary VFTables in the JSON format. There may be other problems with inlining too, but even if we found the vftable right now, it wouldn't be communicated to Ghidra. This shouldn't be too difficult to fix.
I wanted to update again; my previous comment was not quite right. I was looking at the JSON and noticed that we (almost) never have more than one vftptr member per class.
But this is actually by design: in the case of inheritance, we are supposed to create a member for the base class, and that member contains the vftptr.
What we've seen is that in the binaries you've provided, we are generally not detecting the inheritance very well at all even when RTTI reveals the inheritance relationship. And because of the way that members are handled, if we don't detect the inheritance, the inherited vftptr is simply declared as a regular member. We could fix that but it would just be a band-aid for the real problem of not detecting the inheritance.
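As a rough sketch of the difference (illustrative member names in the style of the decompiled output below, not the actual JSON we produced for this binary):

// Illustrative sketch only; types, names, and offsets are assumptions.
struct Base1 { void *vftptr_0x0; int mbr_0x8; };
struct Base2 { void *vftptr_0x0; int mbr_0x4; };

// Inheritance detected: the second vftptr lives inside the base-class member.
struct Derived_detected {
    Base1 base1;      // base1.vftptr_0x0 is the primary vftable pointer
    Base2 base2;      // base2.vftptr_0x0 is the secondary vftable pointer
    int   mbr_0x28;
};

// Inheritance missed: the same slot comes out as an ordinary member.
struct Derived_missed {
    void *vftptr_0x0; // primary vftable pointer is still recognized
    int   mbr_0x8;
    void *mbr_0xNN;   // secondary vftable pointer, emitted as a plain member (offset made up)
    int   mbr_0x28;
};

int main() { Derived_detected d; Derived_missed m; (void)d; (void)m; return 0; }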
As @sei-ccohen said a few days back, the root problem is that we haven't tested enough on inlined binaries. Now that we're looking into the problem more, we've found a lot of bugs (and even a design issue). So we're still working on this, and when we're done OOAnalyzer will hopefully work a lot better on inlined binaries.
We are making good progress!
Derived * __thiscall Derived(Derived *this)
{
int **in_FS_OFFSET;
int *local_10;
undefined *puStack12;
undefined4 local_8;
local_8 = 0xffffffff;
puStack12 = &LAB_00402588;
local_10 = *in_FS_OFFSET;
*(int ***)in_FS_OFFSET = &local_10;
(this->Base1).vftptr_0x0 = (Base1::vftable_403210 *)0x403210;
FUN_004013d0((int *)cout_exref,"Base 1 Constructor");
(this->Base1).mbr_0x8 = 0;
*(undefined8 *)&(this->Base1).field_0x10 = 0;
(this->Base1).mbr_0x24 = 0;
this->mbr_0x28 = 0x7b;
*(undefined4 *)&(this->Base1).field_0x1c = 0;
*(undefined4 *)&(this->Base1).field_0x18 = 0;
(this->Base2).vftptr_0x0 = (Base2::vftable_403224 *)0x403224;
(this->Base2).mbr_0x4 = 0x405d2f1b;
local_8 = 1;
(this->Base1).vftptr_0x0 = (Base1::vftable_403210 *)0x4031fc;
(this->Base2).vftptr_0x0 = (Base2::vftable_403224 *)0x403234;
FUN_004013d0((int *)cout_exref,"Derived Constructor");
(this->Base1).mbr_0x8 = 0x41200000;
*(undefined8 *)&(this->Base1).field_0x10 = 0x4002c28f5c28f5c3;
*in_FS_OFFSET = local_10;
return this;
}
The virtual call you referenced looks a lot nicer now too:
this_00 = (Derived *)FUN_004016eb(0x40);
local_8 = 1;
this_00 = Derived(this_00);
local_8 = 0xffffffff;
(*((this_00->Base2).vftptr_0x0)->FUN_00401390_0)(&this_00->Base2,param_1);
Not sure why it named the vftable entry FUN_00401390_0 instead of Base2::~Base.
Great, this is really wonderful news! Such a result is truly outstanding. High praise to you and your team!
Is it already foreseeable when the next public version will be released? I am very curious about the changes, because the result you posted above looks fantastic.
We are working on a release. I hope that will be sometime this week. Some of the changes we made are fairly substantial, and we have been trying to make sure they didn't break anything else.
I have not forgotten about this, but we keep uncovering bugs on this branch. I think we're finally getting close, but I'm not sure it will happen this week.
Everything is fine, no need to rush! It takes as long as it takes. My question above was meant in a general way and was not intended to put any pressure on you. When I opened this issue, I was not aware of the chain reaction it would cause. From everything you write, it seems to touch things that are deeply anchored in the system.
Thanks for understanding. We just wanted to let you know we are still working on it. Many of the bugs only manifest after running for hours, which makes fixing them tricky.
We finally updated our code. But I just went back to verify that everything worked ok on your example and we're getting an error right now. Hopefully it won't take as long to fix this :)
Sounds great! No stress, I can easily wait until the remaining fix is ready. Just one question in advance: from which step do I have to redo the analysis of an existing program? From the very beginning, or can I still reuse any intermediate results?
Only the Prolog has changed, so you can start from your old .facts files if you have them.
The problem here is that there are multiple methods with the same bodies (clone, mumble) and the compiler is implementing them with a single function in the executable. We call this a shared implementation. This is very problematic because OOAnalyzer has a fairly fundamental assumption that a method in the executable can only be in a single class. RTTI tells us that Base1::vftable and Base2::vftable are on different classes and have no base classes. But because Base1::clone and Base2::clone are actually backed by a single shared implementation in the executable, we mistakenly conclude they are on the same class, since that shared implementation appears in both Base1's and Base2's vftables. This is what leads to the contradiction and the error.
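For a concrete, hypothetical picture of the shape involved (the method names clone and mumble come from the discussion above; the class bodies are made up), two unrelated classes whose virtual methods compile to byte-for-byte identical code can end up sharing a single implementation once the linker folds them:

// Hypothetical example of a shared implementation; class bodies are assumptions.
struct Base1 {
    virtual Base1* clone()  { return 0; }   // identical machine code to Base2::clone...
    virtual void   mumble() {}
    virtual ~Base1() {}
};
struct Base2 {
    virtual Base2* clone()  { return 0; }   // ...so the linker may fold both into one
    virtual void   mumble() {}              //    function that both vftables point to
    virtual ~Base2() {}
};

// After folding, Base1::vftable and Base2::vftable contain the same address for the
// clone slot even though RTTI says the classes are unrelated, which violates the
// assumption that a method in the executable belongs to a single class.
int main() { Base1 a; Base2 b; (void)a; (void)b; return 0; }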
The reason we are seeing this now and not before is because we recently fixed a bug that prevented us from detecting such contradictions during the initial forward reasoning stage. So it was most likely happening, but we were ignoring it.
We've tested a variety of programs and did not encounter this problem (the contradiction occurring during forward reasoning). So the problem in this case may be showing up partly because some of the method bodies have been stripped away. One way to work around it is to remove the shared implementations. I've done that here, and OOAnalyzer can handle this version of the program without an error because the shared implementations are gone.
We are still thinking about how to handle the shared-implementation problem in general, and will talk more about it later today. I suspect it's often a problem for templated classes; it probably causes us to merge different instantiations of a template into a single class.
Anyway, that was a long comment but I thought you might like to understand what was going on. I would suggest trying your real analysis target again. If you do run into this error, let us know. In the meantime we'll be thinking about a more general solution.
Many thanks for the detailed explanation. It is very interesting what insights you have gained, but unfortunately it shows once again how much the optimizations complicate everything. If I do not let the code be optimized at all, as you recommended at the beginning, then indeed everything works. By the way, you can also see this in Compiler Explorer, which you probably know as well; its repository is only a few doors down from yours here on GitHub.
When you play around with it, everything seems very logical and you understand what is happening. But even there, by default, there is no optimization, or only minimal optimization.
Unfortunately, the optimizations make it all unnecessarily complicated. You could almost speak of a kind of obfuscation, even if only in a figurative sense.
Take your time working on this! What matters is the quality of the result, not how long it takes to implement.
Just a general thought to conclude: if you know that such cases exist and that the automatic assumptions are difficult or need manual control, wouldn't it make sense to have a semi-automatic process, directly in Ghidra / IDA, with intermediate results and a conflict-resolution tool? This would significantly increase the strength of your analysis algorithms as well as the overall quality, because a human would still act as a reviewer, so to speak.
We've correlated the "multiple implementation" problem, which manifests as multiple symbols sharing the same address, with the linker optimization feature known as "link-time code generation". The documentation for this option is here:
https://docs.microsoft.com/en-us/cpp/build/reference/ltcg-link-time-code-generation?view=vs-2019
Contributing to the complexity of the issue, it appears that this option is on by default in Visual Studio for Release builds, off by default in CMake for Release builds, and generally off in debug builds. When linking against libraries, the libraries may have been built with this option even if your executable was not. We're continuing to explore the issue, but please be aware that this option is proving to be very important for the behavior that is causing problems in OOAnalyzer. If you're still testing OOAnalyzer, my recommendation is to turn this feature and /GL both off, and you should get better results, or at least results that better match our expectations for the capabilities of OOAnalyzer.
Thanks for the information. But now a question to make sure I understand: are these optimizations related to the inlining in the constructor, or is that something else again? I have reported problems in several places and never consciously considered this option. I can of course deliberately turn it off for testing. But imagine that I am analyzing an application that is foreign to me: how can I determine whether this option was active? Only from several constructors being inlined, or are there other symptoms that point to it?
No, I don't think it's related to constructor inlining. For the most part I think inlining is solvable, but our understanding has been incomplete. In the last release we changed several rules related to vftable reasoning that were subtly wrong because of inlining, and in general our understanding is improving over time. The shared implementation is more of a fundamental problem (though we are still learning how often and how much it affects results in practice).
Just for clarity, I did not mean to imply that the constructor inlining problems were caused by the link-time code generation option. This comment was in response to the one about mumble and speakClearly. I was also just remarking that if you are conducting tests (as you seem to be), then I would recommend that /LTCG be disabled for now. Obviously we're concerned that when analyzing someone else's executable you can't control which compilation options were used, and we'll continue investigating the behavior of /LTCG. To see whether it's enabled in your own build, look for /GL or /LTCG in the options sent to the compile and link commands. I think the answer is that it will be off in CMake, and on by default in Visual Studio for Release builds (but not debug builds). I think adding /DEBUG is also an easy way to be reasonably sure that the option is off. Later edit: it's very unclear how to tell whether it was enabled in an executable that you did not compile.
I have great news! I tried out the changes you made on a big real-world example, and yes, it really works! The results feel like they are a thousand times better with your corrections. The inlining issue really does seem to have been responsible for all the problems mentioned here.
Big praise to your whole team for the great work! It's really nice to see how much heart and soul has gone into this project.
That is really great to hear. Thank you for sharing!
Also, thank you for your contribution as well in reporting the problems. Your small examples have been extremely helpful for us in isolating the problems!
I think this is resolved. Please re-open if not.
I have taken a small example from the Internet and slightly adapted it. There are two base classes and one class that inherits from both. There are no special cases, so it is an everyday example that must occur a thousand times over.
What are the problems with this?
Well, let's say the main problems in this case concern the VFTables. Since the derived class has both VFTables, I assumed they would of course both be initialized and represented via the recovered structure. That is what I thought at first, until I took a closer look and realized that unfortunately this is not the case. The first VFTable is recognized completely and without problems.
The second one, which Ghidra itself can assign correctly, is treated as just a normal member by the analysis; no distinction is made here.
Furthermore, none of the functions for the second VFTable were resolved anywhere in the code, for example here in main. In Ghidra I can clearly see that a function is being called via a memory address.
In my eyes there seems to be a general problem with multiple inheritance.
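For reference, the rough shape of such an example (a sketch consistent with the decompiled names shown earlier in this thread, not the exact source of the test case) is:

// Sketch only: the real test case may differ; members and values are guesses.
#include <iostream>

struct Base1 {
    Base1() { std::cout << "Base 1 Constructor"; }
    virtual ~Base1() {}
    virtual void hello() { std::cout << "Base1\n"; }
    int   a = 0;
    float f = 0.0f;
};

struct Base2 {
    virtual ~Base2() {}
    virtual void world() { std::cout << "Base2\n"; }
    int b = 0;
};

struct Derived : Base1, Base2 {            // two bases, so Derived has two vftable pointers
    Derived() { std::cout << "Derived Constructor"; }
    void hello() override { std::cout << "Derived via Base1\n"; }
    void world() override { std::cout << "Derived via Base2\n"; }
    int extra = 0x7b;                      // matches the 0x7b seen in the decompiled constructor
};

int main() {
    // Calling through a Base2* exercises the second vftable; in the stripped binary
    // this shows up as an indirect call through the Base2 subobject's vftptr.
    Base2* p = new Derived();
    p->world();
    delete p;
    return 0;
}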