cmu-sei / pharos

Automated static analysis tools for binary programs

Future features [ooanalyzer] ? #109

Closed nemerle closed 3 years ago

nemerle commented 4 years ago

Hi, I've been having fun using OOAnalyzer on a few random executables and started wondering, is there some kind of plan for the future of this tool?

sei-ccohen commented 4 years ago

We have lots of plans. ;-) It's unclear which of them will become reality though. Obviously support for 64-bit Windows executables, Linux executables, GCC, better performance, better accuracy, etc. are all things we've discussed. In fact, there are so many plans that we can't really communicate them all clearly in an issue like this. If you'd like to ask about plans for a specific issue, please do.

sei-eschwartz commented 4 years ago

I'm glad you are having fun using OOAnalyzer. We've certainly been having fun creating it :-)

I agree with all the things that @sei-ccohen mentioned. For me, there are two high-level goals. First, I'd like OOAnalyzer to be a practical, useful reverse engineering tool. I think we're close to that for 32-bit MSABI binaries, but obviously the lack of 64-bit and Itanium ABI support is a problem. My second goal is to find the boundary of how much information can be recovered/inferred from binaries. If you look through the closed issues, you'll still see fairly regularly that one of our assumptions was broken. Over time, we've been improving our understanding of C++ compilers, and as a result our accuracy has improved as well.

I still think there is room for improvement in accuracy, but it is becoming more and more difficult to improve our rules. This is because we've got most of the low-hanging fruit, and now the mistakes we make are often in situations where an inference can be made in some circumstances but not in others (e.g., the compiler inlined constructor/destructor calls). It is a frequent occurrence for one of us to propose a rule and not be able to fully think through all the contexts in which it would be used to confirm that it is correct. So I think we're starting to hit the limit of our ability to write these rules without some type of support. To that end, we are currently proposing a project to develop a proof system that allows us to reason about all the many contexts that can lead to particular code patterns. This should allow us to formally prove some rules as being correct, and perhaps more practically, to identify counterexample scenarios that we did not quite think of.

nemerle commented 4 years ago

Thanks for the answer. As for more specific things, I was wondering if you plan or would consider working on more interactive workflow bits:

sei-ccohen commented 4 years ago

Some of these topics we've addressed already and others we've thought about. I've reordered your list some to comment on each:

detecting array constructor calls ( new(sizeof(ObjectType)*count_arg) )

We don't currently handle arrays very well, but we should detect that array constructor correctly... I think. See commentary below on detecting new() and delete().

mark a function as ('new') since sometimes the analysis struggles with it ( especially in cases where new is a custom function ) :)

Agreed. Take a look at the --new-method and --delete-method options on the man page for OOAnalyzer. There was also a discussion in this issue about how to permanently improve OOAnalyzer with respect to the missing new() and delete() detection issue.

mark a function as an allocator ('new') returning a specific size of memory.

There is some code for detecting allocation sizes right now (thisPtrAllocation), but it probably hasn't been tested as extensively as it should, and there's a known problem (with an unknown solution) detecting the data flow from size parameter into the new() call in some cases.

I've encountered many cases where ooanalyzer failed to recover the size of 'new' memory and produced thisPtrAllocation with Size==0, so it could be nice to allow us to iterate over all the locations with Size==0 in our IRE ( integrated reversing environment :P ) and set the values 'by-hand'

If you follow this procedure and edit the facts file between steps two and three, you should get a capability somewhat like you're suggesting.
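For the hand-editing step mentioned above, the hunt for Size==0 entries can be partly scripted. The following is only a minimal sketch under stated assumptions: it assumes the exported facts file holds one Prolog fact per line with flat, comma-separated arguments, and that the size is the last argument of `thisPtrAllocation` (check facts.P for the real argument order). The helper names `find_zero_size_allocations` and `set_size` are hypothetical and not part of Pharos.

```python
import re

# Assumption: one fact per line, flat arguments, size as the LAST argument, e.g.
#   thisPtrAllocation(0x401000, 0x400f00, sv_1, type_Heap, 0).
# The argument order shown here is illustrative; consult facts.P for the real layout.
FACT_RE = re.compile(r"^thisPtrAllocation\((?P<args>.*)\)\.\s*$")


def _split_args(line):
    """Return the argument list of a thisPtrAllocation fact line, or None."""
    match = FACT_RE.match(line.strip())
    if not match:
        return None
    return [arg.strip() for arg in match.group("args").split(",")]


def find_zero_size_allocations(lines):
    """Return (line_number, args) for each thisPtrAllocation fact whose last argument is 0."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        args = _split_args(line)
        if args is None:
            continue
        try:
            size = int(args[-1], 0)  # accepts decimal "0" or hexadecimal "0x0"
        except ValueError:
            continue  # last argument is not a plain number; leave it alone
        if size == 0:
            hits.append((lineno, args))
    return hits


def set_size(line, new_size):
    """Rewrite the last argument of a thisPtrAllocation fact line; other lines pass through."""
    args = _split_args(line)
    if args is None:
        return line
    # Written as hex here; adjust to whatever notation the facts file actually uses.
    args[-1] = hex(new_size)
    return "thisPtrAllocation(" + ", ".join(args) + ")."
```

A plugin could feed the addresses from `find_zero_size_allocations` to the disassembler so the user jumps straight to each allocation site, then write the corrected sizes back with `set_size` before the Prolog step runs.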

marking purecall if ooanalyzer failed to find one.

We never added an option to specify purecall like new and delete, but we probably should. In fact, there's a design for a complete rewrite of the method identification system that we just haven't been able to find the resources to implement. We're also working on some purecall detection heuristics right now.

augmenting the facts from a plugin. Given initial set of ooanalyzer facts, we start ghidra/ida and augment it. For example:

For a more complete description of the facts and their meanings (to help with hand editing) see [this file](https://github.com/cmu-sei/pharos/blob/master/share/prolog/oorules/facts.P).

mark a memory location as global class object

If the thisPtrAllocation() fact exporting did its job, OOAnalyzer supports global objects. Either way, you should be able to edit this fact to mark global objects and their sizes.

mark a function as returning an object pointer ( singleton pattern and friends )

This one's a little trickier than the usual edits, but you should be able to play some tricks with inventing a symbolic value (sv_xxx) and setting the function parameters.

COM object support? No idea if it's something interesting to do, but would be quite a challenge to get it working :)

This is the first topic you've raised where I can honestly say we've considered it, and then not seriously thought about much more. ;-)

providing known type layouts to augment processing? For example if someone was working on a specific target before trying ooanalyzer, they might've identified a few classes/structures already

You could "coach" OOAnalyzer by providing some trivially simple facts to build into a class, but at present, we don't really have a way to provide class definitions as inputs to OOAnalyzer. Interesting idea though.

full-on type recovery? :)

We conducted some experiments here, but obviously there's a lot of work involved. There were some "type rules" in the prolog directory that we recently removed because they didn't really work and they were confusing users. But if you're really interested you can go back a few commits and see what we did.

blazingly fast inference with user interaction ( although I have no idea how feasible is it to do in prolog)

Performance continues to be a significant challenge. It's improved dramatically since our first Prolog implementation and it might still improve some more. It's just hard. :-(

keep prolog server running with loaded facts and allow us to mark/pin function as belonging to a given class ( useful when method is mistakenly assigned to a parent class etc. ) and re-inference. I have no idea if the performance would be good enough for such an interactivity though.

Now that's an interesting new idea. It would obviously require the ability to start Prolog from within Ghidra, and leave it running (which we can do now from Pharos, but there's no GUI in Pharos). We'll definitely consider this suggestion in the design of any third-generation Ghidra plugin that runs OOAnalyzer inside Ghidra using facts from Ghidra. That's something we'd like to do, but it's also a big project.

nemerle commented 4 years ago

Agreed. Take a look at the --new-method and --delete-method options on the man page for OOAnalyzer. There was also a discussion in this issue about how to permanently improve OOAnalyzer with respect to the missing new() and delete() detection issue.

Yup, I've used that functionality.

If you follow this procedure and edit the facts file between steps two and three, you should get a capability somewhat like you're suggesting.

Done that as well. Setting sizes helped with the 'sanity' of the results, but the tediousness of looking up various assembly locations was the main reason I started wondering about a possible plugin to set up initial things for ooanalyzer :)

We never added an option to specify purecall like new and delete, but we probably should. In fact, there's a design for a complete rewrite of the method identification system that we just haven't been able to find the resources to implement. We're also working on some purecall detection heuristics right now.

For a more complete description of the facts and their meanings (to help with hand editing) see [this file](https://github.com/cmu-sei/pharos/blob/master/share/prolog/oorules/facts.P).

Yes, I've read that file quite a bit :)

mark a memory location as global class object

If the thisPtrAllocation() fact exporting did its job, OOAnalyzer supports global objects. Either way, you should be able to edit this fact to mark global objects and their sizes.

mark a function as returning an object pointer ( singleton pattern and friends )

This one's a little trickier than the usual edits, but you should be able to play some tricks with inventing a symbolic value (sv_xxx) and setting the function parameters.

I will check how it works. Would it help if the marked function's return value used the same sv_id as the next use of that pointer (basically doing dataflow analysis by hand)?

We conducted some experiments here, but obviously there's a lot of work involved. There were some "type rules" in the prolog directory that we recently removed because they didn't really work and they were confusing users. But if you're really interested you can go back a few commits and see what we did.

Performance continues to be a significant challenge. It's improved dramatically since our first Prolog implementation and it might still improve some more. It's just hard. :-(

I wonder if some kind of constraint solver could replace the prolog part?

sei-ccohen commented 4 years ago

So it sounds like you're fairly knowledgeable about our internals then. A significant project that we've discussed internally but haven't really started on due to resource constraints is to write a Java extension in Ghidra that extracts the same facts as the Pharos-based OOAnalyzer tool. If we could get that working, it would open up all kinds of interesting possibilities. The current Pharos analysis was written before Ghidra was released, and some of the facts are pretty trivial to generate. Others are more complicated, but I concluded that all of the required facts should be possible.

Let us know if you have more specific questions about the facts, rules, etc.

sei-eschwartz commented 4 years ago

I think the idea of making OOAnalyzer more interactive is very interesting. @sei-cfc may have thought about that before, but I hadn't.

Using type information from IDA/Ghidra is an interesting idea, both for custom-defined types and better support for library function detection and functions like operator new.

Allowing the user to correct bad guesses is also an interesting idea. I will have to think about this some more, but at the moment it would probably require rerunning prolog from the point of the bad guess.

I wonder if some kind of constraint solver could replace the prolog part?

Probably not. The constraint problem evolves as we learn more information about the program, which is not something that most constraint solvers can handle. We could use a fixed point solver, but at the expense of prioritizing which guesses we would prefer to be made first, which would probably severely harm accuracy.
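To make the trade-off above concrete, here is a toy sketch of interleaving fixed-point forward reasoning with guesses taken in priority order, backtracking when a guess leads to a contradiction. It is an illustration under assumptions, not OOAnalyzer's rules or implementation: the fact names, the `RULES` list, and the functions `saturate`, `consistent`, and `solve` are all hypothetical.

```python
def saturate(facts, rules):
    """Apply forward rules until no new fact appears (a fixed point)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(facts) - facts
            if new:
                facts |= new
                changed = True
    return facts


def consistent(facts):
    """Toy sanity check: a fact and its negation must not both hold."""
    return not any(("not", fact) in facts for fact in facts)


def solve(facts, rules, guesses):
    """Saturate, then try guesses in priority order, backtracking on contradiction."""
    facts = saturate(facts, rules)
    if not consistent(facts):
        return None
    for i, guess in enumerate(guesses):
        if guess in facts or ("not", guess) in facts:
            continue  # forward reasoning already decided this one
        # Prefer the guess itself; fall back to its negation if that fails.
        for choice in (guess, ("not", guess)):
            result = solve(facts | {choice}, rules, guesses[i + 1:])
            if result is not None:
                return result
        return None  # neither choice is consistent
    return facts


# Hypothetical toy rules; real rules would be far richer.
RULES = [
    lambda f: {"m1_in_class_A"} if "m1_is_constructor" in f else set(),
    lambda f: {("not", "m1_is_constructor")} if "m1_is_destructor" in f else set(),
]

if __name__ == "__main__":
    # The guess is skipped because reasoning has already ruled it out.
    print(solve({"m1_is_destructor"}, RULES, ["m1_is_constructor"]))
```

In this sketch, `saturate` alone plays the role of a plain fixed-point solver; the ordering of the `guesses` list is where the prioritization lives, and each accepted guess changes the set of facts that later reasoning sees.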

We have been working with the developer of SWI Prolog to develop new techniques to help us improve performance. If this pans out, it could improve our performance quite a bit.

0xBEEEF commented 4 years ago

So I think the ideas and views mentioned here are great! Now that I have played around with the tools extensively, I think they are great!

But I also agree with the suggestions to integrate the whole analysis process more into Ghidra. As a showcase project I can name the tool here. It offers a great class representation, extends the VTable resolution, and makes everything much more readable.

Wouldn't that be a beginning or a common node for such a project? After all, some hours, days, if not weeks of work have gone into this.

If you would add your knowledge here to complete the whole thing, you would really have the ultimate class tool for analyzing OOP programs.

But what will definitely be required in the future is the use of the metadata provided by Ghidra. Also the decompiler seems to have a lot of additional information, which would be good in such a process. Theoretically, it should be possible to refine the results by means of the intervention possibilities and direct integration.

Is there already anything new to the above mentioned plans? I would be really happy if they would start soon.

In any case, no matter what the future brings, my respect for the development and provision of this project. It must have taken a really long time to develop, and it commands my respect. High praise to all involved!

sei-eschwartz commented 3 years ago

I'm cleaning up our open issues. There are some great ideas here. But I'd rather see issues for specific features that we could then tag as enhancements.