NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0
50.85k stars 5.8k forks source link

function.getParameters() not displaying accurate results? #2085

Closed TylerKann closed 4 years ago

TylerKann commented 4 years ago

sometimes getParameters() kind of does a bad job. I have a struct that I want it to recognize, but it doesnt recognize the struct. It doesnt really recognize strings either. It just says they are undefined8. Is there anyway to improve the fidelity of this?

astrelsky commented 4 years ago

Function.getParameters() returns the functions parameters as they have been defined. If you know what the parameters should be you need to set them as such.

TylerKann commented 4 years ago

struct str_t { char *buf; int len; }; unsigned int f(struct str_t *str) { unsigned int hash = 5381; unsigned int i = 0; for (i = 0; i < str->len; i++) { hash = ((hash << 5) + hash) + str->buf[i]; } return hash; } I dont fully understand your comment. Here i have a simple structure with a pointer to it. In a perfect world it would return struct str_t, but that wont happen. Even if it returned some type of char* that would be good. But it returns long ? I have no clue why this would happen?

astrelsky commented 4 years ago

Well no, it should return an unsigned int as per the function signature of f.

If you have an undefined function like below: original

You must change the signature to match what it should be: new

In the root of your ghidra installation there is a folder named docs. From there is another folder name GhidraClass. I recommend going through its contents.

TylerKann commented 4 years ago

Okay thanks, Ill check that out! The goal is doing this automatically (headless), so not too keen on individually modifying the functions. Why would it return unsigned int? That is what the function is returning not the input

astrelsky commented 4 years ago

Okay thanks, Ill check that out! The goal is doing this automatically (headless), so not too keen on individually modifying the functions. Why would it return unsigned int? That is what the function is returning not the input

Unfortunately at some point you're going to have to do some things manually if you need that level of detail.

When you said returning I thought you meant the functions return type instead of the parameter's type. Doing as shown above would make it the proper struct str_t * assuming you have created the struct str_t data type.

You can set the parameters in headless mode but it is not as convenient. It is likely easiest to create a FunctionDefinitionDataType and apply it to a function of your choice.

FunctionDefinitionDataType ApplyFunctionSignatureCmd

You can also use ApplyFunctionDataTypesCmd to apply the signatures in the provided datatype archives.

If you would like to get a lot out of this you could get the dbg libraries and create a datatype archive of all the function definitions and datatypes and then use the standard libraries to create a FunctionID database file (fidb). Then via healess analysis you could detect a ton of known functions which would get labeled and then apply the correct signatures/datatypes with the ApplyFunctionDataTypesCmd. This is just a thought though. It may or may not work out.

TylerKann commented 4 years ago
int main(int argc, char *argv[]) {
    struct str_t str;
    str.buf = argv[1];
    str.len = strlen(str.buf);

I have created the structure, yet it does not show that

astrelsky commented 4 years ago
int main(int argc, char *argv[]) {
    struct str_t str;
    str.buf = argv[1];
    str.len = strlen(str.buf);

I have created the structure, yet it does not show that

Please consider providing more information.

TylerKann commented 4 years ago
#include <stdlib.h>
#include <string.h>
struct str_t {
    char *buf;
    int len;
};

unsigned int f(struct str_t str) {
   unsigned int hash = 5381;
   unsigned int i    = 0;
   for (i = 0; i < str.len; i++)
   {
      hash = ((hash << 5) + hash) + str.buf[i];
   }
   return hash;
} 

int main(int argc, char *argv[]) {
    struct str_t str;
    str.buf = argv[1];
    str.len = strlen(str.buf);
    printf("%d \n", f(str));
    return EXIT_SUCCESS;
}

I get the function in python by doing a simple looping checking if function.getFunctionAfter(function) is ever the name of my function (hash). Once I find it, I "prime" my function by decompiling it. This is done by getting the HighFunction

    d = interface.decompileFunction(f, 5, None) 
    high = d.getHighFunction()

Once I have the HighFunction, i commitParamsToDB :

    y = ghidra.program.model.pcode.HighFunctionDBUtil
    y.commitParamsToDatabase
    y.commitReturnToDatabase
    y.commitParamsToDatabase(high, True, ghidra.program.model.symbol.SourceType.USER_DEFINED)
    y.commitReturnToDatabase(high, ghidra.program.model.symbol.SourceType.USER_DEFINED)

after this i should just be able to get f.getParameters(). It works like 70% of the time. But some cases I get bad results

astrelsky commented 4 years ago

Oh, you are using the high function for parameters.

I recommend avoiding directly committing to the database.

Here is the approach I recommend.

Results may vary

The above steps should yield reasonable results as the parameter types of known functions will have propagated through to data and then to non known function variables and parameters through references to the data and to the known function calls. There should be a method in the FlatProgramAPI to get a function by its name and namespace. You should use that method. If you don't need the decompiled function then you don't need to decompile it. You can check the function parameters and signature directly through the "low" Function class. Its parameters and return type will be identical to the high function one the parameters and return type have been committed to the database (done by the parameter id analyzer)

When you directly commit the parameters and return type to the database you are introducing a larger margin of error as the undefined datatypes will be propagated and the parameter id analyzer will skip over the function during analysis since everything is already committed.

I understand that your end goal is to do everything by headless analysis but you should still use the gui interface. It will get you familiar with how ghidra works as well as what analyzers and scripts are available and their options.

If you are uncertain how to do something and can't find the answer from a Google search you can send me an email before opening an issue if you'd like. I have it on my github profile publicly viewable. The main thing I'm not familiar with is headless analysis as I've never had the need for it. I don't do much in batches and it is difficult to truly follow data and function flow without looking at it.

TylerKann commented 4 years ago

Wow. This is a lot of information. Thanks a lot. I'll definitely take your advice and do it in the gui so I can get a feel for it. However, in the end, there shouldn't be anything preventing me from doing it headless right?

astrelsky commented 4 years ago

Wow. This is a lot of information. Thanks a lot. I'll definitely take your advice and do it in the gui so I can get a feel for it. However, in the end, there shouldn't be anything preventing me from doing it headless right?

There shouldn't be.

TylerKann commented 4 years ago

Thanks @astrelsky . You are quite the ghidra beast

TylerKann commented 4 years ago

@astrelsky Sorry, I haven't had the chance to get look at this until now. However,I am not having great results. Still using the aforementioned code, it is only returning a long. Here is what I did (note: i did the python part in ghidra_bridge) :

  1. Analysis with the stuff disabled that you said to have disabled
  2. Find the function 'f', and then using f create an FunctionDefinitionDataTypes object.
  3. Reanalyze with Data Reference (I dont have constant reference)
  4. Reanalyze with Call Convention Identification, Decompiler Parameter ID, Decompiler Switch Analysis, Shared Return Calls, Subroutine References.

After all that it just returned a long, not the struct I created

astrelsky commented 4 years ago

You are creating a FunctionDefinitionDataType from a function with the wrong signature. The definitions should be used for library functions and functions whose signature is known in advance. You need to make them ahead of time.

There is no way for ghidra to know about your struct unless it has been applied somewhere and determined to be a parameter/return type through type propagation.

TylerKann commented 4 years ago

You are creating a FunctionDefinitionDataType from a function with the wrong signature. The definitions should be used for library functions and functions whose signature is known in advance. You need to make them ahead of time.

So if I had this binary given to me out of thin air there would be no way to know? I guess I dont fully understand what you are saying, sorry.

astrelsky commented 4 years ago

You are creating a FunctionDefinitionDataType from a function with the wrong signature. The definitions should be used for library functions and functions whose signature is known in advance. You need to make them ahead of time.

So if I had this binary given to me out of thin air there would be no way to know? I guess I dont fully understand what you are saying, sorry.

Without any sort of debug information there would be no way to know.

TylerKann commented 4 years ago

@astrelsky would there be anything outside of ghidra that would work ? It sucks that it works like 70% but not perfectly

astrelsky commented 4 years ago

@astrelsky would there be anything outside of ghidra that would work ? It sucks that it works like 70% but not perfectly

No. You cannot expect anything to be able to pull 100% accurate information out of thin air.

There may be tools that could determine it to be a struct but it would be impossible for it's name and members names to be correct.

dragonmacher commented 4 years ago

@astrelsky has answered this thoroughly.

jahatfi commented 4 years ago

Reopening this issue, as I'm pretty sure there is in fact a bug (or is it a poor feature?) in getParameters(). This function returns false negatives (doesn't not detect actual parameters) when parameters are not copied onto the stack. For example, for something like this code:

#include<stdio.h>
void print(int a){
   printf("My number: %d\n", a);
}

void print_proxy(int b){
   print(b);
}

int main(void){
   int x = 5;
   print_proxy(x);
   return 0;
}

Ghidra's getParameters() seems to think that print_proxy() takes no parameters, because b is never copied onto the stack. Obviously my trivial example above is almost too simple to be of concern, but I have encountered real-world functions that never copy (some) parameters to the stack. I think this is a bug, not a feature. I got around this problem in a script by getting the decompiled code (which seems to be correct, even in this case), and just parsing the parameters out of the decompiled code string. Thoughts? Would you like me to take a crack at patching it?