cisco-open / llvm-crash-analyzer

llvm crash analysis
Apache License 2.0

Taint Analysis of functions out of the backtrace #36

Open niktesic opened 1 year ago

niktesic commented 1 year ago

Currently, Taint Analysis is not performed for functions out of the backtrace unless one of two conditions is met (from TaintAnalysis::shouldAnalyzeCall):

  1. The return value of the function is in the Taint List (the return value register is in the T.L., or it is the base register of a memory location from the T.L.)
  2. A global variable is in the Taint List (a Taint Info entry with no register operand, but with an offset, is in the T.L.)

Both conditions need to be revisited to match real-case scenarios, and condition 2 would benefit from future support for global variable tracking.
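The two conditions can be sketched roughly as follows. This is only an illustration of the decision described above, not the actual body of TaintAnalysis::shouldAnalyzeCall; the TaintInfo struct, its fields, and the helper names are hypothetical simplifications.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified Taint List entry (not the tool's real type).
struct TaintInfo {
  std::string Reg; // register operand or base register; empty if none
  bool HasOffset;  // true when the entry describes a memory location
  long Offset;     // offset of the memory location, if HasOffset
};

// Condition 1: the return value register is in the Taint List, either
// directly or as the base register of a tainted memory location.
bool returnValueIsTainted(const std::vector<TaintInfo> &TL,
                          const std::string &RetReg) {
  for (const TaintInfo &TI : TL)
    if (!TI.Reg.empty() && TI.Reg == RetReg)
      return true;
  return false;
}

// Condition 2: an entry with an offset but no register operand,
// which is how a global variable is assumed to appear in this sketch.
bool globalIsTainted(const std::vector<TaintInfo> &TL) {
  for (const TaintInfo &TI : TL)
    if (TI.Reg.empty() && TI.HasOffset)
      return true;
  return false;
}

// A call out of the backtrace is analyzed only if one condition holds.
bool shouldAnalyzeCallSketch(const std::vector<TaintInfo> &TL,
                             const std::string &RetReg) {
  return returnValueIsTainted(TL, RetReg) || globalIsTainted(TL);
}
```

Note that a Taint List containing only a tainted stack slot or an unrelated register makes the sketch return false, which is exactly the pointer-parameter case the next paragraph describes.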

On the other hand, in many real cases a parameter is passed by reference (as a pointer) and its value is set in a function out of the backtrace, but we have no mechanism to detect such cases and to perform Taint Analysis on such functions.

With the patch below, we can run Taint Analysis on every function out of the backtrace by passing the -analyze-each-call argument. This is useful during investigation, to inspect how Taint Analysis behaves on such functions, but in real cases it could cause the analysis to explode. Patch: analyze-each-call.patch

Please, consider the following test case:

void set(int* adr) {
  *adr = 5; // crash line
}
void init(int** p) {
  *p = 0; // correct blame line
}
int main() {
  int* ptr = 0; // incorrect blame line
  init(&ptr);
  set(ptr);
  return 0;
}

Although the -analyze-each-call argument is used and we are able to analyze init(), the function responsible for setting the incorrect pointer value, the tool is not able to find the correct blame line. The main reason is that register values are not available for frames out of the backtrace, so we cannot rely on concrete memory addresses. As a result, functions out of the backtrace are analyzed at the symbolic level, where we need to match exact registers and offsets.
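To illustrate the difference, here is a minimal sketch comparing the two ways of matching memory locations. The MemLoc type, the register value map, and both function names are hypothetical; the register values in the usage note are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical memory location of the form [Base + Offset].
struct MemLoc {
  std::string Base;
  std::int64_t Offset;
};

// Register values, as they would be known for a frame in the backtrace.
typedef std::map<std::string, std::uint64_t> RegValues;

// Concrete matching: resolve both locations to addresses and compare.
// Only possible when the values of both base registers are known.
bool sameLocationConcrete(const MemLoc &A, const MemLoc &B,
                          const RegValues &Regs) {
  RegValues::const_iterator AIt = Regs.find(A.Base);
  RegValues::const_iterator BIt = Regs.find(B.Base);
  if (AIt == Regs.end() || BIt == Regs.end())
    return false; // a register value is missing, cannot resolve
  return AIt->second + static_cast<std::uint64_t>(A.Offset) ==
         BIt->second + static_cast<std::uint64_t>(B.Offset);
}

// Symbolic matching: without register values, two locations match
// only if they use exactly the same base register and offset.
bool sameLocationSymbolic(const MemLoc &A, const MemLoc &B) {
  return A.Base == B.Base && A.Offset == B.Offset;
}
```

For example, with rax = 0x1000 and rbp = 0x1008, the locations [rax + 0] and [rbp - 8] match concretely but not symbolically. This is the kind of alias that is lost for frames out of the backtrace unless register equivalences are tracked.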

To sum up, there are two mechanisms which need to be developed:

  1. A mechanism to determine when a parameter is a reference to a tainted location, in which case we should analyze the call
  2. A mechanism to efficiently analyze functions out of the backtrace without available register values (using memory to find the needed values and improving symbolic-level analysis)

niktesic commented 1 year ago

Some updates from the current Proof-of-concept investigation.

There are several different scenarios to take into consideration:

  1. parameters passed via registers vs parameters passed via the stack
  2. one level of call vs multiple levels of nested calls (out of the backtrace)
  3. the blame line is inside the call (out of the backtrace) vs the blame line is not inside the call, but the flow of the tainted value goes through it

I've made some progress with the basic test case, which corresponds to a register-passed parameter, one level of nesting, and the blame line inside the call.

About developed mechanisms:

  1. Added support for getting the parameter forwarding register set from CrashAnalyzer TargetInfo
  2. Improved RegisterEquivalence to handle LEA instructions (including changes in TargetInstrInfo)
  3. Implemented detection of whether any parameter forwarding register is tainted (including equivalent locations), together with a check that the dereference level is less than 0 (which should imply that the parameter is a reference to the tainted location)
  4. Implemented ForwardTaintAnalysis for functions with a tainted parameter reference (it can detect a constant being stored through the tainted parameter, as in the init function in the test case)
  5. Implemented detection of the instruction which loaded the parameter into the register
    • some additional fixes in TaintDFG management
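The check from item 3 can be sketched as below. This is a simplified illustration, not the code from the proof of concept: the TaintedReg struct, the Equivalence map, and the function names are hypothetical, and the dereference level is taken as a given field rather than computed. The register list is the x86-64 System V integer argument order, assumed here as the target.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical tainted register entry with its dereference level.
struct TaintedReg {
  std::string Reg;
  int DerefLevel; // < 0 is taken to mean "address of the tainted location"
};

// Hypothetical register equivalence: register -> registers known to hold
// the same value (for example after an LEA followed by a MOV).
typedef std::map<std::string, std::set<std::string> > Equivalence;

// Integer parameter registers of the x86-64 System V calling convention.
std::vector<std::string> paramRegs() {
  const char *Names[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
  return std::vector<std::string>(Names, Names + 6);
}

// True if Param is the tainted register itself or is equivalent to it.
bool matchesTainted(const std::string &Param, const std::string &Tainted,
                    const Equivalence &EQ) {
  if (Param == Tainted)
    return true;
  Equivalence::const_iterator It = EQ.find(Param);
  return It != EQ.end() && It->second.count(Tainted) > 0;
}

// Returns the parameter registers that reference a tainted location.
std::vector<std::string>
taintedReferenceParams(const std::vector<TaintedReg> &TL,
                       const Equivalence &EQ) {
  std::vector<std::string> Result;
  std::vector<std::string> Params = paramRegs();
  for (std::size_t P = 0; P < Params.size(); ++P)
    for (std::size_t T = 0; T < TL.size(); ++T)
      if (TL[T].DerefLevel < 0 && matchesTainted(Params[P], TL[T].Reg, EQ)) {
        Result.push_back(Params[P]);
        break;
      }
  return Result;
}
```

In the init(&ptr) call from the first test case, the address of ptr would typically be computed with an LEA and then moved into the first parameter register. With that equivalence recorded, the sketch reports rdi as a reference to the tainted location.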

With those changes the basic case is covered. The resulting TaintDFG is shown in the attached image (dfg3).

niktesic commented 1 year ago

The second case to consider is when one parameter is used to set the value of the other, like in the test below (function fun):


void crash(int val, int* adr){
    *adr = val; // crash - line 3
}

void fun(int** ptr, int* adr)
{
    *ptr = adr; // wrong blame - line 8
}

int main(){
  int *p = 0; // wrong blame - line 12
  int *adr = (int *)3; // correct blame - line 13
  fun(&p, adr);
  crash(1, p);
  return 0;
}

In this case we can use the ForwardTaintAnalysis to track the parameter to the "set point" (*ptr = adr;), which corresponds to the following MIR instruction:

MOV64mr $rax, 1, $noreg, 0, $noreg, $rcx, debug-location !DILocation(line: 8

From this point we can perform the existing backward Taint Analysis to track the value of the parameter adr (from $rcx). The resulting TaintDFG is shown in the attached image (dfg3).

However, the TaintDataFlowGraph analysis currently fails to find the correct blame node in the graph; this is yet to be investigated.
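The forward-then-backward tracking described for fun can be sketched on a toy instruction list. This is a simplified model, not the tool's ForwardTaintAnalysis: the Inst struct, the two register copies before the store, and all function names are assumptions made for illustration. Only the final store mirrors the MOV64mr instruction quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy instruction model (hypothetical, far simpler than MIR).
// Copy:  Dst = Src          Store: [Dst] = Src
enum Kind { Copy, Store };
struct Inst {
  Kind K;
  std::string Dst;
  std::string Src;
};

// Forward step: follow the tainted parameter register through copies
// until a store through it is found. Returns the index or -1.
int findSetPoint(const std::vector<Inst> &F, const std::string &ParamReg) {
  std::set<std::string> Holds; // registers currently holding the parameter
  Holds.insert(ParamReg);
  for (std::size_t I = 0; I < F.size(); ++I) {
    if (F[I].K == Store && Holds.count(F[I].Dst))
      return static_cast<int>(I);
    if (F[I].K == Copy) {
      if (Holds.count(F[I].Src))
        Holds.insert(F[I].Dst);
      else
        Holds.erase(F[I].Dst); // register overwritten with another value
    }
  }
  return -1;
}

// Backward step: starting at the set point, follow the stored value
// back through copies to the register it originally came from.
std::string traceStoredValue(const std::vector<Inst> &F, int SetPoint) {
  std::string Tracked = F[SetPoint].Src;
  for (int I = SetPoint - 1; I >= 0; --I)
    if (F[I].K == Copy && F[I].Dst == Tracked)
      Tracked = F[I].Src;
  return Tracked;
}

// Assumed body of fun(int **ptr, int *adr): ptr in rdi, adr in rsi.
std::vector<Inst> funBody() {
  std::vector<Inst> F;
  F.push_back(Inst{Copy, "rax", "rdi"});  // rax = ptr (assumed copy)
  F.push_back(Inst{Copy, "rcx", "rsi"});  // rcx = adr (assumed copy)
  F.push_back(Inst{Store, "rax", "rcx"}); // *ptr = adr, the set point
  return F;
}
```

On this toy body, the forward step starting from rdi stops at the store, and the backward step from there leads to rsi, the register that carries adr into fun under the assumed calling convention.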