Replace the data dependency by a context sensitive analysis

montyly commented 1 year ago

Right now the data dependency is context insensitive, which creates a large over approximation.

For example in

contract A{

    uint a;
    uint b;

    function f(uint x) internal returns(uint){
        return x;
    }

    function test1(uint paramA) public{
        a = f(paramA);
    }

    function test2(uint paramB) public{
        b = f(paramB);
    }

}

Slither will merge all the deps related to the call to f(x) when looking at the contract context. As a result, a dependency between a and paramB(or b and paramA) will be created, because the the analysis will merge all the callers of f :

$ slither test.sol --print data-dependency

Contract A
+----------+---------------------------+
| Variable |        Dependencies       |
+----------+---------------------------+
|    a     | ['paramA', 'paramB', 'x'] |
|    b     | ['paramA', 'paramB', 'x'] |
+----------+---------------------------+

Having a context-sensitive analysis will lead to bette results. This is also a recurring issues with top-level functions - which tend to be called from a lot of different contexts.

Moving toward a context sensitive analysis will have an impact on the performance. We could propose the two options (sensitive/insensitive), and allow the user to enable one or the other.

Additionally we should take the opportunity to refactor the data dependency to better support the switch between the context and the different source type: https://github.com/crytic/slither/blob/26659c4e0555c20eca037945aaec76b7639b00a7/slither/analyses/data_dependency/data_dependency.py#L47-L50

https://github.com/crytic/slither/blob/26659c4e0555c20eca037945aaec76b7639b00a7/slither/analyses/data_dependency/data_dependency.py#L62-L63

priyankabose commented 7 months ago

@montyly @0xalpharush I am planning to start implement this context sensitivity for data dependency. How should I proceed with that? Any specific process to follow here?

0xalpharush commented 7 months ago

I would implement it as a separate analysis and API for now, and we can decide whether to altogether replace the existing analysis or allow the detector author to choose one or another.

The first step would be to create an API for interprocedural control flow graph (CFG) as we currently only have an intraprocedural CFG. Some of the call-graph code may inspire the ICFG, but it does not correctly hand free functions currently (see https://github.com/crytic/slither/pull/1763).

We should create an edge in for each call-site and return aka "call and return linkages" in the CFG. For return linkages, we'd need to create a virtual exit in the CFG similar to how we have a virtual entry point. All nodes should have a path to the exit node. For example, we'd have a the following ICFG for the example in the issue: graphviz (2)

Right now, the taints for InternalCall are created by adding a entry in the Phi node with the arguments passed at each call-site here: https://github.com/crytic/slither/blob/13d7d9f66a6be4f798478fa3735fb63444b46c3d/slither/core/compilation_unit.py#L197-L207 This means f's value of x is either paramA or paramB and the taint propagates both i.e. the union.

To make this context sensitive, instead we would have the data dependency analysis lookup what are the assignments to the value of x in the calling context. That is, calls to f in test1's context should only consider paramA, f in test2's context should only consider paramB. This should also be reflected in the return value. To implement this, we can use the "call strings approach" to inter-procedural analysis. In practice, this would mean we keep in the data dependency context of f a second dictionary with test1 and test2 as keys. The key is a stack of yet-to-be-returned function calls i.e. f is invoked before test1 and test2 have completed executing. TBD what depth we should store for call strings. For params, we'd push the caller on to the stack and initialize the parameters in the entry block to the call value and expect the following dependency:

f = Function(f)
f.context["x"] ["test1"]= LocalVariable("paramA") 
f.context["x"] ["test2"] = LocalVariable("paramB")

For returns, we'd create a fresh variable, result_uuid, and change the return value into an assignment to this unique result variable. Returning would pop from the stack and for external functions the stack would be E, for empty. We'd expect the following dependency:

test1 = Function(test1)
test1.context["a"] ["E"] = LocalVariable("paramA") # transitive through result_test1

test2 = Function(test2)
test2.context["a"] ["E"] = LocalVariable("paramB")  # transitive through result_test2

We would need to compute the transitive closure for each calling context individually. The API is_dependent would need to accept another argument to allow detector authors to specify which call-site(s) to consider i.e. "test1" or "test2".

These lectures notes may prove useful and less abstract than the aforementioned paper: https://cs.au.dk/~amoeller/spa/7-interprocedural-analysis.pdf

priyankabose commented 7 months ago

@0xalpharush Thanks a lot for the detailed explanation. I think I get the essence of the idea. I will still go over the "call strings approach" and the corresponding lecture notes to jot down the concrete steps.

crytic / slither

Replace the data dependency by a context sensitive analysis #1742