awslabs / ar-go-tools

ar-go-tools (Argot) is a collection of analysis tools for Go
Apache License 2.0
5 stars 1 forks source link

Add field sensitivity to escape analysis #29

Closed amzn-jasonrk closed 9 months ago

amzn-jasonrk commented 9 months ago

This PR adds field sensitivity to the escape analysis, which goes a significant way towards preventing a blow-up of the graphs while analyzing more complicated programs. It also substantially improves precision, as it also gives us tuple- and bound variable sensitivity.

The technique works by allowing nodes to have "subnodes." These are nodes which are tightly linked to the parent node, and are used to represent (for now) fields and field-like features such as tuple slots. For example, if we have p := &S{a: &A{}, b: &B{}}, then we will have the following nodes:

p ---> new S
        | |
        | a ---> new A
        |
        b ---> new B

Here, a and b are subnodes of new S that represent the two fields of the struct. By making these nodes, we allow them to have both out-edges and in-edges, representing the field pointing at some other object, or some other pointer pointing at the field. This naturally allows some complex object graphs, such as this example from the standard library:

pp 
| |
| fmt
|   |
|   buf ----+
|           |
buf  <------+

Here, the pp object has a fmt field, which has a pointer to buffer field that is initialized to point at the buf field of the pp object itself so that methods on both append to the same buffer. This graph is natural to represent with subnodes, as the buf field of pp can be a target of the edge from the buf field of fmt. An earlier approach attempted to add labels to both ends of edges, but this ended up requiring a bunch of special cases. The representation is significantly clearer with subnodes.

The distinction between subnodes and just a normal edge between nodes is that a normal edge represents a pointer relationship, where the nodes do not need to be part of the same allocation, and more edges can be added through assignment. In contrast, a subnode relationship is created when the subnode is created and does not change. If the subnode would be created more than once, i.e. by accessing the same field multiple times in a method, the same subnode is returned each time.

More abstractly, nodes represent a set of locations in memory, such as all structs allocated on a particular line or all the objects that are the third argument to a function, etc. Crucially, we can view a struct either as an atomic value (the parent node), or as a collection of fields, which are locations in their own right. Subnodes are thus a systematic way to name a subset of the locations of some other node. To represent fields which are themselves structs, subnodes can be chained, so that the parent node is itself a subnode.

Subnodes allow us to more precisely represent the out-edges (pointer values) of an object by localizing those out-edges to come from particular fields, and more precisely represent in-edges by allowing those edges to point at a particular field. This increased precision allows us to typecheck the resulting graph using the static types of parameters, local values, and struct fields. Previously, if we took the address of a field of type int, as in t1 = &t0.f, then t1 would have type *int, but would be pointing at a struct, as there was no way to represent which field t1 is pointing at. Now, we create a subnode f of the pointee of t0, and t1 points at the subnode. Both t1 and f can be labeled with types, and we can check the consistency of these type labels to detect places where our analysis may have errors. This has resulted in discovering and fixing several bugs during development. Currently, only pointer values are typechecked, but this can be extended to most other types.

In addition to a few additional tests, this PR also makes the TaintEscapeIntegration test faster by a significant margin. Before, it would take at least 5 minutes on my machine, but now it takes ~70sec. Part of the remaining time is due to redundant summaries that could be pruned, and unnecessary iterations in the intra-function convergence loop, but these are future work.

The major remaining TODOs are: