Open am009 opened 11 months ago
Hi @am009, getting types for local variables is possible but it would require some changes in Retypd. This is actually something that we have considered implementing, but haven't had time to do it yet.
It could be done as a post-processing step once the inter-procedural propagation has finished. Once we have types for function arguments and return values, you could propagate internally and get types for local variables. However, we probably don't want to do that for all variables in the program eagerly, but rather allow querying specific procedures (otherwise it would be potentially very costly).
If this is something you'd like to contribute, I'd be happy to provide more guidance. Otherwise, we will probably end up doing it ourselves, but I am not sure when.
Hi @aeflores, I urgently need this feature and I am really happy to contribute to retypd. I have been reading the code these days. The following are my understandings.
I think there are two important entries of the algorithm, one is the __call__
of Solver
, and the other is the __call__
of CTypeGenerator
. Solver
solves constraints and the result is saved in Sketch
. CTypeGenerator
is responsible for recursively converting each Sketch
into a primitive type, array type, struct type, or pointer type to them (in CTypeGenerator.c_type_from_nodeset
).
I guess, after the __call__
of Solver
is invoked, the Solver
should provide a new API to allow the query of the Sketch
es for all variables (including local variables)?
To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.
By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?
Hi @am009
I guess, after the call of Solver is invoked, the Solver should provide a new API to allow the query of the Sketches for all variables (including local variables)?
That's right.
To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.
I think you don't have to get into those details to get this working. A good place to look at is https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L808-L847
The basic steps for "solving" a procedure (or a scc with mutually recursive procedures) are the following:
infer_shapes
to generate the "skeleton" of the sketches for whatever variables you care about https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L830 generate_primitive_constraints
(https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L836) and add_constraints
(https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L845) to populate the sketches nodes with primitive type information.So let's say you want to get a sketch for a set of variables vars
in a procedure proc
, you would do something like this:
def get_type_of_variables(self,
sketches_map: Dict[DerivedTypeVariable, Sketch],
type_schemes: Dict[DerivedTypeVariable, ConstraintSet],
proc:DerivedTypeVariable,
vars:Set[DerivedTypeVariable]):
constraints = self.program.proc_constraints.get(
proc, ConstraintSet()
)
callees = set(networkx.DiGraph(self.program.callgraph).successors(proc))
fresh_var_factory = FreshVarFactory()
constraints |= Solver.instantiate_calls(
callees,
constraints,
sketches_map,
type_schemes,
fresh_var_factory,
)
constraints |= sketches_map[proc].instantiate_sketch(
proc, fresh_var_factory
)
var_sketches = self.infer_shapes(
vars,
self.program.types,
constraints,
)
for var in vars:
primitive_constraints = self._generate_primitive_constraints(
constraints,
frozenset({var}),
self.program.types.internal_types)
var_sketches[var].add_constraints(
primitive_constraints
)
return var_sketches
NOTE I haven't actually run the code, it probably won't run as is, but hopefully it gives you an idea of what needs to happen.
In your example, you would call it with e.g. proc = "memset"
and (for example) vars = {"v_1135","v_1132"}
(it would actually be vars= { DerivedTypeVariable(var) for var in ["v_1135","v_1132"]}
and same for proc). sketches_map
and type_schemes
would be the ones returned by the call method. Once you have sketches, you can obtain the corresponding C types. Hopefully this is enough to get you started.
This code recomputes a lot of stuff, doing that more efficiently would probably require more extensive changes.
By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?
I think we tried running it with PyPy at some point, and it worked... or it was close to working, so yes that is definitely an option. There are also plenty of optimization opportunities in this code (cacheing some of the graphs, better data structures, etc).
@am009 please note that we require signing a "Contributor License Agreement" https://github.com/GrammaTech/retypd/blob/master/GrammaTech-CLA-retypd.pdf before reviewing and accepting contributions into the official repo (the CLA should be sent to CLA@grammatech.com).
Sorry for the late reply. Busy with other random stuff these days. Now I finally settled down and started to learn something.
Really grateful for the detailed response, and it works. I created a PR for it: https://github.com/GrammaTech/retypd/pull/8 ( I sent a mail with a signed CLA on December 6, 2023.)
Actually, I'm very interested in binary type recovery and trying to do some research in this area, so I would like to learn more about the internals of retypd. I still find the retypd paper difficult to read. I guess I need to learn more about pushdown systems.
I still would like to know, what kind of pushdown systems is it related to? Do you have any recommended learning resources? (or any getting started introduction or tutorial about push-down systems) There is a reference to "Saturation algorithms for model checking pushdown systems." in the retypd paper. Probably it is related? Thanks in advance.
Hi @am009 , thanks for the PR, I'll try to review it soon.
I still find the retypd paper difficult to read
Me too! @peteraldous and I were planning to write another document/paper revisiting the algorithm, but we never finished it.
I still would like to know, what kind of pushdown systems is it related to?
Despite the name, pushdown systems are not directly used in the implementation explicitly. If you are not doing that yet, I'd recommend focusing on the Arxiv version https://arxiv.org/pdf/1603.05495. Most of the implementation details are in the appendices.
In very rough terms, we have a type system, defined in Figure 3 of the paper, and we need to compute all the "implications" of our initial set of constraints (all the implied type constraints). We can do that by building a transducer that captures all possible type derivations. Appendix D describes how to build that transducer from an initial set of constraints. The construction of the transducer has several steps, and the pushdown system is basically an intermediate step. Again in very rough terms, Appendix D does the following steps:
Constraint set -> Pushdown system -> Transducer
In order to go from pushdown system to transducer, we:
Constraint set -> Pushdown system ---graph+saturation+shadowing--> Tranducer
This implementation just goes from the constraint set to the graph directly. Saturation and shadowing in the implementation can be found here respectively:
I sent a mail with a signed CLA on December 6, 2023.
Hi @am009 , It looks like we have not received your email. Would you mind sending it again?
Hi @am009 , It looks like we have not received your email. Would you mind sending it again?
OK, I have sent it again.
Finally, I can understand most of the algorithm now, after reading "Saturation Algorithms for Model-checking Pushdown Systems" and "Advanced Topics in Types and Programming Languages" Chapter 10. Thanks for your previous explanations. The code is also very helpful. Now I am trying to reimplement the algorithm in Rust to have a better understanding of the algorithm.
I found a potential issue related to the algorithm, but it can also be a misunderstanding:
There is an InferShapes
function call during the simplification of constraints (InferProcTypes
). I think the inferring of the sketches should be done in the next stage in the function InferTypes
(while getting the real types), not here (during simplifying constraints).
Also, there are two arguments for InferShapes
, the second argument is the main output map that maps variables to sketches. But here, it passes an empty set to it, which means the result is not used at all? So, I think the InferShapes
call in the InferProcTypes
is purely to apply the additive constraints. If there are no additive constraints, then the InferShapes
call can be omitted. So the above call to InferShapes
is not necessary?
I'm trying to write a decompiler using retypd. If I understand those python source code correctly, existing frontends (retypd-ghidra-plugin and gtirb-ddisasm-retypd) only generate function types (arguments and return value) for each function.
I also generated type constrains for local variables and registers. Is it possible to get the types of these variables (these
v_xxx
in the following example contrains)?Thanks in advance. Retypd is really powerful in type recovery. I'm quite grateful that it is open source.