GrammaTech / retypd

GNU General Public License v3.0
69 stars 5 forks source link

How to get types for typevars in each function (like `v_7693`)? #7

Open am009 opened 11 months ago

am009 commented 11 months ago

I'm trying to write a decompiler using retypd. If I understand those python source code correctly, existing frontends (retypd-ghidra-plugin and gtirb-ddisasm-retypd) only generate function types (arguments and return value) for each function.

I also generated type constrains for local variables and registers. Is it possible to get the types of these variables (these v_xxx in the following example contrains)?

"memset": [
  "memset.in_0 <= v_1131",
  "v_5 <= memset.in_2",
  "bool <= v_1132",
  "memset.in_0 <= memset.out",
  "v_1135 <= v_1134.store.σ1@0*[nobound]",
  "v_1135 <= memset.in_0.store.σ1@0",
  "v_236 <= memset.in_2",
  "bool <= v_1136",
  "v_1135 <= v_1134.store.σ1@0*[nullterm]",
  "v_1135 <= memset.in_0.store.σ1@1",
  "v_1135 <= v_1134.store.σ1@-3",
  "v_1135 <= memset.in_0.store.σ1@2",
  "v_416 <= memset.in_2",
  "bool <= v_1137",
  "v_1135 <= v_1134.store.σ1@-4",
  "v_1135 <= memset.in_0.store.σ1@3",
  "v_161 <= memset.in_2",
  "bool <= v_1138",

Thanks in advance. Retypd is really powerful in type recovery. I'm quite grateful that it is open source.

aeflores commented 11 months ago

Hi @am009, getting types for local variables is possible but it would require some changes in Retypd. This is actually something that we have considered implementing, but haven't had time to do it yet.

It could be done as a post-processing step once the inter-procedural propagation has finished. Once we have types for function arguments and return values, you could propagate internally and get types for local variables. However, we probably don't want to do that for all variables in the program eagerly, but rather allow querying specific procedures (otherwise it would be potentially very costly).

If this is something you'd like to contribute, I'd be happy to provide more guidance. Otherwise, we will probably end up doing it ourselves, but I am not sure when.

am009 commented 11 months ago

Hi @aeflores, I urgently need this feature and I am really happy to contribute to retypd. I have been reading the code these days. The following are my understandings.

I think there are two important entries of the algorithm, one is the __call__ of Solver, and the other is the __call__ of CTypeGenerator. Solver solves constraints and the result is saved in Sketch. CTypeGenerator is responsible for recursively converting each Sketch into a primitive type, array type, struct type, or pointer type to them (in CTypeGenerator.c_type_from_nodeset).

I guess, after the __call__ of Solver is invoked, the Solver should provide a new API to allow the query of the Sketches for all variables (including local variables)?

To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.

By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?

aeflores commented 11 months ago

Hi @am009

I guess, after the call of Solver is invoked, the Solver should provide a new API to allow the query of the Sketches for all variables (including local variables)?

That's right.

To solve the infinite nodes problem caused by recursive types, retypd uses a method related to push-down systems. Is this part of knowledge related to papers like "Synchronized Pushdown Systems for Pointer and Data-Flow Analysis"? I wonder how the author came up with the idea and probably I need to learn more to understand the paper further.

I think you don't have to get into those details to get this working. A good place to look at is https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L808-L847

The basic steps for "solving" a procedure (or a scc with mutually recursive procedures) are the following:

  1. You get the constraints for the procedure https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L809 (these are the ones that you posted above)
  2. You get constraints from the procedure prototype (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L817) and from the prototypes of the callees (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L817)
  3. Then you call infer_shapes to generate the "skeleton" of the sketches for whatever variables you care about https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L830
  4. Then you call generate_primitive_constraints (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L836) and add_constraints (https://github.com/GrammaTech/retypd/blob/master/src/solver.py#L845) to populate the sketches nodes with primitive type information.

So let's say you want to get a sketch for a set of variables vars in a procedure proc, you would do something like this:

    def get_type_of_variables(self,
                            sketches_map: Dict[DerivedTypeVariable, Sketch],
                            type_schemes: Dict[DerivedTypeVariable, ConstraintSet],
                            proc:DerivedTypeVariable, 
                            vars:Set[DerivedTypeVariable]):
        constraints = self.program.proc_constraints.get(
                        proc, ConstraintSet()
                    )
        callees = set(networkx.DiGraph(self.program.callgraph).successors(proc))
        fresh_var_factory = FreshVarFactory()
        constraints |= Solver.instantiate_calls(
            callees,
            constraints,
            sketches_map,
            type_schemes,
            fresh_var_factory,
        )
        constraints |= sketches_map[proc].instantiate_sketch(
            proc, fresh_var_factory
        )

        var_sketches = self.infer_shapes(
            vars,
            self.program.types,
            constraints,
        )
        for var in vars:
            primitive_constraints = self._generate_primitive_constraints(
                constraints,
                frozenset({var}),
                self.program.types.internal_types)

            var_sketches[var].add_constraints(
                primitive_constraints
            )
        return var_sketches

NOTE I haven't actually run the code, it probably won't run as is, but hopefully it gives you an idea of what needs to happen.

In your example, you would call it with e.g. proc = "memset" and (for example) vars = {"v_1135","v_1132"} (it would actually be vars= { DerivedTypeVariable(var) for var in ["v_1135","v_1132"]} and same for proc). sketches_map and type_schemes would be the ones returned by the call method. Once you have sketches, you can obtain the corresponding C types. Hopefully this is enough to get you started.

This code recomputes a lot of stuff, doing that more efficiently would probably require more extensive changes.

By the way, because retypd may spend a lot of time on large programs, can we speed it up by using a faster interpreter other than CPython, like PyPy?

I think we tried running it with PyPy at some point, and it worked... or it was close to working, so yes that is definitely an option. There are also plenty of optimization opportunities in this code (cacheing some of the graphs, better data structures, etc).

aeflores commented 11 months ago

@am009 please note that we require signing a "Contributor License Agreement" https://github.com/GrammaTech/retypd/blob/master/GrammaTech-CLA-retypd.pdf before reviewing and accepting contributions into the official repo (the CLA should be sent to CLA@grammatech.com).

am009 commented 10 months ago

Sorry for the late reply. Busy with other random stuff these days. Now I finally settled down and started to learn something.

Really grateful for the detailed response, and it works. I created a PR for it: https://github.com/GrammaTech/retypd/pull/8 ( I sent a mail with a signed CLA on December 6, 2023.)

Actually, I'm very interested in binary type recovery and trying to do some research in this area, so I would like to learn more about the internals of retypd. I still find the retypd paper difficult to read. I guess I need to learn more about pushdown systems.

I still would like to know, what kind of pushdown systems is it related to? Do you have any recommended learning resources? (or any getting started introduction or tutorial about push-down systems) There is a reference to "Saturation algorithms for model checking pushdown systems." in the retypd paper. Probably it is related? Thanks in advance.

aeflores commented 9 months ago

Hi @am009 , thanks for the PR, I'll try to review it soon.

I still find the retypd paper difficult to read

Me too! @peteraldous and I were planning to write another document/paper revisiting the algorithm, but we never finished it.

I still would like to know, what kind of pushdown systems is it related to?

Despite the name, pushdown systems are not directly used in the implementation explicitly. If you are not doing that yet, I'd recommend focusing on the Arxiv version https://arxiv.org/pdf/1603.05495. Most of the implementation details are in the appendices.

In very rough terms, we have a type system, defined in Figure 3 of the paper, and we need to compute all the "implications" of our initial set of constraints (all the implied type constraints). We can do that by building a transducer that captures all possible type derivations. Appendix D describes how to build that transducer from an initial set of constraints. The construction of the transducer has several steps, and the pushdown system is basically an intermediate step. Again in very rough terms, Appendix D does the following steps:

Constraint set -> Pushdown system -> Transducer

In order to go from pushdown system to transducer, we:

  1. create a graph based on the pushdown system
  2. do saturation (Section D.3)
  3. and shadowing (D4) on the graph.
Constraint set -> Pushdown system ---graph+saturation+shadowing--> Tranducer

This implementation just goes from the constraint set to the graph directly. Saturation and shadowing in the implementation can be found here respectively:

aeflores commented 9 months ago

I sent a mail with a signed CLA on December 6, 2023.

Hi @am009 , It looks like we have not received your email. Would you mind sending it again?

am009 commented 9 months ago

Hi @am009 , It looks like we have not received your email. Would you mind sending it again?

OK, I have sent it again. image

am009 commented 5 months ago

Finally, I can understand most of the algorithm now, after reading "Saturation Algorithms for Model-checking Pushdown Systems" and "Advanced Topics in Types and Programming Languages" Chapter 10. Thanks for your previous explanations. The code is also very helpful. Now I am trying to reimplement the algorithm in Rust to have a better understanding of the algorithm.

am009 commented 5 months ago

I found a potential issue related to the algorithm, but it can also be a misunderstanding:

https://github.com/GrammaTech/retypd/blob/8f7f72be9a567731bb82636cc91d70a3551050bf/src/solver.py#L707-L711

There is an InferShapes function call during the simplification of constraints (InferProcTypes). I think the inferring of the sketches should be done in the next stage in the function InferTypes(while getting the real types), not here (during simplifying constraints).

Also, there are two arguments for InferShapes, the second argument is the main output map that maps variables to sketches. But here, it passes an empty set to it, which means the result is not used at all? So, I think the InferShapes call in the InferProcTypes is purely to apply the additive constraints. If there are no additive constraints, then the InferShapes call can be omitted. So the above call to InferShapes is not necessary?